* [RFC PATCH net-next v4 1/2] macb: Add 1588 support in Cadence GEM.
From: Andrei Pistirica @ 2016-12-14 12:56 UTC (permalink / raw)
To: netdev, linux-kernel, linux-arm-kernel, davem, nicolas.ferre,
harinikatakamlinux, harini.katakam
Cc: boris.brezillon, rafalo, alexandre.belloni, michals,
Andrei Pistirica, tbultel, anirudh, punnaia, richardcochran
Cadence GEM provides a 102 bit time counter with 48 bits for seconds,
30 bits for nsecs and 24 bits for sub-nsecs to control 1588 timestamping.
This patch does the following:
- Registers to ptp clock framework
- Timer initialization is done by writing time of day to the timer counter.
- ns increment register is programmed as NSEC_PER_SEC/tsu-clock-rate.
For a 16 bit subns precision, the subns increment equals
remainder of (NS_PER_SEC/TSU_CLK) * (2^16).
- Timestamps are obtained from the TX/RX PTP event/PEER registers.
The timestamp obtained thus is updated in skb for upper layers to access.
- The drivers register functions with ptp to perform time and frequency
adjustment.
- Time adjustment is done by writing to the 1558_ADJUST register.
The controller will read the delta in this register and update the timer
counter register. Alternatively, for large time offset adjustments,
the driver reads the secs and nsecs counter values, adds/subtracts the
delta and updates the timer counter.
- Frequency is adjusted by adjusting addend (8bit nanosecond increment) and
addendsub (16bit increment nanosecond fractions).
The 102bit counter is incremented at nominal frequency with addend and
addendsub values. Each period addend and addendsub values are adjusted
based on ppm drift.
Signed-off-by: Andrei Pistirica <andrei.pistirica@microchip.com>
Signed-off-by: Harini Katakam <harinik@xilinx.com>
---
Patch history:
Version 1:
This patch is based on original Harini's patch, implemented in a
separate file to ease the review/maintanance and integration with
other platforms (e.g. Zynq Ultrascale+ MPSoC).
Feature was tested on SAMA5D2 platform using ptp4l v1.6 from linuxptp
project and also with ptpd2 version 2.3.1. PTP was tested over
IPv4,IPv6 and 802.3 protocols.
In case that macb is compiled as a module, it has been renamed to
cadence-macb.ko to avoid naming confusion in Makefile.
Version 2 modifications:
- bitfields for TSU are named according to SAMA5D2 data sheet
- identify GEM-PTP support based on platform capability
- add spinlock for TSU access
- change macb_ptp_adjfreq and use fewer 64bit divisions
Version 3 modifications:
- new adjfine api with one 64 division for frequency adjustment
(based on Richard's input)
- add maximum adjustment frequency (ppb) based on nominal frequency
- per platform PTP configuration
- cosmetic changes
Note 1: Kbuild uses "select" instead of "imply", and the macb maintainer agreed
to make the change when it will be available in net-next.
Version 4 modifications:
- update adjfine for a better approximation
- add maximum adjustment frequency callback to PTP platform configuraion
Note 1: This driver does not support GEM-GXL!
Note 2: Patch on net-next, on December 14th.
drivers/net/ethernet/cadence/Kconfig | 10 +-
drivers/net/ethernet/cadence/Makefile | 8 +-
drivers/net/ethernet/cadence/macb.h | 118 ++++++++++
drivers/net/ethernet/cadence/macb_ptp.c | 366 ++++++++++++++++++++++++++++++++
4 files changed, 500 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ethernet/cadence/macb_ptp.c
diff --git a/drivers/net/ethernet/cadence/Kconfig b/drivers/net/ethernet/cadence/Kconfig
index f0bcb15..ebbc65f 100644
--- a/drivers/net/ethernet/cadence/Kconfig
+++ b/drivers/net/ethernet/cadence/Kconfig
@@ -29,6 +29,14 @@ config MACB
support for the MACB/GEM chip.
To compile this driver as a module, choose M here: the module
- will be called macb.
+ will be called cadence-macb.
+
+config MACB_USE_HWSTAMP
+ bool "Use IEEE 1588 hwstamp"
+ depends on MACB
+ default y
+ select PTP_1588_CLOCK
+ ---help---
+ Enable IEEE 1588 Precision Time Protocol (PTP) support for MACB.
endif # NET_CADENCE
diff --git a/drivers/net/ethernet/cadence/Makefile b/drivers/net/ethernet/cadence/Makefile
index 91f79b1..4402d42 100644
--- a/drivers/net/ethernet/cadence/Makefile
+++ b/drivers/net/ethernet/cadence/Makefile
@@ -2,4 +2,10 @@
# Makefile for the Atmel network device drivers.
#
-obj-$(CONFIG_MACB) += macb.o
+cadence-macb-y := macb.o
+
+ifeq ($(CONFIG_MACB_USE_HWSTAMP),y)
+cadence-macb-y += macb_ptp.o
+endif
+
+obj-$(CONFIG_MACB) += cadence-macb.o
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index d67adad..e65e985 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -10,6 +10,9 @@
#ifndef _MACB_H
#define _MACB_H
+#include <linux/ptp_clock.h>
+#include <linux/ptp_clock_kernel.h>
+
#define MACB_GREGS_NBR 16
#define MACB_GREGS_VERSION 2
#define MACB_MAX_QUEUES 8
@@ -131,6 +134,20 @@
#define GEM_RXIPCCNT 0x01a8 /* IP header Checksum Error Counter */
#define GEM_RXTCPCCNT 0x01ac /* TCP Checksum Error Counter */
#define GEM_RXUDPCCNT 0x01b0 /* UDP Checksum Error Counter */
+#define GEM_TISUBN 0x01bc /* 1588 Timer Increment Sub-ns */
+#define GEM_TSH 0x01c0 /* 1588 Timer Seconds High */
+#define GEM_TSL 0x01d0 /* 1588 Timer Seconds Low */
+#define GEM_TN 0x01d4 /* 1588 Timer Nanoseconds */
+#define GEM_TA 0x01d8 /* 1588 Timer Adjust */
+#define GEM_TI 0x01dc /* 1588 Timer Increment */
+#define GEM_EFTSL 0x01e0 /* PTP Event Frame Tx Seconds Low */
+#define GEM_EFTN 0x01e4 /* PTP Event Frame Tx Nanoseconds */
+#define GEM_EFRSL 0x01e8 /* PTP Event Frame Rx Seconds Low */
+#define GEM_EFRN 0x01ec /* PTP Event Frame Rx Nanoseconds */
+#define GEM_PEFTSL 0x01f0 /* PTP Peer Event Frame Tx Secs Low */
+#define GEM_PEFTN 0x01f4 /* PTP Peer Event Frame Tx Ns */
+#define GEM_PEFRSL 0x01f8 /* PTP Peer Event Frame Rx Sec Low */
+#define GEM_PEFRN 0x01fc /* PTP Peer Event Frame Rx Ns */
#define GEM_DCFG1 0x0280 /* Design Config 1 */
#define GEM_DCFG2 0x0284 /* Design Config 2 */
#define GEM_DCFG3 0x0288 /* Design Config 3 */
@@ -174,6 +191,7 @@
#define MACB_NCR_TPF_SIZE 1
#define MACB_TZQ_OFFSET 12 /* Transmit zero quantum pause frame */
#define MACB_TZQ_SIZE 1
+#define MACB_SRTSM_OFFSET 15
/* Bitfields in NCFGR */
#define MACB_SPD_OFFSET 0 /* Speed */
@@ -319,6 +337,32 @@
#define MACB_PTZ_SIZE 1
#define MACB_WOL_OFFSET 14 /* Enable wake-on-lan interrupt */
#define MACB_WOL_SIZE 1
+#define MACB_DRQFR_OFFSET 18 /* PTP Delay Request Frame Received */
+#define MACB_DRQFR_SIZE 1
+#define MACB_SFR_OFFSET 19 /* PTP Sync Frame Received */
+#define MACB_SFR_SIZE 1
+#define MACB_DRQFT_OFFSET 20 /* PTP Delay Request Frame Transmitted */
+#define MACB_DRQFT_SIZE 1
+#define MACB_SFT_OFFSET 21 /* PTP Sync Frame Transmitted */
+#define MACB_SFT_SIZE 1
+#define MACB_PDRQFR_OFFSET 22 /* PDelay Request Frame Received */
+#define MACB_PDRQFR_SIZE 1
+#define MACB_PDRSFR_OFFSET 23 /* PDelay Response Frame Received */
+#define MACB_PDRSFR_SIZE 1
+#define MACB_PDRQFT_OFFSET 24 /* PDelay Request Frame Transmitted */
+#define MACB_PDRQFT_SIZE 1
+#define MACB_PDRSFT_OFFSET 25 /* PDelay Response Frame Transmitted */
+#define MACB_PDRSFT_SIZE 1
+#define MACB_SRI_OFFSET 26 /* TSU Seconds Register Increment */
+#define MACB_SRI_SIZE 1
+
+/* Timer increment fields */
+#define MACB_TI_CNS_OFFSET 0
+#define MACB_TI_CNS_SIZE 8
+#define MACB_TI_ACNS_OFFSET 8
+#define MACB_TI_ACNS_SIZE 8
+#define MACB_TI_NIT_OFFSET 16
+#define MACB_TI_NIT_SIZE 8
/* Bitfields in MAN */
#define MACB_DATA_OFFSET 0 /* data */
@@ -386,6 +430,17 @@
#define GEM_PBUF_LSO_OFFSET 27
#define GEM_PBUF_LSO_SIZE 1
+/* Bitfields in TISUBN */
+#define GEM_SUBNSINCR_OFFSET 0
+#define GEM_SUBNSINCR_SIZE 16
+
+/* Bitfields in TI */
+#define GEM_NSINCR_OFFSET 0
+#define GEM_NSINCR_SIZE 8
+
+/* Bitfields in ADJ */
+#define GEM_ADDSUB_OFFSET 31
+#define GEM_ADDSUB_SIZE 1
/* Constants for CLK */
#define MACB_CLK_DIV8 0
#define MACB_CLK_DIV16 1
@@ -417,6 +472,7 @@
#define MACB_CAPS_GIGABIT_MODE_AVAILABLE 0x20000000
#define MACB_CAPS_SG_DISABLED 0x40000000
#define MACB_CAPS_MACB_IS_GEM 0x80000000
+#define MACB_CAPS_GEM_HAS_PTP 0x00000020
/* LSO settings */
#define MACB_LSO_UFO_ENABLE 0x01
@@ -782,6 +838,20 @@ struct macb_or_gem_ops {
int (*mog_rx)(struct macb *bp, int budget);
};
+/* MACB-PTP interface: adapt to platform needs and GEM (e.g. GXL). */
+struct macb_ptp_info {
+ void (*ptp_init)(struct net_device *ndev);
+ void (*ptp_remove)(struct net_device *ndev);
+ s32 (*get_ptp_max_adj)(void);
+ unsigned int (*get_tsu_rate)(struct macb *bp);
+ int (*get_ts_info)(struct net_device *dev,
+ struct ethtool_ts_info *info);
+ int (*get_hwtst)(struct net_device *netdev,
+ struct ifreq *ifr);
+ int (*set_hwtst)(struct net_device *netdev,
+ struct ifreq *ifr, int cmd);
+};
+
struct macb_config {
u32 caps;
unsigned int dma_burst_length;
@@ -874,11 +944,59 @@ struct macb {
unsigned int jumbo_max_len;
u32 wol;
+
+ struct macb_ptp_info *ptp_info;
+#ifdef CONFIG_MACB_USE_HWSTAMP
+ bool hwts_tx_en;
+ bool hwts_rx_en;
+ spinlock_t tsu_clk_lock; /* gem tsu clock locking */
+ unsigned int tsu_rate;
+
+ struct ptp_clock *ptp_clock;
+ struct ptp_clock_info ptp_caps;
+ u32 ns_incr;
+ u32 subns_incr;
+#endif
};
+#ifdef CONFIG_MACB_USE_HWSTAMP
+void gem_ptp_init(struct net_device *ndev);
+void gem_ptp_remove(struct net_device *ndev);
+void gem_ptp_txstamp(struct macb *bp, struct sk_buff *skb);
+void gem_ptp_rxstamp(struct macb *bp, struct sk_buff *skb);
+
+static inline void gem_ptp_do_txstamp(struct macb *bp, struct sk_buff *skb)
+{
+ if (!bp->hwts_tx_en)
+ return;
+
+ return gem_ptp_txstamp(bp, skb);
+}
+
+static inline void gem_ptp_do_rxstamp(struct macb *bp, struct sk_buff *skb)
+{
+ if (!bp->hwts_rx_en)
+ return;
+
+ return gem_ptp_rxstamp(bp, skb);
+}
+
+#else
+static inline void gem_ptp_init(struct net_device *ndev) { }
+static inline void gem_ptp_remove(struct net_device *ndev) { }
+
+static inline void gem_ptp_do_txstamp(struct macb *bp, struct sk_buff *skb) { }
+static inline void gem_ptp_do_rxstamp(struct macb *bp, struct sk_buff *skb) { }
+#endif
+
static inline bool macb_is_gem(struct macb *bp)
{
return !!(bp->caps & MACB_CAPS_MACB_IS_GEM);
}
+static inline bool gem_has_ptp(struct macb *bp)
+{
+ return !!(bp->caps & MACB_CAPS_GEM_HAS_PTP);
+}
+
#endif /* _MACB_H */
diff --git a/drivers/net/ethernet/cadence/macb_ptp.c b/drivers/net/ethernet/cadence/macb_ptp.c
new file mode 100644
index 0000000..6121b2a
--- /dev/null
+++ b/drivers/net/ethernet/cadence/macb_ptp.c
@@ -0,0 +1,366 @@
+/*
+ * 1588 PTP support for GEM device.
+ *
+ * Copyright (C) 2016 Microchip Technology
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/clk.h>
+#include <linux/device.h>
+#include <linux/etherdevice.h>
+#include <linux/platform_device.h>
+#include <linux/time64.h>
+#include <linux/ptp_classify.h>
+#include <linux/if_ether.h>
+#include <linux/if_vlan.h>
+#include <linux/net_tstamp.h>
+
+#include "macb.h"
+
+#define GEM_PTP_TIMER_NAME "gem-ptp-timer"
+
+static inline void gem_tsu_get_time(struct macb *bp,
+ struct timespec64 *ts)
+{
+ u64 sec, sech, secl;
+
+ spin_lock(&bp->tsu_clk_lock);
+
+ /* GEM's internal time */
+ sech = gem_readl(bp, TSH);
+ secl = gem_readl(bp, TSL);
+ ts->tv_nsec = gem_readl(bp, TN);
+ ts->tv_sec = (sech << 32) | secl;
+
+ /* minimize error */
+ sech = gem_readl(bp, TSH);
+ secl = gem_readl(bp, TSL);
+ sec = (sech << 32) | secl;
+ if (ts->tv_sec != sec) {
+ ts->tv_sec = sec;
+ ts->tv_nsec = gem_readl(bp, TN);
+ }
+
+ spin_unlock(&bp->tsu_clk_lock);
+}
+
+static inline void gem_tsu_set_time(struct macb *bp,
+ const struct timespec64 *ts)
+{
+ u32 ns, sech, secl;
+ s64 word_mask = 0xffffffff;
+
+ sech = (u32)ts->tv_sec;
+ secl = (u32)ts->tv_sec;
+ ns = ts->tv_nsec;
+ if (ts->tv_sec > word_mask)
+ sech = (ts->tv_sec >> 32);
+
+ spin_lock(&bp->tsu_clk_lock);
+
+ /* TSH doesn't latch the time and no atomicity! */
+ gem_writel(bp, TN, 0); /* clear to avoid overflow */
+ gem_writel(bp, TSH, sech);
+ gem_writel(bp, TSL, secl);
+ gem_writel(bp, TN, ns);
+
+ spin_unlock(&bp->tsu_clk_lock);
+}
+
+static int gem_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
+{
+ struct macb *bp = container_of(ptp, struct macb, ptp_caps);
+ u32 word, diff;
+ u64 adj, rate;
+ int neg_adj = 0;
+
+ if (scaled_ppm < 0) {
+ neg_adj = 1;
+ scaled_ppm = -scaled_ppm;
+ }
+ rate = scaled_ppm;
+
+ /* word: unused(8bit) | ns(8bit) | fractions(16bit) */
+ word = (bp->ns_incr << 16) + bp->subns_incr;
+
+ adj = word;
+ adj *= rate;
+ adj += 500000UL << 16;
+ adj >>= 16; /* remove fractions */
+ diff = div_u64(adj, 1000000UL);
+ word = neg_adj ? word - diff : word + diff;
+
+ spin_lock(&bp->tsu_clk_lock);
+
+ gem_writel(bp, TISUBN, GEM_BF(SUBNSINCR, (word & 0xffff)));
+ gem_writel(bp, TI, GEM_BF(NSINCR, (word >> 16)));
+
+ spin_unlock(&bp->tsu_clk_lock);
+ return 0;
+}
+
+static int gem_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+ struct macb *bp = container_of(ptp, struct macb, ptp_caps);
+ struct timespec64 now, then = ns_to_timespec64(delta);
+ u32 adj, sign = 0;
+
+ if (delta < 0) {
+ delta = -delta;
+ sign = 1;
+ }
+
+ if (delta > 0x3FFFFFFF) {
+ gem_tsu_get_time(bp, &now);
+
+ if (sign)
+ now = timespec64_sub(now, then);
+ else
+ now = timespec64_add(now, then);
+
+ gem_tsu_set_time(bp, (const struct timespec64 *)&now);
+ } else {
+ adj = delta;
+ if (sign)
+ adj |= GEM_BIT(ADDSUB);
+
+ gem_writel(bp, TA, adj);
+ }
+
+ return 0;
+}
+
+static int gem_ptp_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
+{
+ struct macb *bp = container_of(ptp, struct macb, ptp_caps);
+
+ gem_tsu_get_time(bp, ts);
+
+ return 0;
+}
+
+static int gem_ptp_settime(struct ptp_clock_info *ptp,
+ const struct timespec64 *ts)
+{
+ struct macb *bp = container_of(ptp, struct macb, ptp_caps);
+
+ gem_tsu_set_time(bp, ts);
+
+ return 0;
+}
+
+static int gem_ptp_enable(struct ptp_clock_info *ptp,
+ struct ptp_clock_request *rq, int on)
+{
+ return -EOPNOTSUPP;
+}
+
+static struct ptp_clock_info gem_ptp_caps_template = {
+ .owner = THIS_MODULE,
+ .name = GEM_PTP_TIMER_NAME,
+ .max_adj = 0,
+ .n_alarm = 0,
+ .n_ext_ts = 0,
+ .n_per_out = 0,
+ .n_pins = 0,
+ .pps = 0,
+ .adjfine = gem_ptp_adjfine,
+ .adjtime = gem_ptp_adjtime,
+ .gettime64 = gem_ptp_gettime,
+ .settime64 = gem_ptp_settime,
+ .enable = gem_ptp_enable,
+};
+
+static void gem_ptp_init_timer(struct macb *bp)
+{
+ struct timespec64 now;
+ u32 rem = 0;
+
+ getnstimeofday64(&now);
+ gem_tsu_set_time(bp, (const struct timespec64 *)&now);
+
+ bp->ns_incr = div_u64_rem(NSEC_PER_SEC, bp->tsu_rate, &rem);
+ if (rem) {
+ u64 adj = rem;
+
+ adj <<= 16; /* 16 bits nsec fragments */
+ bp->subns_incr = div_u64(adj, bp->tsu_rate);
+ } else {
+ bp->subns_incr = 0;
+ }
+
+ gem_writel(bp, TISUBN, GEM_BF(SUBNSINCR, bp->subns_incr));
+ gem_writel(bp, TI, GEM_BF(NSINCR, bp->ns_incr));
+ gem_writel(bp, TA, 0);
+}
+
+static void gem_ptp_clear_timer(struct macb *bp)
+{
+ bp->ns_incr = 0;
+ bp->subns_incr = 0;
+
+ gem_writel(bp, TISUBN, GEM_BF(SUBNSINCR, 0));
+ gem_writel(bp, TI, GEM_BF(NSINCR, 0));
+ gem_writel(bp, TA, 0);
+}
+
+/* While GEM can timestamp PTP packets, it does not mark the RX descriptor
+ * to identify them. UDP packets must be parsed to identify PTP packets.
+ *
+ * Note: Inspired from drivers/net/ethernet/ti/cpts.c
+ */
+static int gem_get_ptp_peer(struct sk_buff *skb, int ptp_class)
+{
+ unsigned int offset = 0;
+ u8 *msgtype, *data = skb->data;
+
+ /* PTP frames are rare! */
+ if (likely(ptp_class == PTP_CLASS_NONE))
+ return -1;
+
+ if (ptp_class & PTP_CLASS_VLAN)
+ offset += VLAN_HLEN;
+
+ switch (ptp_class & PTP_CLASS_PMASK) {
+ case PTP_CLASS_IPV4:
+ offset += ETH_HLEN + IPV4_HLEN(data + offset) + UDP_HLEN;
+ break;
+ case PTP_CLASS_IPV6:
+ offset += ETH_HLEN + IP6_HLEN + UDP_HLEN;
+ break;
+ case PTP_CLASS_L2:
+ offset += ETH_HLEN;
+ break;
+
+ /* something went wrong! */
+ default:
+ return -1;
+ }
+
+ if (skb->len + ETH_HLEN < offset + OFF_PTP_SEQUENCE_ID)
+ return -1;
+
+ if (unlikely(ptp_class & PTP_CLASS_V1))
+ msgtype = data + offset + OFF_PTP_CONTROL;
+ else
+ msgtype = data + offset;
+
+ return (*msgtype) & 0x2;
+}
+
+static void gem_ptp_tx_hwtstamp(struct macb *bp, struct sk_buff *skb,
+ int peer_ev)
+{
+ struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+ struct timespec64 ts;
+ u64 ns;
+
+ /* PTP Peer Event Frame packets */
+ if (peer_ev) {
+ ts.tv_sec = gem_readl(bp, PEFTSL);
+ ts.tv_nsec = gem_readl(bp, PEFTN);
+
+ /* PTP Event Frame packets */
+ } else {
+ ts.tv_sec = gem_readl(bp, EFTSL);
+ ts.tv_nsec = gem_readl(bp, EFTN);
+ }
+ ns = timespec64_to_ns(&ts);
+
+ memset(shhwtstamps, 0, sizeof(struct skb_shared_hwtstamps));
+ shhwtstamps->hwtstamp = ns_to_ktime(ns);
+ skb_tstamp_tx(skb, skb_hwtstamps(skb));
+}
+
+static void gem_ptp_rx_hwtstamp(struct macb *bp, struct sk_buff *skb,
+ int peer_ev)
+{
+ struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+ struct timespec64 ts;
+ u64 ns;
+
+ if (peer_ev) {
+ /* PTP Peer Event Frame packets */
+ ts.tv_sec = gem_readl(bp, PEFRSL);
+ ts.tv_nsec = gem_readl(bp, PEFRN);
+ } else {
+ /* PTP Event Frame packets */
+ ts.tv_sec = gem_readl(bp, EFRSL);
+ ts.tv_nsec = gem_readl(bp, EFRN);
+ }
+ ns = timespec64_to_ns(&ts);
+
+ memset(shhwtstamps, 0, sizeof(struct skb_shared_hwtstamps));
+ shhwtstamps->hwtstamp = ns_to_ktime(ns);
+}
+
+/* no static, GEM PTP interface functions */
+void gem_ptp_txstamp(struct macb *bp, struct sk_buff *skb)
+{
+ if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
+ int class = ptp_classify_raw(skb);
+ int peer;
+
+ peer = gem_get_ptp_peer(skb, class);
+ if (peer < 0)
+ return;
+
+ /* Timestamp this packet */
+ gem_ptp_tx_hwtstamp(bp, skb, peer);
+ }
+}
+
+void gem_ptp_rxstamp(struct macb *bp, struct sk_buff *skb)
+{
+ int class, peer;
+
+ __skb_push(skb, ETH_HLEN);
+ class = ptp_classify_raw(skb);
+ __skb_pull(skb, ETH_HLEN);
+
+ peer = gem_get_ptp_peer(skb, class);
+ if (peer < 0)
+ return;
+
+ gem_ptp_rx_hwtstamp(bp, skb, peer);
+}
+
+void gem_ptp_init(struct net_device *ndev)
+{
+ struct macb *bp = netdev_priv(ndev);
+
+ spin_lock_init(&bp->tsu_clk_lock);
+ bp->ptp_caps = gem_ptp_caps_template;
+
+ /* nominal frequency and maximum adjustment in ppb */
+ bp->tsu_rate = bp->ptp_info->get_tsu_rate(bp);
+ bp->ptp_caps.max_adj = bp->ptp_info->get_ptp_max_adj();
+
+ gem_ptp_init_timer(bp);
+
+ bp->ptp_clock = ptp_clock_register(&bp->ptp_caps, NULL);
+ if (IS_ERR(&bp->ptp_clock)) {
+ bp->ptp_clock = NULL;
+ pr_err("ptp clock register failed\n");
+ return;
+ }
+
+ dev_info(&bp->pdev->dev, "%s ptp clock registered.\n",
+ GEM_PTP_TIMER_NAME);
+}
+
+void gem_ptp_remove(struct net_device *ndev)
+{
+ struct macb *bp = netdev_priv(ndev);
+
+ if (bp->ptp_clock)
+ ptp_clock_unregister(bp->ptp_clock);
+
+ gem_ptp_clear_timer(bp);
+
+ dev_info(&bp->pdev->dev, "%s ptp clock unregistered.\n",
+ GEM_PTP_TIMER_NAME);
+}
--
2.7.4
^ permalink raw reply related
* Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: Jason A. Donenfeld @ 2016-12-14 12:53 UTC (permalink / raw)
To: David Laight
Cc: Netdev, kernel-hardening, Andi Kleen, LKML,
Linux Crypto Mailing List
In-Reply-To: <20161214035927.30004-3-Jason@zx2c4.com>
Hi David,
On Wed, Dec 14, 2016 at 10:51 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Jason A. Donenfeld
>> Sent: 14 December 2016 00:17
>> This gives a clear speed and security improvement. Rather than manually
>> filling MD5 buffers, we simply create a layout by a simple anonymous
>> struct, for which gcc generates rather efficient code.
> ...
>> + const struct {
>> + struct in6_addr saddr;
>> + struct in6_addr daddr;
>> + __be16 sport;
>> + __be16 dport;
>> + } __packed combined = {
>> + .saddr = *(struct in6_addr *)saddr,
>> + .daddr = *(struct in6_addr *)daddr,
>> + .sport = sport,
>> + .dport = dport
>> + };
>
> You need to look at the effect of marking this (and the other)
> structures 'packed' on architectures like sparc64.
In all current uses of __packed in the code, I think the impact is
precisely zero, because all structures have members in descending
order of size, with each member being a perfect multiple of the one
below it. The __packed is therefore just there for safety, in case
somebody comes in and screws everything up by sticking a u8 in
between. In that case, it wouldn't be desirable to hash the structure
padding bits. In the worst case, I don't believe the impact would be
worse than a byte-by-byte memcpy, which is what the old code did. But
anyway, these structures are already naturally packed anyway, so the
present impact is nil.
Jason
^ permalink raw reply
* Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: Jason A. Donenfeld @ 2016-12-14 12:46 UTC (permalink / raw)
To: David Laight
Cc: Netdev, kernel-hardening, Jean-Philippe Aumasson, LKML,
Linux Crypto Mailing List, Daniel J . Bernstein, Linus Torvalds,
Eric Biggers
In-Reply-To: <20161214035927.30004-1-Jason@zx2c4.com>
Hi David,
On Wed, Dec 14, 2016 at 10:56 AM, David Laight <David.Laight@aculab.com> wrote:
> ...
>> +u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN])
> ...
>> + u64 k0 = get_unaligned_le64(key);
>> + u64 k1 = get_unaligned_le64(key + sizeof(u64));
> ...
>> + m = get_unaligned_le64(data);
>
> All these unaligned accesses are going to get expensive on architectures
> like sparc64.
Yes, the unaligned accesses aren't pretty. Since in pretty much all
use cases thus far, the data can easily be made aligned, perhaps it
makes sense to create siphash24() and siphash24_unaligned(). Any
thoughts on doing something like that?
Jason
^ permalink raw reply
* Re: [PATCH] vsock: lookup and setup guest_cid inside vhost_vsock_lock
From: Stefan Hajnoczi @ 2016-12-14 12:39 UTC (permalink / raw)
To: Gao feng; +Cc: kvm, netdev, mst
In-Reply-To: <1481714676-18555-1-git-send-email-omarapazanadi@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 414 bytes --]
On Wed, Dec 14, 2016 at 07:24:36PM +0800, Gao feng wrote:
> Multi vsocks may setup the same cid at the same time.
>
> Signed-off-by: Gao feng <omarapazanadi@gmail.com>
> ---
> drivers/vhost/vsock.c | 25 +++++++++++++++++--------
> 1 file changed, 17 insertions(+), 8 deletions(-)
Good catch, a classic time-of-check-to-time-of-use race condition.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
^ permalink raw reply
* Re: Re: Synopsys Ethernet QoS
From: Pavel Machek @ 2016-12-14 11:54 UTC (permalink / raw)
To: Niklas Cassel
Cc: Joao Pinto, Florian Fainelli, Andy Shevchenko, David Miller,
Giuseppe CAVALLARO, larper, rabinv, netdev, CARLOS.PALMINHA,
Jie.Deng1, Stephen Warren
In-Reply-To: <1d445ec1-deb8-6e36-39c4-6813c446095f@axis.com>
[-- Attachment #1: Type: text/plain, Size: 1446 bytes --]
Hi!
> There are some performance problems with the stmmac driver though:
>
> When running iperf3 with 3 streams:
> iperf3 -c 192.168.0.90 -P 3 -t 30
> iperf3 -c 192.168.0.90 -P 3 -t 30 -R
>
> I get really bad fairness between the streams.
>
> This appears to be an issue with how TX IRQ coalescing is implemented in stmmac.
> Disabling TX IRQ coalescing in the stmmac driver makes the problem go away.
Yes, I'm hitting the same problem
https://lkml.org/lkml/2016/12/11/90
> Also netperf TCP_RR and UDP_RR gives really bad results compared to the
> dwceqos driver (without IRQ coalescing).
> 2000 transactions/sec vs 9000 transactions/sec.
> Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac
> gives the same performance. I guess it's a trade off, low CPU usage
> vs low latency, so I don't know how important TCP_RR/UDP_RR really is.
Well... 75% performance hit is a bug, plain and simple, not CPU tradeoff.
> The best thing would be to get a good working TX IRQ coalesce
> implementation with HR timers in stmmac.
Actually it seems that using HR timers is not the only problem, AFAICT
the logic is wrong way around. (But yes, it needs to be HR timer, too,
and probably packet count needs to be reduced, too.)
Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply
* Re: dsa: handling more than 1 cpu port
From: Andrew Lunn @ 2016-12-14 11:00 UTC (permalink / raw)
To: John Crispin; +Cc: Florian Fainelli, netdev@vger.kernel.org
In-Reply-To: <0f82a785-b31c-0428-eb07-a9464a0c5ddd@phrozen.org>
On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
>
>
> On 14/12/2016 11:31, Andrew Lunn wrote:
> > On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
> >> Hi Andrew,
> >>
> >> switches supported by qca8k have 2 gmacs that we can wire an external
> >> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
> >> switch. Thw switch itself is aware of one of the MACs being the cpu port
> >> and expects this to be port/mac0. Using the other will break the
> >> hardware offloading features.
> >
> > Just to be sure here. There is no way to use the second port connected
> > to the CPU as a CPU port?
>
> both macs are considered cpu ports and both allow for the tag to be
> injected. for HW NAT/routing/... offloading to work, the lan ports neet
> to trunk via port0 and not port6 however.
Maybe you can do a restricted version of the generic solution. LAN
ports are mapped to cpu port0. WAN port to cpu port 6?
> > The Marvell chips do allow this. So i developed a proof of concept
> > which had a mapping between cpu ports and slave ports. slave port X
> > should you cpu port y for its traffic. This never got past proof of
> > concept.
> >
> > If this can be made to work for qca8k, i would prefer having this
> > general concept, than specific hacks for pass through.
>
> oh cool, can you send those patches my way please ? how do you configure
> this from userland ? does the cpu port get its on swdev which i just add
> to my lan bridge and then add the 2nd cpu port to the wan bridge ?
https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu
You don't configure anything from userland. Which was actually a
criticism. It is in device tree. But my solution is generic. Having
one WAN port and four bridges LAN ports is a pure marketing
requirement. The hardware will happily do two WAN ports and 3 LAN
ports, for example. And the switch is happy to take traffic for the
WAN port and a LAN port over the same CPU port, and keep the traffic
separate. So we can have some form of load balancing. We are not
limited to 1->1, 1->4, we can do 1->2, 1->3 to increase the overall
performance. And to the user it is all transparent.
This PoC is for the old DSA binding. The new binding makes it easier
to express this. Which is one of the reasons for the new binding.
Andrew
^ permalink raw reply
* [PATCH] vsock: lookup and setup guest_cid inside vhost_vsock_lock
From: Gao feng @ 2016-12-14 11:24 UTC (permalink / raw)
To: kvm, netdev; +Cc: stefanha, mst, Gao feng
Multi vsocks may setup the same cid at the same time.
Signed-off-by: Gao feng <omarapazanadi@gmail.com>
---
drivers/vhost/vsock.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e3b30ea..a08332b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -50,11 +50,10 @@ static u32 vhost_transport_get_local_cid(void)
return VHOST_VSOCK_DEFAULT_HOST_CID;
}
-static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+static struct vhost_vsock *__vhost_vsock_get(u32 guest_cid)
{
struct vhost_vsock *vsock;
- spin_lock_bh(&vhost_vsock_lock);
list_for_each_entry(vsock, &vhost_vsock_list, list) {
u32 other_cid = vsock->guest_cid;
@@ -63,15 +62,24 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
continue;
if (other_cid == guest_cid) {
- spin_unlock_bh(&vhost_vsock_lock);
return vsock;
}
}
- spin_unlock_bh(&vhost_vsock_lock);
return NULL;
}
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+{
+ struct vhost_vsock *vsock;
+
+ spin_lock_bh(&vhost_vsock_lock);
+ vsock = __vhost_vsock_get(guest_cid);
+ spin_unlock_bh(&vhost_vsock_lock);
+
+ return vsock;
+}
+
static void
vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct vhost_virtqueue *vq)
@@ -562,11 +570,12 @@ static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u64 guest_cid)
return -EINVAL;
/* Refuse if CID is already in use */
- other = vhost_vsock_get(guest_cid);
- if (other && other != vsock)
- return -EADDRINUSE;
-
spin_lock_bh(&vhost_vsock_lock);
+ other = __vhost_vsock_get(guest_cid);
+ if (other && other != vsock) {
+ spin_unlock_bh(&vhost_vsock_lock);
+ return -EADDRINUSE;
+ }
vsock->guest_cid = guest_cid;
spin_unlock_bh(&vhost_vsock_lock);
--
2.5.5
^ permalink raw reply related
* Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: Hannes Frederic Sowa @ 2016-12-14 11:21 UTC (permalink / raw)
To: Jason A. Donenfeld, Netdev, kernel-hardening, LKML, linux-crypto
Cc: Jean-Philippe Aumasson, Daniel J . Bernstein, Linus Torvalds,
Eric Biggers
In-Reply-To: <20161214035927.30004-1-Jason@zx2c4.com>
Hello,
On 14.12.2016 04:59, Jason A. Donenfeld wrote:
> SipHash is a 64-bit keyed hash function that is actually a
> cryptographically secure PRF, like HMAC. Except SipHash is super fast,
> and is meant to be used as a hashtable keyed lookup function.
Can you show or cite benchmarks in comparison with jhash? Last time I
looked, especially for short inputs, siphash didn't beat jhash (also on
all the 32 bit devices etc.).
> SipHash isn't just some new trendy hash function. It's been around for a
> while, and there really isn't anything that comes remotely close to
> being useful in the way SipHash is. With that said, why do we need this?
>
> There are a variety of attacks known as "hashtable poisoning" in which an
> attacker forms some data such that the hash of that data will be the
> same, and then preceeds to fill up all entries of a hashbucket. This is
> a realistic and well-known denial-of-service vector.
This pretty much depends on the linearity of the hash function? I don't
think a crypto secure hash function is needed for a hash table. Albeit I
agree that siphash certainly looks good to be used here.
> Linux developers already seem to be aware that this is an issue, and
> various places that use hash tables in, say, a network context, use a
> non-cryptographically secure function (usually jhash) and then try to
> twiddle with the key on a time basis (or in many cases just do nothing
> and hope that nobody notices). While this is an admirable attempt at
> solving the problem, it doesn't actually fix it. SipHash fixes it.
I am pretty sure that SipHash still needs a random key per hash table
also. So far it was only the choice of hash function you are questioning.
> (It fixes it in such a sound way that you could even build a stream
> cipher out of SipHash that would resist the modern cryptanalysis.)
>
> There are a modicum of places in the kernel that are vulnerable to
> hashtable poisoning attacks, either via userspace vectors or network
> vectors, and there's not a reliable mechanism inside the kernel at the
> moment to fix it. The first step toward fixing these issues is actually
> getting a secure primitive into the kernel for developers to use. Then
> we can, bit by bit, port things over to it as deemed appropriate.
Hmm, I tried to follow up with all the HashDoS work and so far didn't
see any HashDoS attacks against the Jenkins/SpookyHash family.
If this is an issue we might need to also put those changes into stable.
> Dozens of languages are already using this internally for their hash
> tables. Some of the BSDs already use this in their kernels. SipHash is
> a widely known high-speed solution to a widely known problem, and it's
> time we catch-up.
Bye,
Hannes
^ permalink raw reply
* Re: dsa: handling more than 1 cpu port
From: John Crispin @ 2016-12-14 11:11 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Florian Fainelli, netdev@vger.kernel.org
In-Reply-To: <20161214110058.GE27370@lunn.ch>
On 14/12/2016 12:00, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
>>
>>
>> On 14/12/2016 11:31, Andrew Lunn wrote:
>>> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
>>>> Hi Andrew,
>>>>
>>>> switches supported by qca8k have 2 gmacs that we can wire an external
>>>> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
>>>> switch. Thw switch itself is aware of one of the MACs being the cpu port
>>>> and expects this to be port/mac0. Using the other will break the
>>>> hardware offloading features.
>>>
>>> Just to be sure here. There is no way to use the second port connected
>>> to the CPU as a CPU port?
>>
>> both macs are considered cpu ports and both allow for the tag to be
>> injected. for HW NAT/routing/... offloading to work, the lan ports neet
>> to trunk via port0 and not port6 however.
>
> Maybe you can do a restricted version of the generic solution. LAN
> ports are mapped to cpu port0. WAN port to cpu port 6?
hardcoding it is exactly what i want to avoid, same as using magic name
matching. the dts should describe the HW and not the usage dictated by
what is printed on the casing of the switch. as you mention below this
is marketing chatter. imho ports should have any name and we should be
able to bridge them as we feel happy and attach the bridges to any cpu
port that we want.
>
>>> The Marvell chips do allow this. So i developed a proof of concept
>>> which had a mapping between cpu ports and slave ports. slave port X
>>> should you cpu port y for its traffic. This never got past proof of
>>> concept.
>>>
>>> If this can be made to work for qca8k, i would prefer having this
>>> general concept, than specific hacks for pass through.
>>
>> oh cool, can you send those patches my way please ? how do you configure
>> this from userland ? does the cpu port get its on swdev which i just add
>> to my lan bridge and then add the 2nd cpu port to the wan bridge ?
>
> https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu
>
> You don't configure anything from userland. Which was actually a
> criticism. It is in device tree. But my solution is generic. Having
> one WAN port and four bridges LAN ports is a pure marketing
> requirement. The hardware will happily do two WAN ports and 3 LAN
> ports, for example. And the switch is happy to take traffic for the
> WAN port and a LAN port over the same CPU port, and keep the traffic
> separate. So we can have some form of load balancing. We are not
> limited to 1->1, 1->4, we can do 1->2, 1->3 to increase the overall
> performance. And to the user it is all transparent.
>
> This PoC is for the old DSA binding. The new binding makes it easier
> to express this. Which is one of the reasons for the new binding.
>
> Andrew
>
i'll have a look at the patches. thanks !
John
^ permalink raw reply
* Re: [PATCH 2/3] selftests: do not require bash to run bpf tests
From: Daniel Borkmann @ 2016-12-14 11:03 UTC (permalink / raw)
To: Rolf Eike Beer, linux-kselftest
Cc: David S. Miller, netdev, Alexei Starovoitov, linux-kernel,
Shuah Khan, Shuah Khan
In-Reply-To: <3057464.DrP1iNtcaF@devpool21>
On 12/14/2016 11:58 AM, Rolf Eike Beer wrote:
> From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001
> From: Rolf Eike Beer <eike-kernel@sf-tec.de>
> Date: Wed, 14 Dec 2016 09:58:12 +0100
> Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests
>
> Nothing in this minimal script seems to require bash. We often run these tests
> on embedded devices where the only shell available is the busybox ash.
>
> Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
^ permalink raw reply
* [PATCH 2/3] selftests: do not require bash to run bpf tests
From: Rolf Eike Beer @ 2016-12-14 10:58 UTC (permalink / raw)
To: linux-kselftest
Cc: Daniel Borkmann, David S. Miller, netdev, Alexei Starovoitov,
linux-kernel, Shuah Khan, Shuah Khan
>From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001
From: Rolf Eike Beer <eike-kernel@sf-tec.de>
Date: Wed, 14 Dec 2016 09:58:12 +0100
Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests
Nothing in this minimal script seems to require bash. We often run these tests
on embedded devices where the only shell available is the busybox ash.
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
---
tools/testing/selftests/bpf/test_kmod.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/test_kmod.sh b/tools/testing/selftests/bpf/test_kmod.sh
index 92e627a..6d58cca 100755
--- a/tools/testing/selftests/bpf/test_kmod.sh
+++ b/tools/testing/selftests/bpf/test_kmod.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/bin/sh
SRC_TREE=../../../../
--
2.8.3
--
Rolf Eike Beer, emlix GmbH, http://www.emlix.com
Fon +49 551 30664-0, Fax +49 551 30664-11
Bertha-von-Suttner-Str. 9, 37085 Göttingen, Germany
Sitz der Gesellschaft: Göttingen, Amtsgericht Göttingen HR B 3160
Geschäftsführung: Heike Jordan, Dr. Uwe Kracke – Ust-IdNr.: DE 205 198 055
emlix – smart embedded open source
^ permalink raw reply related
* Re: net/arp: ARP cache aging failed.
From: YueHaibing @ 2016-12-14 10:54 UTC (permalink / raw)
To: Julian Anastasov, Hannes Frederic Sowa
Cc: Eric Dumazet, David S. Miller, netdev
In-Reply-To: <alpine.LFD.2.11.1611252140120.1888@ja.home.ssi.bg>
On 2016/11/26 4:40, Julian Anastasov wrote:
>
> Hello,
>
> On Fri, 25 Nov 2016, Hannes Frederic Sowa wrote:
>
>> On 25.11.2016 09:18, Julian Anastasov wrote:
>>>
>>> Another option would be to add similar bit to
>>> struct sock (sk_dst_pending_confirm), may be ip_finish_output2
>>> can propagate it to dst_neigh_output via new arg and then to
>>> reset it.
>>
>> I don't understand how this can help? Maybe I understood it wrong?
>
> The idea is, the indication from highest level (TCP) to
> be propagated to lowest level (neigh).
>
>>> Or to change n->confirmed via some new inline sock
>>> function instead of changing dst_neigh_output. At this place
>>> skb->sk is optional, may be we need (skb->sk && dst ==
>>> skb->sk->sk_dst_cache) check. Also, when sk_dst_cache changes
>>> we should clear this flag.
>>
>> In "(skb->sk && dst == skb->sk->sk_dst_cache)" where does dst come from?
>
> This is in case we do not want to propagate indication
> from TCP to tunnels, see below...
>
>> I don't see a possibility besides using mac_header or inner_mac_header
>> to look up the incoming MAC address and confirm that one in the neighbor
>> cache so far (we could try to optimize this case for rt_gateway though).
>
> My idea is as follows:
>
> - TCP instead of calling dst_confirm should call new func
> dst_confirm_sk(sk) where sk->sk_dst_pending_confirm is set,
> not dst->pending_confirm.
>
> - ip_finish_output2: use skb->sk (TCP) to check for
> sk_dst_pending_confirm and update n->confirmed in some
> new inline func
>
> Correct me if I'm wrong but here is how I see the
> situation:
>
> - TCP caches output dst in socket, for example, dst1,
> sets it as skb->dst
>
> - if xfrm tunnels are used then dst1 can point to some tunnel,
> i.e. it is not in all cases the same skb->dst, as seen by
> ip_finish_output2
>
> - netfilter can DNAT and assign different skb->dst
>
> - as result, before now, dst1->pending_confirm is set but
> it is not used properly because ip_finish_output2 uses
> the final skb->dst which can be different or with
> rt_gateway = 0
>
> - considering that tunnels do not use dst_confirm,
> n->confirmed is not changed every time. There are some
> interesting cases:
>
> 1. both dst1 and the final skb->dst point to same
> dst with rt_gateway = 0: n->confirmed is updated but
> as wee see it can be for wrong neigh.
>
> 2. dst1 is different from skb->dst, so n->confirmed is
> not changed. This can happen on DNAT or when using
> tunnel to secure gateway.
>
> - in ip_finish_output2 we have skb->sk (original TCP sk) and
> sk arg (equal to skb->sk or second option is sock of some UDP
> tunnel, etc). The idea is to use skb->sk->sk_dst_pending_confirm,
> i.e. highest level sk because the sk arg, if different, does not
> have such flag set (tunnels do not call dst_confirm).
>
> - ip_finish_output2 should call new func dst_neigh_confirm_sk
> (which has nothing related to dst, hm... better name is needed):
>
> if (!IS_ERR(neigh)) {
> - int res = dst_neigh_output(dst, neigh, skb);
> + int res;
> +
> + dst_neigh_confirm_sk(skb->sk, neigh);
> + res = dst_neigh_output(dst, neigh, skb);
>
> which should do something like this:
>
> if (sk && sk->sk_dst_pending_confirm) {
> unsigned long now = jiffies;
>
> sk->sk_dst_pending_confirm = 0;
> /* avoid dirtying neighbour */
> if (n->confirmed != now)
> n->confirmed = now;
> }
>
> Additional dst == skb->sk->sk_dst_cache check will
> not propagate the indication on DNAT/tunnel. For me,
> it is better without such check.
>
> So, the idea is to move TCP and other similar
> users to the new dst_confirm_sk() method. If other
> dst_confirm() users are left, they should be checked
> if dsts with rt_gateway = 0 can be wrongly used.
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
>
> .
>
Sorry for so late.
Based on your ideas, I make a patch and test it for a while.
It works for me.
>From dc033fe4bac8931590e15774a298f04ccea8ed4c Mon Sep 17 00:00:00 2001
From: YueHabing <yuehaibing@huawei.com>
Date: Wed, 14 Dec 2016 02:43:51 -0500
Subject: [PATCH] arp: do neigh confirm based on sk arg
Signed-off-by: YueHabing <yuehaibing@huawei.com>
---
include/net/sock.h | 18 ++++++++++++++++++
net/ipv4/ip_output.c | 5 ++++-
net/ipv4/ping.c | 2 +-
net/ipv4/raw.c | 2 +-
net/ipv4/tcp_input.c | 8 ++------
net/ipv4/tcp_metrics.c | 5 +++--
net/ipv4/udp.c | 2 +-
net/ipv6/raw.c | 2 +-
net/ipv6/route.c | 6 +++---
net/ipv6/udp.c | 2 +-
10 files changed, 35 insertions(+), 17 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 92b2697..ece8dfa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -434,6 +434,7 @@ struct sock {
struct sk_buff *sk_send_head;
__s32 sk_peek_off;
int sk_write_pending;
+ unsigned short sk_dst_pending_confirm;
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
@@ -2263,6 +2264,23 @@ static inline void sk_state_store(struct sock *sk, int newstate)
smp_store_release(&sk->sk_state, newstate);
}
+static inline void dst_confirm_sk(struct sock *sk)
+{
+ sk->sk_dst_pending_confirm = 1;
+}
+
+static inline void dst_neigh_confirm_sk(struct sock *sk, struct neighbour *n)
+{
+ if (sk && sk->sk_dst_pending_confirm) {
+ unsigned long now = jiffies;
+
+ sk->sk_dst_pending_confirm = 0;
+ /* avoid dirtying neighbour */
+ if (n->confirmed != now)
+ n->confirmed = now;
+ }
+}
+
void sock_enable_timestamp(struct sock *sk, int flag);
int sock_get_timestamp(struct sock *, struct timeval __user *);
int sock_get_timestampns(struct sock *, struct timespec __user *);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 877bdb0..7c9b640 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -221,7 +221,10 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
if (unlikely(!neigh))
neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
if (!IS_ERR(neigh)) {
- int res = dst_neigh_output(dst, neigh, skb);
+ int res;
+
+ dst_neigh_confirm_sk(skb->sk, neigh);
+ res = dst_neigh_output(dst, neigh, skb);
rcu_read_unlock_bh();
return res;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 96b8e2b..bcff1c6 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -847,7 +847,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
return err;
do_confirm:
- dst_confirm(&rt->dst);
+ dst_confirm_sk(sk);
if (!(msg->msg_flags & MSG_PROBE) || len)
goto back_from_confirm;
err = 0;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index ecbe5a7..e01424d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -664,7 +664,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
return len;
do_confirm:
- dst_confirm(&rt->dst);
+ dst_confirm_sk(sk);
if (!(msg->msg_flags & MSG_PROBE) || len)
goto back_from_confirm;
err = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c71d49c..92b2060 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3702,9 +3702,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_process_tlp_ack(sk, ack, flag);
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) {
- struct dst_entry *dst = __sk_dst_get(sk);
- if (dst)
- dst_confirm(dst);
+ dst_confirm_sk(sk);
}
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
@@ -6049,9 +6047,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
tcp_set_state(sk, TCP_FIN_WAIT2);
sk->sk_shutdown |= SEND_SHUTDOWN;
- dst = __sk_dst_get(sk);
- if (dst)
- dst_confirm(dst);
+ dst_confirm_sk(sk);
if (!sock_flag(sk, SOCK_DEAD)) {
/* Wake up lingering close() */
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index bf1f3b2..375ff08 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -379,7 +379,8 @@ void tcp_update_metrics(struct sock *sk)
return;
if (dst->flags & DST_HOST)
- dst_confirm(dst);
+ dst_confirm_sk(sk);
+
rcu_read_lock();
if (icsk->icsk_backoff || !tp->srtt_us) {
@@ -496,7 +497,7 @@ void tcp_init_metrics(struct sock *sk)
if (!dst)
goto reset;
- dst_confirm(dst);
+ dst_confirm_sk(sk);
rcu_read_lock();
tm = tcp_get_metrics(sk, dst, true);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5bab6c3..f7852d3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1111,7 +1111,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
return err;
do_confirm:
- dst_confirm(&rt->dst);
+ dst_confirm_sk(sk);
if (!(msg->msg_flags&MSG_PROBE) || len)
goto back_from_confirm;
err = 0;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 054a1d8..bbb09a4 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -927,7 +927,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
txopt_put(opt_to_free);
return err < 0 ? err : len;
do_confirm:
- dst_confirm(dst);
+ dst_confirm_sk(sk);
if (!(msg->msg_flags & MSG_PROBE) || len)
goto back_from_confirm;
err = 0;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1b57e11..8df4671 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1356,7 +1356,7 @@ static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
(rt->rt6i_flags & RTF_PCPU || rt->rt6i_node);
}
-static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
+static void __ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
const struct ipv6hdr *iph, u32 mtu)
{
struct rt6_info *rt6 = (struct rt6_info *)dst;
@@ -1367,7 +1367,7 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
if (dst_metric_locked(dst, RTAX_MTU))
return;
- dst_confirm(dst);
+ dst_confirm_sk(sk);
mtu = max_t(u32, mtu, IPV6_MIN_MTU);
if (mtu >= dst_mtu(dst))
return;
@@ -2248,7 +2248,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
* Look, redirects are sent only in response to data packets,
* so that this nexthop apparently is reachable. --ANK
*/
- dst_confirm(&rt->dst);
+ dst_confirm_sk(sk);
neigh = __neigh_lookup(&nd_tbl, &msg->target, skb->dev, 1);
if (!neigh)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e4a8000..6057101 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1309,7 +1309,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
return err;
do_confirm:
- dst_confirm(dst);
+ dst_confirm_sk(sk);
if (!(msg->msg_flags&MSG_PROBE) || len)
goto back_from_confirm;
err = 0;
--
1.8.3.1
^ permalink raw reply related
* (unknown),
From: Mr Friedrich Mayrhofer @ 2016-12-14 9:45 UTC (permalink / raw)
Good Day,
This is the second time i am sending you this mail.
I, Friedrich Mayrhofer Donate $ 1,000,000.00 to You, Email Me
personally for more details.
Regards.
Friedrich Mayrhofer
^ permalink raw reply
* Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
From: Andrew Lunn @ 2016-12-14 10:46 UTC (permalink / raw)
To: Volodymyr Bendiuga
Cc: Volodymyr Bendiuga, Florian Fainelli, Vivien Didelot, netdev,
Volodymyr Bendiuga
In-Reply-To: <CABHmqqB-6-L124hrAjNxd4kO3zYp9R=RM1X6PXyuQW6Hapiv4g@mail.gmail.com>
On Wed, Dec 14, 2016 at 11:03:14AM +0100, Volodymyr Bendiuga wrote:
> Hi,
>
> I understand what you wrote, but we do not refresh anything
> at intervals. FDB dump works only on user request, i.e. user
> run bridge fdb dump command. What we did is just a smarter
> way of feeding the dump request with entries obtained from
> switchcore. Interesting thing is that fdb dump in switchcore
> reads all entries, and from those entries only one is returned at
> a time, others are discarded, because request comes as per port.
Ah, O.K.
> What we do is we dump entries from switchcore once, when the
> first dump request comes, save all of them in the cache, and then
> all following fdb dump requests for each port will be fed from the cache,
> so no more hardware dump operations. We set the cache to be valid for
> 2 seconds, but this could be adjusted and tweaked. So in our case
> we decrease the number of MDIO reads quite a lot.
> What we are thinking now, is that this logics could be moved
> to net/dsa layer, and maybe unified with fdb load logics, if possible.
We should check what the other switches do. Can they perform a dump
with the hardware performing the port filter? Those switches will not
benefit from such a cache. But they should also not suffer.
Combining it with load is a bigger question. Currently the drivers are
stateless. That makes the error handling very simple. We don't have to
worry about running out of memory, since we don't allocate any memory.
If we run out of memory for this dump cache, it is not serious, since
a dump does not result in state change. But if we are out of memory on
a load, we do have state changes. We have to deal with the complexity
of allocating resources in the _prepare call, and then use the
resource in the do call. I would much prefer to avoid this at first,
by optimising the atu get. If we don't see a significant speed up,
then we can consider this complex solution of a cache for load.
Andrew
^ permalink raw reply
* Re: dsa: handling more than 1 cpu port
From: John Crispin @ 2016-12-14 10:35 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Florian Fainelli, netdev@vger.kernel.org
In-Reply-To: <20161214103153.GC27370@lunn.ch>
On 14/12/2016 11:31, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
>> Hi Andrew,
>>
>> switches supported by qca8k have 2 gmacs that we can wire an external
>> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
>> switch. Thw switch itself is aware of one of the MACs being the cpu port
>> and expects this to be port/mac0. Using the other will break the
>> hardware offloading features.
>
> Just to be sure here. There is no way to use the second port connected
> to the CPU as a CPU port?
both macs are considered cpu ports and both allow for the tag to be
injected. for HW NAT/routing/... offloading to work, the lan ports neet
to trunk via port0 and not port6 however.
>
> The Marvell chips do allow this. So i developed a proof of concept
> which had a mapping between cpu ports and slave ports. slave port X
> should you cpu port y for its traffic. This never got past proof of
> concept.
>
> If this can be made to work for qca8k, i would prefer having this
> general concept, than specific hacks for pass through.
oh cool, can you send those patches my way please ? how do you configure
this from userland ? does the cpu port get its on swdev which i just add
to my lan bridge and then add the 2nd cpu port to the wan bridge ?
John
^ permalink raw reply
* Re: dsa: handling more than 1 cpu port
From: Andrew Lunn @ 2016-12-14 10:31 UTC (permalink / raw)
To: John Crispin; +Cc: Florian Fainelli, netdev@vger.kernel.org
In-Reply-To: <1671ae82-c213-7611-584f-02a3b1d5dff9@phrozen.org>
On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
> Hi Andrew,
>
> switches supported by qca8k have 2 gmacs that we can wire an external
> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
> switch. Thw switch itself is aware of one of the MACs being the cpu port
> and expects this to be port/mac0. Using the other will break the
> hardware offloading features.
Just to be sure here. There is no way to use the second port connected
to the CPU as a CPU port?
The Marvell chips do allow this. So i developed a proof of concept
which had a mapping between cpu ports and slave ports. slave port X
should you cpu port y for its traffic. This never got past proof of
concept.
If this can be made to work for qca8k, i would prefer having this
general concept, than specific hacks for pass through.
Andrew
^ permalink raw reply
* Re: sctp: suspicious rcu_dereference_check() usage in sctp_epaddr_lookup_transport
From: Xin Long @ 2016-12-14 10:04 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Dmitry Vyukov, Vladislav Yasevich, Neil Horman, David Miller,
linux-sctp, netdev, LKML, Eric Dumazet, syzkaller
In-Reply-To: <20161213213730.GA4731@localhost.localdomain>
On Wed, Dec 14, 2016 at 5:37 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Tue, Dec 13, 2016 at 07:07:01PM +0100, Dmitry Vyukov wrote:
>> Hello,
>>
>> I am getting the following reports while running syzkaller fuzzer:
>>
>> [ INFO: suspicious RCU usage. ]
>> 4.9.0+ #85 Not tainted
>> -------------------------------
>> ./include/linux/rhashtable.h:572 suspicious rcu_dereference_check() usage!
>>
>> other info that might help us debug this:
>>
>> rcu_scheduler_active = 1, debug_locks = 0
>> 1 lock held by syz-executor1/18023:
>> #0: (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock
>> include/net/sock.h:1454
>> #0: (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff87bb3ccf>]
>> sctp_getsockopt+0x45f/0x6800 net/sctp/socket.c:6432
>>
>> stack backtrace:
>> CPU: 2 PID: 18023 Comm: syz-executor1 Not tainted 4.9.0+ #85
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> Call Trace:
>> [< inline >] __dump_stack lib/dump_stack.c:15
>> [< none >] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
>> [< none >] lockdep_rcu_suspicious+0x139/0x180
>> kernel/locking/lockdep.c:4448
>> [< inline >] __rhashtable_lookup ./include/linux/rhashtable.h:572
>> [< inline >] rhltable_lookup ./include/linux/rhashtable.h:660
>> [< none >] sctp_epaddr_lookup_transport+0x641/0x930
>> net/sctp/input.c:946
>
> I think this was introduced in the rhlist converstion. We had removed
> some rcu_read_lock() calls on sctp stack because rhashtable was already
> calling it, but then we didn't add them back when moving to rhlist.
>
> This code:
> +/* return a transport without holding it, as it's only used under sock lock */
> struct sctp_transport *sctp_epaddr_lookup_transport(
> const struct sctp_endpoint *ep,
> const union sctp_addr *paddr)
> {
> struct net *net = sock_net(ep->base.sk);
> + struct rhlist_head *tmp, *list;
> + struct sctp_transport *t;
> struct sctp_hash_cmp_arg arg = {
> - .ep = ep,
> .paddr = paddr,
> .net = net,
> + .lport = htons(ep->base.bind_addr.port),
> };
>
> - return rhashtable_lookup_fast(&sctp_transport_hashtable, &arg,
> - sctp_hash_params);
> + list = rhltable_lookup(&sctp_transport_hashtable, &arg,
> + sctp_hash_params);
>
> Had an implicit rcu_read_lock() on rhashtable_lookup_fast, but it
> doesn't on rhltable_lookup and rhltable_lookup uses _rcu calls which
> assumes we have rcu read protection.
You're right, we need to call rcu_read_lock before using rhltable_lookup.
will post a fix for it, thanks.
^ permalink raw reply
* Hi
From: mrsnicole123 @ 2016-12-14 9:49 UTC (permalink / raw)
I am Mrs Nicole Marois, i have a pending project of fulfillment to put
in your hand, i will need your support to make this dream come through,
could you let me know your interest to enable me give you further
information, and I hereby advice that you send the below mentioned
information
I decided to will/donate the sum of $4.5 Million US Dollars to you for
the
good work of God, and also to help the motherless and less privilege
and also forth assistance of the widows. At the moment I cannot take
any telephone calls right now due to the fact that my relatives (that
have squandered the funds agave them for this purpose before) are
around me and my health status also. I have adjusted my will and my
lawyer is aware.
I have willed those properties to you by quoting my personal file
routing and account information. And I have also notified the bank
that I am willing that properties to you for a good, effective and
prudent work. I know I don't know you but I have been directed to do
this by God.ok Please contact this woman for more details you might
not get me on line in time contact this email if you need more
information
ok
Email: mrsnicole2marios@gmail.com
Yours fairly friend,
Mrs Nicole Benoite Marois.
^ permalink raw reply
* dsa: handling more than 1 cpu port
From: John Crispin @ 2016-12-14 10:01 UTC (permalink / raw)
To: Andrew Lunn, Florian Fainelli; +Cc: netdev@vger.kernel.org
Hi Andrew,
switches supported by qca8k have 2 gmacs that we can wire an external
mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
switch. Thw switch itself is aware of one of the MACs being the cpu port
and expects this to be port/mac0. Using the other will break the
hardware offloading features. The original series from Matthieu added a
feature to configure the switch in a pass-through mode [1]. This
resulted in us having to define the pass-through inside the dts which
means that we loose userland configurability. Assume the setup as
follows ...
port0 - cpu port wired to SoCs MAC0
port1-4 - the lan ports
port5 - the wan port
port6 - wired to the SoCs MAC1
What i have done now is bring up one bridge for port1-4 and a second one
for port5-6. Once setup I can pass traffic on the SoCs MAC1 and it will
flow via port6 and egress on port5. So far so good, however due to the
way the port based vlans are setup and how the bridge_join/leave() logic
works, port5/6 will also fwd traffic to the cpu port. the driver has now
to tell that we are trunking traffic on eth1 via port6. also the MII
mode is not known to the driver. Adding some hackish register writes
will make this work nicely. My proposed way of fixing this cleanly in an
upstream friendly way would be as follows
1) add an extra dsa,ethernet property to the 2nd MII port
dsa@0 {
compatible = "qca,ar8xxx";
dsa,ethernet = <&gmac1>;
[...]
switch@0 {
[...]
port@5 {
reg = <5>;
label = "wan";
phy-handle = <&phy_port5>;
};
port@6 {
reg = <6>;
label = "gmac2";
dsa,ethernet = <&gmac2>;
fixed-link {
speed = <1000>;
full-duplex;
};
};
};
};
2) fix up the port_bridge_join/leave() logic such that if a port is
present in the bridge that has a reference to a ethernet interface it
will remove all ports in the bridge from the port based vlan of the
actual cpu port.
3) in case of this switch we also need to fiddle with the bcast/mcast
flooding registers
would this be ok and would it be better to probe the extra dsa_ethernet
inside the subsystem or the driver ? i was considering to do add a
dsa_is_trunk_port() or similar to achieve this.
John
[1] https://patchwork.ozlabs.org/patch/477525/
^ permalink raw reply
* RE: [PATCH 1/3] siphash: add cryptographically secure hashtable function
From: David Laight @ 2016-12-14 9:56 UTC (permalink / raw)
To: 'Jason A. Donenfeld', Netdev, David Miller,
Linus Torvalds, kernel-hardening@lists.openwall.com, LKML,
George Spelvin, Scott Bauer, Andi Kleen, Andy Lutomirski, Greg KH,
Eric Biggers
Cc: Jean-Philippe Aumasson, Daniel J . Bernstein
In-Reply-To: <20161214001656.19388-1-Jason@zx2c4.com>
From: Jason A. Donenfeld
> Sent: 14 December 2016 00:17
> SipHash is a 64-bit keyed hash function that is actually a
> cryptographically secure PRF, like HMAC. Except SipHash is super fast,
> and is meant to be used as a hashtable keyed lookup function.
>
> SipHash isn't just some new trendy hash function. It's been around for a
> while, and there really isn't anything that comes remotely close to
> being useful in the way SipHash is. With that said, why do we need this?
>
> There are a variety of attacks known as "hashtable poisoning" in which an
> attacker forms some data such that the hash of that data will be the
> same, and then preceeds to fill up all entries of a hashbucket. This is
> a realistic and well-known denial-of-service vector.
>
> Linux developers already seem to be aware that this is an issue, and
> various places that use hash tables in, say, a network context, use a
> non-cryptographically secure function (usually jhash) and then try to
> twiddle with the key on a time basis (or in many cases just do nothing
> and hope that nobody notices). While this is an admirable attempt at
> solving the problem, it doesn't actually fix it. SipHash fixes it.
>
> (It fixes it in such a sound way that you could even build a stream
> cipher out of SipHash that would resist the modern cryptanalysis.)
>
> There are a modicum of places in the kernel that are vulnerable to
> hashtable poisoning attacks, either via userspace vectors or network
> vectors, and there's not a reliable mechanism inside the kernel at the
> moment to fix it. The first step toward fixing these issues is actually
> getting a secure primitive into the kernel for developers to use. Then
> we can, bit by bit, port things over to it as deemed appropriate.
>
> Dozens of languages are already using this internally for their hash
> tables. Some of the BSDs already use this in their kernels. SipHash is
> a widely known high-speed solution to a widely known problem, and it's
> time we catch-up.
...
> +u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN])
...
> + u64 k0 = get_unaligned_le64(key);
> + u64 k1 = get_unaligned_le64(key + sizeof(u64));
...
> + m = get_unaligned_le64(data);
All these unaligned accesses are going to get expensive on architectures
like sparc64.
David
^ permalink raw reply
* [PATCH V2] vhost: introduce O(1) vq metadata cache
From: Jason Wang @ 2016-12-14 9:53 UTC (permalink / raw)
To: mst, kvm, virtualization, netdev, linux-kernel
Cc: vkaplans, maxime.coquelin, wexu, peterx
When device IOTLB is enabled, all address translations were stored in
interval tree. O(lgN) searching time could be slow for virtqueue
metadata (avail, used and descriptors) since they were accessed much
often than other addresses. So this patch introduces an O(1) array
which points to the interval tree nodes that store the translations of
vq metadata. Those array were update during vq IOTLB prefetching and
were reset during each invalidation and tlb update. Each time we want
to access vq metadata, this small array were queried before interval
tree. This would be sufficient for static mappings but not dynamic
mappings, we could do optimizations on top.
Test were done with l2fwd in guest (2M hugepage):
noiommu | before | after
tx 1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%)
rx 2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%)
We can almost reach the same performance as noiommu mode.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
Changes from V1:
- silent 32bit build warning
---
drivers/vhost/vhost.c | 136 ++++++++++++++++++++++++++++++++++++++++----------
drivers/vhost/vhost.h | 8 +++
2 files changed, 118 insertions(+), 26 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c6f2d89..50ed625 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -282,6 +282,22 @@ void vhost_poll_queue(struct vhost_poll *poll)
}
EXPORT_SYMBOL_GPL(vhost_poll_queue);
+static void __vhost_vq_meta_reset(struct vhost_virtqueue *vq)
+{
+ int j;
+
+ for (j = 0; j < VHOST_NUM_ADDRS; j++)
+ vq->meta_iotlb[j] = NULL;
+}
+
+static void vhost_vq_meta_reset(struct vhost_dev *d)
+{
+ int i;
+
+ for (i = 0; i < d->nvqs; ++i)
+ __vhost_vq_meta_reset(d->vqs[i]);
+}
+
static void vhost_vq_reset(struct vhost_dev *dev,
struct vhost_virtqueue *vq)
{
@@ -311,6 +327,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->busyloop_timeout = 0;
vq->umem = NULL;
vq->iotlb = NULL;
+ __vhost_vq_meta_reset(vq);
}
static int vhost_worker(void *data)
@@ -690,6 +707,18 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
return 1;
}
+static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
+ u64 addr, unsigned int size,
+ int type)
+{
+ const struct vhost_umem_node *node = vq->meta_iotlb[type];
+
+ if (!node)
+ return NULL;
+
+ return (void *)(uintptr_t)(node->userspace_addr + addr - node->start);
+}
+
/* Can we switch to this memory table? */
/* Caller should have device mutex but not vq mutex */
static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
@@ -732,8 +761,14 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,
* could be access through iotlb. So -EAGAIN should
* not happen in this case.
*/
- /* TODO: more fast path */
struct iov_iter t;
+ void __user *uaddr = vhost_vq_meta_fetch(vq,
+ (u64)(uintptr_t)to, size,
+ VHOST_ADDR_DESC);
+
+ if (uaddr)
+ return __copy_to_user(uaddr, from, size);
+
ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
ARRAY_SIZE(vq->iotlb_iov),
VHOST_ACCESS_WO);
@@ -761,8 +796,14 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
* could be access through iotlb. So -EAGAIN should
* not happen in this case.
*/
- /* TODO: more fast path */
+ void __user *uaddr = vhost_vq_meta_fetch(vq,
+ (u64)(uintptr_t)from, size,
+ VHOST_ADDR_DESC);
struct iov_iter f;
+
+ if (uaddr)
+ return __copy_from_user(to, uaddr, size);
+
ret = translate_desc(vq, (u64)(uintptr_t)from, size, vq->iotlb_iov,
ARRAY_SIZE(vq->iotlb_iov),
VHOST_ACCESS_RO);
@@ -782,17 +823,12 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
return ret;
}
-static void __user *__vhost_get_user(struct vhost_virtqueue *vq,
- void *addr, unsigned size)
+static void __user *__vhost_get_user_slow(struct vhost_virtqueue *vq,
+ void *addr, unsigned int size,
+ int type)
{
int ret;
- /* This function should be called after iotlb
- * prefetch, which means we're sure that vq
- * could be access through iotlb. So -EAGAIN should
- * not happen in this case.
- */
- /* TODO: more fast path */
ret = translate_desc(vq, (u64)(uintptr_t)addr, size, vq->iotlb_iov,
ARRAY_SIZE(vq->iotlb_iov),
VHOST_ACCESS_RO);
@@ -813,14 +849,32 @@ static void __user *__vhost_get_user(struct vhost_virtqueue *vq,
return vq->iotlb_iov[0].iov_base;
}
-#define vhost_put_user(vq, x, ptr) \
+/* This function should be called after iotlb
+ * prefetch, which means we're sure that vq
+ * could be access through iotlb. So -EAGAIN should
+ * not happen in this case.
+ */
+static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
+ void *addr, unsigned int size,
+ int type)
+{
+ void __user *uaddr = vhost_vq_meta_fetch(vq,
+ (u64)(uintptr_t)addr, size, type);
+ if (uaddr)
+ return uaddr;
+
+ return __vhost_get_user_slow(vq, addr, size, type);
+}
+
+#define vhost_put_user(vq, x, ptr) \
({ \
int ret = -EFAULT; \
if (!vq->iotlb) { \
ret = __put_user(x, ptr); \
} else { \
__typeof__(ptr) to = \
- (__typeof__(ptr)) __vhost_get_user(vq, ptr, sizeof(*ptr)); \
+ (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
+ sizeof(*ptr), VHOST_ADDR_USED); \
if (to != NULL) \
ret = __put_user(x, to); \
else \
@@ -829,14 +883,16 @@ static void __user *__vhost_get_user(struct vhost_virtqueue *vq,
ret; \
})
-#define vhost_get_user(vq, x, ptr) \
+#define vhost_get_user(vq, x, ptr, type) \
({ \
int ret; \
if (!vq->iotlb) { \
ret = __get_user(x, ptr); \
} else { \
__typeof__(ptr) from = \
- (__typeof__(ptr)) __vhost_get_user(vq, ptr, sizeof(*ptr)); \
+ (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
+ sizeof(*ptr), \
+ type); \
if (from != NULL) \
ret = __get_user(x, from); \
else \
@@ -845,6 +901,12 @@ static void __user *__vhost_get_user(struct vhost_virtqueue *vq,
ret; \
})
+#define vhost_get_avail(vq, x, ptr) \
+ vhost_get_user(vq, x, ptr, VHOST_ADDR_AVAIL)
+
+#define vhost_get_used(vq, x, ptr) \
+ vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
+
static void vhost_dev_lock_vqs(struct vhost_dev *d)
{
int i = 0;
@@ -950,6 +1012,7 @@ int vhost_process_iotlb_msg(struct vhost_dev *dev,
ret = -EFAULT;
break;
}
+ vhost_vq_meta_reset(dev);
if (vhost_new_umem_range(dev->iotlb, msg->iova, msg->size,
msg->iova + msg->size - 1,
msg->uaddr, msg->perm)) {
@@ -959,6 +1022,7 @@ int vhost_process_iotlb_msg(struct vhost_dev *dev,
vhost_iotlb_notify_vq(dev, msg);
break;
case VHOST_IOTLB_INVALIDATE:
+ vhost_vq_meta_reset(dev);
vhost_del_umem_range(dev->iotlb, msg->iova,
msg->iova + msg->size - 1);
break;
@@ -1102,12 +1166,26 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
sizeof *used + num * sizeof *used->ring + s);
}
+static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
+ const struct vhost_umem_node *node,
+ int type)
+{
+ int access = (type == VHOST_ADDR_USED) ?
+ VHOST_ACCESS_WO : VHOST_ACCESS_RO;
+
+ if (likely(node->perm & access))
+ vq->meta_iotlb[type] = node;
+}
+
static int iotlb_access_ok(struct vhost_virtqueue *vq,
- int access, u64 addr, u64 len)
+ int access, u64 addr, u64 len, int type)
{
const struct vhost_umem_node *node;
struct vhost_umem *umem = vq->iotlb;
- u64 s = 0, size;
+ u64 s = 0, size, orig_addr = addr;
+
+ if (vhost_vq_meta_fetch(vq, addr, len, type))
+ return true;
while (len > s) {
node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
@@ -1124,6 +1202,10 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
}
size = node->size - addr + node->start;
+
+ if (orig_addr == addr && size >= len)
+ vhost_vq_meta_update(vq, node, type);
+
s += size;
addr += size;
}
@@ -1140,13 +1222,15 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
return 1;
return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
- num * sizeof *vq->desc) &&
+ num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
sizeof *vq->avail +
- num * sizeof *vq->avail->ring + s) &&
+ num * sizeof(*vq->avail->ring) + s,
+ VHOST_ADDR_AVAIL) &&
iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
sizeof *vq->used +
- num * sizeof *vq->used->ring + s);
+ num * sizeof(*vq->used->ring) + s,
+ VHOST_ADDR_USED);
}
EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
@@ -1729,7 +1813,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
r = -EFAULT;
goto err;
}
- r = vhost_get_user(vq, last_used_idx, &vq->used->idx);
+ r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
if (r) {
vq_err(vq, "Can't access used idx at %p\n",
&vq->used->idx);
@@ -1932,7 +2016,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Check it isn't doing very strange things with descriptor numbers. */
last_avail_idx = vq->last_avail_idx;
- if (unlikely(vhost_get_user(vq, avail_idx, &vq->avail->idx))) {
+ if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
&vq->avail->idx);
return -EFAULT;
@@ -1954,7 +2038,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Grab the next descriptor number they're advertising, and increment
* the index we've seen. */
- if (unlikely(vhost_get_user(vq, ring_head,
+ if (unlikely(vhost_get_avail(vq, ring_head,
&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
vq_err(vq, "Failed to read head: idx %d address %p\n",
last_avail_idx,
@@ -2170,7 +2254,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
__virtio16 flags;
- if (vhost_get_user(vq, flags, &vq->avail->flags)) {
+ if (vhost_get_avail(vq, flags, &vq->avail->flags)) {
vq_err(vq, "Failed to get flags");
return true;
}
@@ -2184,7 +2268,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
if (unlikely(!v))
return true;
- if (vhost_get_user(vq, event, vhost_used_event(vq))) {
+ if (vhost_get_avail(vq, event, vhost_used_event(vq))) {
vq_err(vq, "Failed to get used event idx");
return true;
}
@@ -2226,7 +2310,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
__virtio16 avail_idx;
int r;
- r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
+ r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
if (r)
return false;
@@ -2261,7 +2345,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
* sure it's written, then check again. */
smp_mb();
- r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
+ r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
if (r) {
vq_err(vq, "Failed to check avail idx at %p: %d\n",
&vq->avail->idx, r);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 78f3c5f..034ea18 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -76,6 +76,13 @@ struct vhost_umem {
int numem;
};
+enum vhost_uaddr_type {
+ VHOST_ADDR_DESC = 0,
+ VHOST_ADDR_AVAIL = 1,
+ VHOST_ADDR_USED = 2,
+ VHOST_NUM_ADDRS = 3,
+};
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -86,6 +93,7 @@ struct vhost_virtqueue {
struct vring_desc __user *desc;
struct vring_avail __user *avail;
struct vring_used __user *used;
+ const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
struct file *kick;
struct file *call;
struct file *error;
--
2.7.4
^ permalink raw reply related
* RE: [PATCH 3/3] secure_seq: use fast&secure siphash instead of slow&insecure md5
From: David Laight @ 2016-12-14 9:51 UTC (permalink / raw)
To: 'Jason A. Donenfeld', Netdev, David Miller,
Linus Torvalds, kernel-hardening@lists.openwall.com, LKML,
George Spelvin, Scott Bauer, Andi Kleen, Andy Lutomirski, Greg KH,
Eric Biggers
In-Reply-To: <20161214001656.19388-3-Jason@zx2c4.com>
From: Jason A. Donenfeld
> Sent: 14 December 2016 00:17
> This gives a clear speed and security improvement. Rather than manually
> filling MD5 buffers, we simply create a layout by a simple anonymous
> struct, for which gcc generates rather efficient code.
...
> + const struct {
> + struct in6_addr saddr;
> + struct in6_addr daddr;
> + __be16 sport;
> + __be16 dport;
> + } __packed combined = {
> + .saddr = *(struct in6_addr *)saddr,
> + .daddr = *(struct in6_addr *)daddr,
> + .sport = sport,
> + .dport = dport
> + };
You need to look at the effect of marking this (and the other)
structures 'packed' on architectures like sparc64.
David
^ permalink raw reply
* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Jesper Dangaard Brouer @ 2016-12-14 9:39 UTC (permalink / raw)
To: John Fastabend
Cc: David Miller, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich,
brouer
In-Reply-To: <58505535.1080908@gmail.com>
On Tue, 13 Dec 2016 12:08:21 -0800
John Fastabend <john.fastabend@gmail.com> wrote:
> On 16-12-13 11:53 AM, David Miller wrote:
> > From: John Fastabend <john.fastabend@gmail.com>
> > Date: Tue, 13 Dec 2016 09:43:59 -0800
> >
> >> What does "zero-copy send packet-pages to the application/socket that
> >> requested this" mean? At the moment on x86 page-flipping appears to be
> >> more expensive than memcpy (I can post some data shortly) and shared
> >> memory was proposed and rejected for security reasons when we were
> >> working on bifurcated driver.
> >
> > The whole idea is that we map all the active RX ring pages into
> > userspace from the start.
> >
> > And just how Jesper's page pool work will avoid DMA map/unmap,
> > it will also avoid changing the userspace mapping of the pages
> > as well.
> >
> > Thus avoiding the TLB/VM overhead altogether.
> >
Exactly. It is worth mentioning that pages entering the page pool need
to be cleared (measured cost 143 cycles), in order to not leak any
kernel info. The primary focus of this design is to make sure not to
leak kernel info to userspace, but with an "exclusive" mode also
support isolation between applications.
> I get this but it requires applications to be isolated. The pages from
> a queue can not be shared between multiple applications in different
> trust domains. And the application has to be cooperative meaning it
> can't "look" at data that has not been marked by the stack as OK. In
> these schemes we tend to end up with something like virtio/vhost or
> af_packet.
I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
two would require CAP_NET_ADMIN privileges. All modes have a trust
domain id, that need to match e.g. when page reach the socket.
Mode-1 "Shared": Application choose lowest isolation level, allowing
multiple application to mmap VMA area.
Mode-2 "Single-user": Application request it want to be the only user
of the RX queue. This blocks other application to mmap VMA area.
Mode-3 "Exclusive": Application request to own RX queue. Packets are
no longer allowed for normal netstack delivery.
Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
still allowed to travel netstack and thus can contain packet data from
other normal applications. This is part of the design, to share the
NIC between netstack and an accelerated userspace application using RX
zero-copy delivery.
> Any ACLs/filtering/switching/headers need to be done in hardware or
> the application trust boundaries are broken.
The software solution outlined allow the application to make the choice
of what trust boundary it wants.
The "exclusive" mode-3 make most sense together with HW filters.
Already today, we support creating a new RX queue based on ethtool
ntuple HW filter and then you simply attach your application that queue
in mode-3, and have full isolation.
> If the above can not be met then a copy is needed. What I am trying
> to tease out is the above comment along with other statements like
> this "can be done with out HW filter features".
Does this address your concerns?
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH] vhost: introduce O(1) vq metadata cache
From: Jason Wang @ 2016-12-14 9:33 UTC (permalink / raw)
To: kbuild test robot
Cc: kvm, mst, netdev, linux-kernel, peterx, virtualization,
maxime.coquelin, kbuild-all, vkaplans, wexu
In-Reply-To: <201612141605.QJowNf4i%fengguang.wu@intel.com>
On 2016年12月14日 16:14, kbuild test robot wrote:
> Hi Jason,
>
> [auto build test WARNING on vhost/linux-next]
> [also build test WARNING on v4.9 next-20161214]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Jason-Wang/vhost-introduce-O-1-vq-metadata-cache/20161214-160153
> base: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
> config: i386-randconfig-x005-201650 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
Thanks, V2 will be posted soon.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [Query] Delayed vxlan socket creation?
From: Jiri Benc @ 2016-12-14 9:29 UTC (permalink / raw)
To: Du, Fan; +Cc: netdev@vger.kernel.org, mrjana@gmail.com
In-Reply-To: <5A90DA2E42F8AE43BC4A093BF06788481A9457F1@SHSMSX103.ccr.corp.intel.com>
On Wed, 14 Dec 2016 07:49:24 +0000, Du, Fan wrote:
> I'm interested to one Docker issue[1] which looks like related to kernel vxlan socket creation
> as described in the thread. From my limited knowledge here, socket creation is synchronous ,
> and after the *socket* syscall, the sock handle will be valid and ready to linkup.
>
> Somehow I'm not sure the detailed scenario here, and which/how possible commit fix?
baf606d9c9b1^..56ef9c909b40
Jiri
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox