* [PATCH V2 0/7] net: lorawan: Add LoRaWAN soft MAC module
From: Jian-Hong Pan @ 2018-11-05 16:55 UTC
To: Andreas Färber
Cc: netdev, linux-arm-kernel, linux-kernel, Marcel Holtmann,
David S . Miller, Dollar Chen, Ken Yu, linux-wpan, Stefan Schmidt,
Jian-Hong Pan
In-Reply-To: <fc737f3940bbe91341fb15d85ac11931eb56d1fc.1535039998.git.starnight@g.ncu.edu.tw>
LoRaWAN(TM) is the MAC layer defined by the LoRa Alliance(TM) for LoRa
devices. LoRa is one of the Low-Power Wide-Area Network (LPWAN)
technologies. LoRaWAN networks are typically laid out in a star-of-stars
topology in which gateways relay messages between end-devices and a
central network server at the backend. Gateways are connected to the
network server via standard IP connections, while end-devices use
single-hop LoRa(TM) or FSK communication to one or many gateways.
A LoRa network distinguishes between a basic LoRaWAN (named Class A) and
optional features (Class B, Class C ...):
* Bi-directional end-devices (Class A)
* Bi-directional end-devices with scheduled receive slots (Class B)
* Bi-directional end-devices with maximal receive slots (Class C)
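In Class A, each uplink is followed by two short receive windows, and the soft MAC has to honor that timing. The following is a minimal userspace sketch. It assumes the default RECEIVE_DELAY1 of 1 s and RECEIVE_DELAY2 of 2 s from LoRaWAN 1.0.2. The struct and function names are hypothetical and are not part of this series.

```c
#include <stdint.h>

/* Assumed defaults (LoRaWAN 1.0.2): RX1 opens 1 s after the end of the
 * uplink, and RX2 always opens 1 s after RX1. The network may change the
 * RX1 delay with a MAC command, which this sketch ignores. */
#define RECEIVE_DELAY1_MS 1000u
#define RECEIVE_DELAY2_MS (RECEIVE_DELAY1_MS + 1000u)

/* Hypothetical helper type: absolute opening times of both windows. */
struct class_a_windows {
	uint64_t rx1_open_ms;
	uint64_t rx2_open_ms;
};

/* Compute when RX1 and RX2 open for an uplink that ended at tx_end_ms. */
static struct class_a_windows class_a_rx_windows(uint64_t tx_end_ms)
{
	struct class_a_windows w;

	w.rx1_open_ms = tx_end_ms + RECEIVE_DELAY1_MS;
	w.rx2_open_ms = tx_end_ms + RECEIVE_DELAY2_MS;
	return w;
}
```

If a valid downlink for the device is received in RX1, the device does not open RX2. Classes B and C add further receive windows on top of this baseline.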
This patch set adds a LoRaWAN class module that implements the stack,
especially the soft MAC, between the socket APIs and the LoRa device drivers.
socket APIs:
send and receive the data
------------------------------------------------------------------------
LoRaWAN class module implements soft MAC:
append the header/footer, encryption/decryption, timing slot and MAC
commands
------------------------------------------------------------------------
LoRa device drivers:
send and receive the messages for MAC layer
------------------------------------------------------------------------
LoRa devices
This module starts simple and implements a subset of the Class A
end-device features defined in the LoRaWAN(TM) Specification Ver. 1.0.2.
More features, for example regional parameters, confirmed data messages,
join request/accept messages for Over-The-Air Activation and MAC
commands, will be added in the future.
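The soft MAC's job of appending the header can be pictured with the LoRaWAN 1.0.2 uplink layout: PHYPayload = MHDR | FHDR | FPort | FRMPayload | MIC, where FHDR = DevAddr | FCtrl | FCnt | FOpts. The following userspace sketch writes the fixed part of an unconfirmed data uplink. The function name is hypothetical, and the sketch assumes no FOpts. It is not code from this series, whose lrw_dev_hard_header() is still a TODO.

```c
#include <stddef.h>
#include <stdint.h>

#define LRW_MTYPE_UNCONFIRMED_DATA_UP 0x2 /* MType bits 7..5 = 010 */
#define LRW_MAJOR_R1                  0x0 /* Major bits 1..0 = 00 */
#define LRW_UPLINK_HDR_LEN            9   /* MHDR + FHDR w/o FOpts + FPort */

/* Write MHDR, FHDR (without FOpts) and FPort into buf. FRMPayload and the
 * 4-byte MIC follow. Returns bytes written, or 0 if buf is too small. */
static size_t lrw_build_uplink_hdr(uint8_t *buf, size_t buf_len,
				   uint32_t devaddr, uint8_t fctrl,
				   uint16_t fcnt, uint8_t fport)
{
	if (buf_len < LRW_UPLINK_HDR_LEN)
		return 0;

	buf[0] = (uint8_t)((LRW_MTYPE_UNCONFIRMED_DATA_UP << 5) | LRW_MAJOR_R1);
	/* Multi-byte fields are transmitted little-endian. */
	buf[1] = devaddr & 0xff;
	buf[2] = (devaddr >> 8) & 0xff;
	buf[3] = (devaddr >> 16) & 0xff;
	buf[4] = (devaddr >> 24) & 0xff;
	buf[5] = fctrl & 0xf0;          /* FOptsLen (low nibble) forced to 0 */
	buf[6] = fcnt & 0xff;           /* low 16 bits of the frame counter */
	buf[7] = (fcnt >> 8) & 0xff;
	buf[8] = fport;
	return LRW_UPLINK_HDR_LEN;
}
```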
Jian-Hong Pan (7):
net: lorawan: Add macro and definition for LoRaWAN
net: lorawan: Add LoRaWAN socket module
net: lorawan: Add LoRaWAN API declaration for LoRa devices
net: maclorawan: Add maclorawan module declaration
net: maclorawan: Implement the crypto of maclorawan module
net: maclorawan: Implement maclorawan class module
net: lorawan: List LORAWAN in menuconfig
include/linux/lora/lorawan.h | 137 ++++++
include/linux/lora/lorawan_netdev.h | 52 +++
include/linux/socket.h | 5 +-
include/uapi/linux/if_arp.h | 1 +
include/uapi/linux/if_ether.h | 1 +
net/Kconfig | 2 +
net/Makefile | 2 +
net/core/dev.c | 4 +-
net/lorawan/Kconfig | 10 +
net/lorawan/Makefile | 2 +
net/lorawan/socket.c | 681 ++++++++++++++++++++++++++++
net/maclorawan/Kconfig | 14 +
net/maclorawan/Makefile | 2 +
net/maclorawan/crypto.c | 209 +++++++++
net/maclorawan/crypto.h | 27 ++
net/maclorawan/mac.c | 522 +++++++++++++++++++++
net/maclorawan/maclorawan.h | 199 ++++++++
net/maclorawan/main.c | 600 ++++++++++++++++++++++++
security/selinux/hooks.c | 4 +-
security/selinux/include/classmap.h | 4 +-
20 files changed, 2473 insertions(+), 5 deletions(-)
create mode 100644 include/linux/lora/lorawan.h
create mode 100644 include/linux/lora/lorawan_netdev.h
create mode 100644 net/lorawan/Kconfig
create mode 100644 net/lorawan/Makefile
create mode 100644 net/lorawan/socket.c
create mode 100644 net/maclorawan/Kconfig
create mode 100644 net/maclorawan/Makefile
create mode 100644 net/maclorawan/crypto.c
create mode 100644 net/maclorawan/crypto.h
create mode 100644 net/maclorawan/mac.c
create mode 100644 net/maclorawan/maclorawan.h
create mode 100644 net/maclorawan/main.c
--
2.19.1
* [PATCH V2 1/7] net: lorawan: Add macro and definition for LoRaWAN
From: Jian-Hong Pan @ 2018-11-05 16:55 UTC
To: Andreas Färber
Cc: netdev, linux-arm-kernel, linux-kernel, Marcel Holtmann,
David S . Miller, Dollar Chen, Ken Yu, linux-wpan, Stefan Schmidt,
Jian-Hong Pan
In-Reply-To: <fc737f3940bbe91341fb15d85ac11931eb56d1fc.1535039998.git.starnight@g.ncu.edu.tw>
This patch adds the macros and definitions needed to implement the
LoRaWAN protocol.
Signed-off-by: Jian-Hong Pan <starnight@g.ncu.edu.tw>
---
V2:
- Modify the commit message
include/linux/socket.h | 5 ++++-
include/uapi/linux/if_arp.h | 1 +
include/uapi/linux/if_ether.h | 1 +
net/core/dev.c | 4 ++--
security/selinux/hooks.c | 4 +++-
security/selinux/include/classmap.h | 4 +++-
6 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index aa1e288b1659..e5c8381fd1aa 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -209,8 +209,9 @@ struct ucred {
*/
#define AF_XDP 44 /* XDP sockets */
#define AF_LORA 45 /* LoRa sockets */
+#define AF_LORAWAN 46 /* LoRaWAN sockets */
-#define AF_MAX 46 /* For now.. */
+#define AF_MAX 47 /* For now.. */
/* Protocol families, same as address families. */
#define PF_UNSPEC AF_UNSPEC
@@ -261,6 +262,7 @@ struct ucred {
#define PF_SMC AF_SMC
#define PF_XDP AF_XDP
#define PF_LORA AF_LORA
+#define PF_LORAWAN AF_LORAWAN
#define PF_MAX AF_MAX
/* Maximum queue length specifiable by listen. */
@@ -343,6 +345,7 @@ struct ucred {
#define SOL_KCM 281
#define SOL_TLS 282
#define SOL_XDP 283
+#define SOL_LORAWAN 284
/* IPX options */
#define IPX_TYPE 1
diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
index 1ed7cb3f2129..2376f7839355 100644
--- a/include/uapi/linux/if_arp.h
+++ b/include/uapi/linux/if_arp.h
@@ -99,6 +99,7 @@
#define ARPHRD_6LOWPAN 825 /* IPv6 over LoWPAN */
#define ARPHRD_VSOCKMON 826 /* Vsock monitor header */
#define ARPHRD_LORA 827 /* LoRa */
+#define ARPHRD_LORAWAN 828 /* LoRaWAN */
#define ARPHRD_VOID 0xFFFF /* Void type, nothing is known */
#define ARPHRD_NONE 0xFFFE /* zero header length */
diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 45644dcf5b39..b1ac70d4a377 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -148,6 +148,7 @@
* aggregation protocol
*/
#define ETH_P_LORA 0x00FA /* LoRa */
+#define ETH_P_LORAWAN 0x00FB /* LoRaWAN */
/*
* This is an Ethernet frame header.
diff --git a/net/core/dev.c b/net/core/dev.c
index f68122f0ab02..b95ce79ec5a8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -297,7 +297,7 @@ static const unsigned short netdev_lock_type[] = {
ARPHRD_IRDA, ARPHRD_FCPP, ARPHRD_FCAL, ARPHRD_FCPL,
ARPHRD_FCFABRIC, ARPHRD_IEEE80211, ARPHRD_IEEE80211_PRISM,
ARPHRD_IEEE80211_RADIOTAP, ARPHRD_PHONET, ARPHRD_PHONET_PIPE,
- ARPHRD_IEEE802154, ARPHRD_VOID, ARPHRD_NONE};
+ ARPHRD_IEEE802154, ARPHRD_LORAWAN, ARPHRD_VOID, ARPHRD_NONE};
static const char *const netdev_lock_name[] = {
"_xmit_NETROM", "_xmit_ETHER", "_xmit_EETHER", "_xmit_AX25",
@@ -314,7 +314,7 @@ static const char *const netdev_lock_name[] = {
"_xmit_IRDA", "_xmit_FCPP", "_xmit_FCAL", "_xmit_FCPL",
"_xmit_FCFABRIC", "_xmit_IEEE80211", "_xmit_IEEE80211_PRISM",
"_xmit_IEEE80211_RADIOTAP", "_xmit_PHONET", "_xmit_PHONET_PIPE",
- "_xmit_IEEE802154", "_xmit_VOID", "_xmit_NONE"};
+ "_xmit_IEEE802154", "_xmit_LORAWAN", "_xmit_VOID", "_xmit_NONE"};
static struct lock_class_key netdev_xmit_lock_key[ARRAY_SIZE(netdev_lock_type)];
static struct lock_class_key netdev_addr_lock_key[ARRAY_SIZE(netdev_lock_type)];
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index aaf520a689d8..0da3a1d69cb8 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1477,7 +1477,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
return SECCLASS_XDP_SOCKET;
case PF_LORA:
return SECCLASS_LORA_SOCKET;
-#if PF_MAX > 46
+ case PF_LORAWAN:
+ return SECCLASS_LORAWAN_SOCKET;
+#if PF_MAX > 47
#error New address family defined, please update this function.
#endif
}
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 060d4bf8385e..fa0151fe6f32 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -244,9 +244,11 @@ struct security_class_mapping secclass_map[] = {
{ COMMON_SOCK_PERMS, NULL } },
{ "lora_socket",
{ COMMON_SOCK_PERMS, NULL } },
+ { "lorawan_socket",
+ { COMMON_SOCK_PERMS, NULL } },
{ NULL }
};
-#if PF_MAX > 46
+#if PF_MAX > 47
#error New address family defined, please update secclass_map.
#endif
--
2.19.1
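In net/core/dev.c, netdev_lock_type[] and netdev_lock_name[] are parallel tables looked up by position. This is why the patch inserts ARPHRD_LORAWAN and "_xmit_LORAWAN" at the same index. The following userspace model keeps only the tail of each table. The fallback to the last slot mirrors netdev_lock_pos() in kernels of this era. The _Static_assert is an illustrative addition and is not taken from dev.c.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Values from include/uapi/linux/if_arp.h (828 as proposed by this patch). */
#define ARPHRD_IEEE802154 804
#define ARPHRD_LORAWAN    828
#define ARPHRD_VOID       0xFFFF
#define ARPHRD_NONE       0xFFFE

/* Trimmed-down model of the two parallel tables. */
static const unsigned short lock_type[] = {
	ARPHRD_IEEE802154, ARPHRD_LORAWAN, ARPHRD_VOID, ARPHRD_NONE};

static const char *const lock_name[] = {
	"_xmit_IEEE802154", "_xmit_LORAWAN", "_xmit_VOID", "_xmit_NONE"};

/* Both tables share one index, so a one-sided edit must fail to build. */
_Static_assert(ARRAY_SIZE(lock_type) == ARRAY_SIZE(lock_name),
	       "lock_type[] and lock_name[] must stay parallel");

/* Linear search; unknown types fall back to the last entry. */
static size_t lock_pos(unsigned short dev_type)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(lock_type); i++)
		if (lock_type[i] == dev_type)
			return i;
	return ARRAY_SIZE(lock_type) - 1;
}

static const char *lock_class_name(unsigned short dev_type)
{
	return lock_name[lock_pos(dev_type)];
}
```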
* [PATCH V2 2/7] net: lorawan: Add LoRaWAN socket module
From: Jian-Hong Pan @ 2018-11-05 16:55 UTC
To: Andreas Färber
Cc: netdev, linux-arm-kernel, linux-kernel, Marcel Holtmann,
David S . Miller, Dollar Chen, Ken Yu, linux-wpan, Stefan Schmidt,
Jian-Hong Pan
In-Reply-To: <fc737f3940bbe91341fb15d85ac11931eb56d1fc.1535039998.git.starnight@g.ncu.edu.tw>
This patch adds a new address/protocol family for LoRaWAN networks.
It also implements the socket functions, mapped to a datagram socket,
for LoRaWAN unconfirmed data messages.
Signed-off-by: Jian-Hong Pan <starnight@g.ncu.edu.tw>
---
V2:
- Split the LoRaWAN class module patch in V1 into LoRaWAN socket and
LoRaWAN Soft MAC modules
- Add lorawan_netdev.h header file for network address related
declaration
- Use SPDX license identifiers
include/linux/lora/lorawan_netdev.h | 52 +++
net/lorawan/Kconfig | 10 +
net/lorawan/Makefile | 2 +
net/lorawan/socket.c | 681 ++++++++++++++++++++++++++++
4 files changed, 745 insertions(+)
create mode 100644 include/linux/lora/lorawan_netdev.h
create mode 100644 net/lorawan/Kconfig
create mode 100644 net/lorawan/Makefile
create mode 100644 net/lorawan/socket.c
diff --git a/include/linux/lora/lorawan_netdev.h b/include/linux/lora/lorawan_netdev.h
new file mode 100644
index 000000000000..4adf93fd06c5
--- /dev/null
+++ b/include/linux/lora/lorawan_netdev.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later OR BSD-3-Clause */
+/*-
+ * LoRaWAN stack related definitions
+ *
+ * Copyright (c) 2018 Jian-Hong, Pan <starnight@g.ncu.edu.tw>
+ *
+ */
+
+#ifndef __LORAWAN_NET_DEVICE_H__
+#define __LORAWAN_NET_DEVICE_H__
+
+enum {
+ LRW_ADDR_APPEUI,
+ LRW_ADDR_DEVEUI,
+ LRW_ADDR_DEVADDR,
+};
+
+struct lrw_addr_in {
+ int addr_type;
+ union {
+ u64 app_eui;
+ u64 dev_eui;
+ u32 devaddr;
+ };
+};
+
+struct sockaddr_lorawan {
+ sa_family_t family; /* AF_LORAWAN */
+ struct lrw_addr_in addr_in;
+};
+
+/**
+ * lrw_mac_cb - This structure holds the control buffer (cb) of sk_buff
+ *
+ * @devaddr: the LoRaWAN device address of this LoRaWAN hardware
+ */
+struct lrw_mac_cb {
+ u32 devaddr;
+};
+
+/**
+ * mac_cb - Get the LoRaWAN MAC control buffer of the sk_buff
+ * @skb: the exchanging sk_buff
+ *
+ * Return: the pointer of LoRaWAN MAC control buffer
+ */
+static inline struct lrw_mac_cb *mac_cb(struct sk_buff *skb)
+{
+ return (struct lrw_mac_cb *)skb->cb;
+}
+
+#endif
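lorawan_netdev.h lives under include/linux rather than include/uapi, so a userspace program would have to mirror its types. The following sketch binds a datagram socket to a device address. It assumes AF_LORAWAN keeps the value 46 proposed in patch 1 and that userspace u64/u32 equivalents give the same layout. The helper names lrw_fill_devaddr() and lrw_open_bound() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Value proposed by this series; not present in released libc headers. */
#ifndef AF_LORAWAN
#define AF_LORAWAN 46
#endif
#define LRW_ADDR_DEVADDR 2 /* third enumerator: APPEUI, DEVEUI, DEVADDR */

/* Userspace mirror of the kernel's struct lrw_addr_in / sockaddr_lorawan. */
struct lrw_addr_in {
	int addr_type;
	union {
		uint64_t app_eui;
		uint64_t dev_eui;
		uint32_t devaddr;
	};
};

struct sockaddr_lorawan {
	sa_family_t family;
	struct lrw_addr_in addr_in;
};

/* Fill a sockaddr that dgram_bind() would accept (type must be DEVADDR). */
static void lrw_fill_devaddr(struct sockaddr_lorawan *sa, uint32_t devaddr)
{
	memset(sa, 0, sizeof(*sa));
	sa->family = AF_LORAWAN;
	sa->addr_in.addr_type = LRW_ADDR_DEVADDR;
	sa->addr_in.devaddr = devaddr;
}

/* Open a LoRaWAN datagram socket bound to devaddr. Returns the fd, or
 * -errno (e.g. -EAFNOSUPPORT on a kernel without this series). */
static int lrw_open_bound(uint32_t devaddr)
{
	struct sockaddr_lorawan sa;
	int fd = socket(AF_LORAWAN, SOCK_DGRAM, 0);

	if (fd < 0)
		return -errno;
	lrw_fill_devaddr(&sa, devaddr);
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	return fd;
}
```

Passing sizeof(sa) matters because dgram_bind() rejects any length shorter than struct sockaddr_lorawan.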
diff --git a/net/lorawan/Kconfig b/net/lorawan/Kconfig
new file mode 100644
index 000000000000..7f2f344085c4
--- /dev/null
+++ b/net/lorawan/Kconfig
@@ -0,0 +1,10 @@
+config LORAWAN
+ tristate "LoRaWAN Network support"
+ ---help---
+ LoRaWAN defines low-data-rate, low-power and long-range wireless
+ wide area networks. It was designed to organize networks of automation
+ devices, such as sensors, switches and actuators, and a single link
+ can span several kilometers.
+
+ Say Y here to compile LoRaWAN support into the kernel or say M to
+ compile it as a module.
diff --git a/net/lorawan/Makefile b/net/lorawan/Makefile
new file mode 100644
index 000000000000..8c923ca6541a
--- /dev/null
+++ b/net/lorawan/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_LORAWAN) += lorawan.o
+lorawan-objs := socket.o
diff --git a/net/lorawan/socket.c b/net/lorawan/socket.c
new file mode 100644
index 000000000000..0c03f7a0fb0e
--- /dev/null
+++ b/net/lorawan/socket.c
@@ -0,0 +1,681 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later OR BSD-3-Clause */
+/*-
+ * LoRaWAN socket module
+ *
+ * Copyright (c) 2018 Jian-Hong, Pan <starnight@g.ncu.edu.tw>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/net.h>
+#include <linux/if_arp.h>
+#include <linux/termios.h> /* For TIOCOUTQ/INQ */
+#include <net/sock.h>
+#include <linux/lora/lorawan_netdev.h>
+
+#define LORAWAN_MODULE_NAME "lorawan"
+
+#define LRW_DBG_STR(fmt) LORAWAN_MODULE_NAME": "fmt
+#define lrw_info(fmt, ...) (pr_info(LRW_DBG_STR(fmt), ##__VA_ARGS__))
+#define lrw_dbg(fmt, ...) (pr_debug(LRW_DBG_STR(fmt), ##__VA_ARGS__))
+
+/**
+ * dgram_sock - This structure holds the states of Datagram socket
+ *
+ * @sk: network layer representation of the socket
+ * sk must be the first member of dgram_sock
+ * @src_devaddr: the LoRaWAN device address for this connection
+ * @bound: whether this socket is bound to a LoRaWAN device address
+ * @connected: whether connect() was called; nodes only send to the
+ * gateway, so no destination address is stored
+ */
+struct dgram_sock {
+ struct sock sk;
+ u32 src_devaddr;
+
+ u8 bound:1;
+ u8 connected:1;
+};
+
+static HLIST_HEAD(dgram_head);
+static DEFINE_RWLOCK(dgram_lock);
+
+static inline struct dgram_sock *
+dgram_sk(const struct sock *sk)
+{
+ return container_of(sk, struct dgram_sock, sk);
+}
+
+static inline struct net_device *
+lrw_get_dev_by_addr(struct net *net, u32 devaddr)
+{
+ struct net_device *ndev = NULL;
+ __be32 be_addr = cpu_to_be32(devaddr);
+
+ rcu_read_lock();
+ ndev = dev_getbyhwaddr_rcu(net, ARPHRD_LORAWAN, (char *)&be_addr);
+ if (ndev)
+ dev_hold(ndev);
+ rcu_read_unlock();
+
+ return ndev;
+}
+
+static int
+dgram_init(struct sock *sk)
+{
+ return 0;
+}
+
+static void
+dgram_close(struct sock *sk, long timeout)
+{
+ sk_common_release(sk);
+}
+
+static int
+dgram_bind(struct sock *sk, struct sockaddr *uaddr, int len)
+{
+ struct sockaddr_lorawan *addr = (struct sockaddr_lorawan *)uaddr;
+ struct dgram_sock *ro = dgram_sk(sk);
+ struct net_device *ndev;
+ int ret;
+
+ lock_sock(sk);
+ ro->bound = 0;
+
+ ret = -EINVAL;
+ if (len < sizeof(*addr))
+ goto dgram_bind_end;
+
+ if (addr->family != AF_LORAWAN)
+ goto dgram_bind_end;
+
+ if (addr->addr_in.addr_type != LRW_ADDR_DEVADDR)
+ goto dgram_bind_end;
+
+ lrw_dbg("%s: bind address %X\n", __func__, addr->addr_in.devaddr);
+ ndev = lrw_get_dev_by_addr(sock_net(sk), addr->addr_in.devaddr);
+ if (!ndev) {
+ ret = -ENODEV;
+ goto dgram_bind_end;
+ }
+ netdev_dbg(ndev, "%s: get ndev\n", __func__);
+
+ if (ndev->type != ARPHRD_LORAWAN) {
+ ret = -ENODEV;
+ dev_put(ndev);
+ goto dgram_bind_end;
+ }
+ ro->src_devaddr = addr->addr_in.devaddr;
+ ro->bound = 1;
+ ret = 0;
+ dev_put(ndev);
+ lrw_dbg("%s: bound address %X\n", __func__, ro->src_devaddr);
+
+dgram_bind_end:
+ release_sock(sk);
+ return ret;
+}
+
+static inline int
+lrw_dev_hard_header(struct sk_buff *skb, struct net_device *ndev,
+ const u32 src_devaddr, size_t len)
+{
+ /* TODO: Prepare the LoRaWAN sending header here */
+ return 0;
+}
+
+static int
+dgram_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+ struct dgram_sock *ro = dgram_sk(sk);
+ struct net_device *ndev;
+ struct sk_buff *skb;
+ size_t hlen;
+ size_t tlen;
+ int ret;
+
+ lrw_dbg("%s: going to send %zu bytes\n", __func__, size);
+ if (msg->msg_flags & MSG_OOB) {
+ lrw_dbg("msg->msg_flags = 0x%x\n", msg->msg_flags);
+ return -EOPNOTSUPP;
+ }
+
+ lrw_dbg("%s: check msg_name\n", __func__);
+ if (!ro->connected && !msg->msg_name)
+ return -EDESTADDRREQ;
+ else if (ro->connected && msg->msg_name)
+ return -EISCONN;
+
+ lrw_dbg("%s: check bound\n", __func__);
+ if (!ro->bound)
+ ndev = dev_getfirstbyhwtype(sock_net(sk), ARPHRD_LORAWAN);
+ else
+ ndev = lrw_get_dev_by_addr(sock_net(sk), ro->src_devaddr);
+
+ if (!ndev) {
+ lrw_dbg("no dev\n");
+ ret = -ENXIO;
+ goto dgram_sendmsg_end;
+ }
+
+ if (size > ndev->mtu) {
+ netdev_dbg(ndev, "size = %zu, mtu = %u\n", size, ndev->mtu);
+ ret = -EMSGSIZE;
+ goto dgram_sendmsg_no_skb;
+ }
+
+ netdev_dbg(ndev, "%s: create skb\n", __func__);
+ hlen = LL_RESERVED_SPACE(ndev);
+ tlen = ndev->needed_tailroom;
+ skb = sock_alloc_send_skb(sk, hlen + tlen + size,
+ msg->msg_flags & MSG_DONTWAIT,
+ &ret);
+
+ if (!skb)
+ goto dgram_sendmsg_no_skb;
+
+ skb_reserve(skb, hlen);
+ skb_reset_network_header(skb);
+
+ ret = lrw_dev_hard_header(skb, ndev, 0, size);
+ if (ret < 0)
+ goto dgram_sendmsg_err_skb;
+
+ ret = memcpy_from_msg(skb_put(skb, size), msg, size);
+ if (ret < 0)
+ goto dgram_sendmsg_err_skb;
+
+ skb->dev = ndev;
+ skb->protocol = htons(ETH_P_LORAWAN);
+
+ netdev_dbg(ndev, "%s: push skb to xmit queue\n", __func__);
+ ret = dev_queue_xmit(skb);
+ if (ret > 0)
+ ret = net_xmit_errno(ret);
+ netdev_dbg(ndev, "%s: pushed skb to xmit queue with ret=%d\n",
+ __func__, ret);
+ dev_put(ndev);
+
+ return ret ?: size;
+
+dgram_sendmsg_err_skb:
+ kfree_skb(skb);
+dgram_sendmsg_no_skb:
+ dev_put(ndev);
+
+dgram_sendmsg_end:
+ return ret;
+}
+
+static int
+dgram_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+ int noblock, int flags, int *addr_len)
+{
+ struct sk_buff *skb;
+ size_t copied = 0;
+ DECLARE_SOCKADDR(struct sockaddr_lorawan *, saddr, msg->msg_name);
+ int err;
+
+ skb = skb_recv_datagram(sk, flags, noblock, &err);
+ if (!skb)
+ goto dgram_recvmsg_end;
+
+ copied = skb->len;
+ if (len < copied) {
+ msg->msg_flags |= MSG_TRUNC;
+ copied = len;
+ }
+
+ err = skb_copy_datagram_msg(skb, 0, msg, copied);
+ if (err)
+ goto dgram_recvmsg_done;
+
+ sock_recv_ts_and_drops(msg, sk, skb);
+ if (saddr) {
+ memset(saddr, 0, sizeof(*saddr));
+ saddr->family = AF_LORAWAN;
+ saddr->addr_in.devaddr = mac_cb(skb)->devaddr;
+ *addr_len = sizeof(*saddr);
+ }
+
+ if (flags & MSG_TRUNC)
+ copied = skb->len;
+
+dgram_recvmsg_done:
+ skb_free_datagram(sk, skb);
+
+dgram_recvmsg_end:
+ if (err)
+ return err;
+ return copied;
+}
+
+static int
+dgram_hash(struct sock *sk)
+{
+ lrw_dbg("%s\n", __func__);
+ write_lock_bh(&dgram_lock);
+ sk_add_node(sk, &dgram_head);
+ sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+ write_unlock_bh(&dgram_lock);
+
+ return 0;
+}
+
+static void
+dgram_unhash(struct sock *sk)
+{
+ lrw_dbg("%s\n", __func__);
+ write_lock_bh(&dgram_lock);
+ if (sk_del_node_init(sk))
+ sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+ write_unlock_bh(&dgram_lock);
+}
+
+static int
+dgram_connect(struct sock *sk, struct sockaddr *uaddr, int len)
+{
+ struct dgram_sock *ro = dgram_sk(sk);
+
+ /* Nodes of LoRaWAN send data to a gateway only, then data is received
+ * and transferred to servers with the gateway's policy.
+ * So, the destination address is not used by nodes.
+ */
+ lock_sock(sk);
+ ro->connected = 1;
+ release_sock(sk);
+
+ return 0;
+}
+
+static int
+dgram_disconnect(struct sock *sk, int flags)
+{
+ struct dgram_sock *ro = dgram_sk(sk);
+
+ lock_sock(sk);
+ ro->connected = 0;
+ release_sock(sk);
+
+ return 0;
+}
+
+static int
+dgram_ioctl(struct sock *sk, int cmd, unsigned long arg)
+{
+ struct sk_buff *skb;
+ int amount;
+ int err;
+
+ lrw_dbg("%s: ioctl file (cmd=0x%X)\n",
+ __func__, cmd);
+ switch (cmd) {
+ case SIOCOUTQ:
+ amount = sk_wmem_alloc_get(sk);
+ err = put_user(amount, (int __user *)arg);
+ break;
+ case SIOCINQ:
+ amount = 0;
+ spin_lock_bh(&sk->sk_receive_queue.lock);
+ skb = skb_peek(&sk->sk_receive_queue);
+ if (skb) {
+ /* We will only return the amount of this packet
+ * since that is all that will be read.
+ */
+ amount = skb->len;
+ }
+ spin_unlock_bh(&sk->sk_receive_queue.lock);
+ err = put_user(amount, (int __user *)arg);
+ break;
+ default:
+ err = -ENOIOCTLCMD;
+ }
+
+ return err;
+}
+
+static int
+dgram_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen)
+{
+ int val, len;
+
+ if (level != SOL_LORAWAN)
+ return -EOPNOTSUPP;
+
+ if (get_user(len, optlen))
+ return -EFAULT;
+
+ len = min_t(unsigned int, len, sizeof(int));
+
+ switch (optname) {
+ default:
+ return -ENOPROTOOPT;
+ }
+
+ if (put_user(len, optlen))
+ return -EFAULT;
+
+ if (copy_to_user(optval, &val, len))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int
+dgram_setsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, unsigned int optlen)
+{
+ int val;
+ int err = 0;
+
+ if (optlen < sizeof(int))
+ return -EINVAL;
+
+ if (get_user(val, (int __user *)optval))
+ return -EFAULT;
+
+ lock_sock(sk);
+
+ switch (optname) {
+ default:
+ err = -ENOPROTOOPT;
+ break;
+ }
+
+ release_sock(sk);
+
+ return err;
+}
+
+static struct proto lrw_dgram_prot = {
+ .name = "LoRaWAN",
+ .owner = THIS_MODULE,
+ .obj_size = sizeof(struct dgram_sock),
+ .init = dgram_init,
+ .close = dgram_close,
+ .bind = dgram_bind,
+ .sendmsg = dgram_sendmsg,
+ .recvmsg = dgram_recvmsg,
+ .hash = dgram_hash,
+ .unhash = dgram_unhash,
+ .connect = dgram_connect,
+ .disconnect = dgram_disconnect,
+ .ioctl = dgram_ioctl,
+ .getsockopt = dgram_getsockopt,
+ .setsockopt = dgram_setsockopt,
+};
+
+static int
+lrw_sock_release(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+
+ if (sk) {
+ sock->sk = NULL;
+ sk->sk_prot->close(sk, 0);
+ }
+
+ return 0;
+}
+
+static int
+lrw_sock_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
+{
+ struct sock *sk = sock->sk;
+ struct sockaddr_lorawan *addr = (struct sockaddr_lorawan *)uaddr;
+
+ lrw_dbg("%s: bind address %X\n", __func__, addr->addr_in.devaddr);
+ if (sk->sk_prot->bind)
+ return sk->sk_prot->bind(sk, uaddr, addr_len);
+
+ return sock_no_bind(sock, uaddr, addr_len);
+}
+
+static int
+lrw_sock_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags)
+{
+ struct sock *sk = sock->sk;
+
+ if (addr_len < sizeof(uaddr->sa_family))
+ return -EINVAL;
+
+ return sk->sk_prot->connect(sk, uaddr, addr_len);
+}
+
+static int
+lrw_ndev_ioctl(struct sock *sk, struct ifreq __user *arg, unsigned int cmd)
+{
+ struct ifreq ifr;
+ int ret = -ENOIOCTLCMD;
+ struct net_device *ndev;
+
+ lrw_dbg("%s: cmd %u\n", __func__, cmd);
+ if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
+ return -EFAULT;
+
+ ifr.ifr_name[IFNAMSIZ-1] = 0;
+
+ dev_load(sock_net(sk), ifr.ifr_name);
+ ndev = dev_get_by_name(sock_net(sk), ifr.ifr_name);
+
+ if (!ndev)
+ return -ENODEV;
+ netdev_dbg(ndev, "%s: cmd %u\n", __func__, cmd);
+
+ if (ndev->type == ARPHRD_LORAWAN && ndev->netdev_ops->ndo_do_ioctl)
+ ret = ndev->netdev_ops->ndo_do_ioctl(ndev, &ifr, cmd);
+
+ if (!ret && copy_to_user(arg, &ifr, sizeof(struct ifreq)))
+ ret = -EFAULT;
+ dev_put(ndev);
+
+ return ret;
+}
+
+static int
+lrw_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
+{
+ struct sock *sk = sock->sk;
+
+ lrw_dbg("%s: cmd %u\n", __func__, cmd);
+ switch (cmd) {
+ case SIOCGSTAMP:
+ return sock_get_timestamp(sk, (struct timeval __user *)arg);
+ case SIOCGSTAMPNS:
+ return sock_get_timestampns(sk, (struct timespec __user *)arg);
+ case SIOCOUTQ:
+ case SIOCINQ:
+ if (!sk->sk_prot->ioctl)
+ return -ENOIOCTLCMD;
+ return sk->sk_prot->ioctl(sk, cmd, arg);
+ default:
+ return lrw_ndev_ioctl(sk, (struct ifreq __user *)arg, cmd);
+ }
+}
+
+static int
+lrw_sock_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
+{
+ struct sock *sk = sock->sk;
+
+ lrw_dbg("%s: going to send %zu bytes\n", __func__, len);
+ return sk->sk_prot->sendmsg(sk, msg, len);
+}
+
+static const struct proto_ops lrw_dgram_ops = {
+ .family = PF_LORAWAN,
+ .owner = THIS_MODULE,
+ .release = lrw_sock_release,
+ .bind = lrw_sock_bind,
+ .connect = lrw_sock_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = sock_no_accept,
+ .getname = sock_no_getname,
+ .poll = datagram_poll,
+ .ioctl = lrw_sock_ioctl,
+ .listen = sock_no_listen,
+ .shutdown = sock_no_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = lrw_sock_sendmsg,
+ .recvmsg = sock_common_recvmsg,
+ .mmap = sock_no_mmap,
+ .sendpage = sock_no_sendpage,
+};
+
+static int
+lorawan_creat(struct net *net, struct socket *sock, int protocol, int kern)
+{
+ struct sock *sk;
+ int ret = 0;
+
+ if (!net_eq(net, &init_net))
+ return -EAFNOSUPPORT;
+
+ if (sock->type != SOCK_DGRAM)
+ return -EAFNOSUPPORT;
+
+ /* Allocates enough memory for dgram_sock whose first member is sk */
+ sk = sk_alloc(net, PF_LORAWAN, GFP_KERNEL, &lrw_dgram_prot, kern);
+ if (!sk)
+ return -ENOMEM;
+
+ sock->ops = &lrw_dgram_ops;
+ sock_init_data(sock, sk);
+ sk->sk_family = PF_LORAWAN;
+ sock_set_flag(sk, SOCK_ZAPPED);
+
+ if (sk->sk_prot->hash) {
+ ret = sk->sk_prot->hash(sk);
+ if (ret) {
+ sk_common_release(sk);
+ goto lorawan_creat_end;
+ }
+ }
+
+ if (sk->sk_prot->init) {
+ ret = sk->sk_prot->init(sk);
+ if (ret)
+ sk_common_release(sk);
+ }
+
+lorawan_creat_end:
+ return ret;
+}
+
+static const struct net_proto_family lorawan_family_ops = {
+ .owner = THIS_MODULE,
+ .family = PF_LORAWAN,
+ .create = lorawan_creat,
+};
+
+static inline int
+lrw_dgram_deliver(struct net_device *ndev, struct sk_buff *skb)
+{
+ struct sock *sk;
+ struct dgram_sock *ro;
+ bool found = false;
+ int ret = NET_RX_SUCCESS;
+
+ read_lock(&dgram_lock);
+ sk_for_each(sk, &dgram_head) {
+ ro = dgram_sk(sk);
+ if (cpu_to_be32(ro->src_devaddr) == *(__be32 *)ndev->dev_addr) {
+ found = true;
+ break;
+ }
+ }
+ read_unlock(&dgram_lock);
+
+ if (!found)
+ goto lrw_dgram_deliver_err;
+
+ skb = skb_share_check(skb, GFP_ATOMIC);
+ if (!skb)
+ return NET_RX_DROP;
+
+ if (sock_queue_rcv_skb(sk, skb) < 0)
+ goto lrw_dgram_deliver_err;
+
+ return ret;
+
+lrw_dgram_deliver_err:
+ kfree_skb(skb);
+ ret = NET_RX_DROP;
+ return ret;
+}
+
+static int
+lorawan_rcv(struct sk_buff *skb, struct net_device *ndev,
+ struct packet_type *pt, struct net_device *orig_ndev)
+{
+ if (!netif_running(ndev))
+ goto lorawan_rcv_drop;
+
+ if (!net_eq(dev_net(ndev), &init_net))
+ goto lorawan_rcv_drop;
+
+ if (ndev->type != ARPHRD_LORAWAN)
+ goto lorawan_rcv_drop;
+
+ if (skb->pkt_type != PACKET_OTHERHOST)
+ return lrw_dgram_deliver(ndev, skb);
+
+lorawan_rcv_drop:
+ kfree_skb(skb);
+ return NET_RX_DROP;
+}
+
+static struct packet_type lorawan_packet_type = {
+ .type = htons(ETH_P_LORAWAN),
+ .func = lorawan_rcv,
+};
+
+static int __init
+lrw_sock_init(void)
+{
+ int ret;
+
+ lrw_info("module inserted\n");
+ ret = proto_register(&lrw_dgram_prot, 1);
+ if (ret)
+ goto lrw_sock_init_end;
+
+ /* Tell SOCKET that we are alive */
+ ret = sock_register(&lorawan_family_ops);
+ if (ret)
+ goto lrw_sock_init_err;
+
+ dev_add_pack(&lorawan_packet_type);
+ ret = 0;
+ goto lrw_sock_init_end;
+
+lrw_sock_init_err:
+ proto_unregister(&lrw_dgram_prot);
+
+lrw_sock_init_end:
+ return ret;
+}
+
+static void __exit
+lrw_sock_exit(void)
+{
+ dev_remove_pack(&lorawan_packet_type);
+ sock_unregister(PF_LORAWAN);
+ proto_unregister(&lrw_dgram_prot);
+ lrw_info("module removed\n");
+}
+
+module_init(lrw_sock_init);
+module_exit(lrw_sock_exit);
+
+MODULE_AUTHOR("Jian-Hong Pan, <starnight@g.ncu.edu.tw>");
+MODULE_DESCRIPTION("LoRaWAN socket kernel module");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_ALIAS_NETPROTO(PF_LORAWAN);
--
2.19.1
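lrw_get_dev_by_addr() looks a device up by comparing dev_addr with cpu_to_be32(devaddr), so the receive path must compare using the same byte order. The big-endian and little-endian encodings of a DevAddr differ for any non-palindromic address. The following userspace sketch shows the difference. It assumes dev_addr holds the big-endian form that the bind-time lookup expects; the maclorawan patch that fills dev_addr is not shown here. All helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Big-endian (network order) bytes, matching cpu_to_be32() in memory. */
static void devaddr_to_be_bytes(uint32_t devaddr, uint8_t out[4])
{
	out[0] = (devaddr >> 24) & 0xff;
	out[1] = (devaddr >> 16) & 0xff;
	out[2] = (devaddr >> 8) & 0xff;
	out[3] = devaddr & 0xff;
}

/* Little-endian bytes, matching cpu_to_le32() in memory. */
static void devaddr_to_le_bytes(uint32_t devaddr, uint8_t out[4])
{
	out[0] = devaddr & 0xff;
	out[1] = (devaddr >> 8) & 0xff;
	out[2] = (devaddr >> 16) & 0xff;
	out[3] = (devaddr >> 24) & 0xff;
}

/* Does a socket bound to devaddr own the device whose dev_addr is hw?
 * Uses the same encoding as the bind-time lookup (big-endian). */
static int devaddr_matches(uint32_t devaddr, const uint8_t hw[4])
{
	uint8_t be[4];

	devaddr_to_be_bytes(devaddr, be);
	return memcmp(be, hw, 4) == 0;
}
```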
* [PATCH V2 5/7] net: maclorawan: Implement the crypto of maclorawan module
From: Jian-Hong Pan @ 2018-11-05 16:55 UTC
To: Andreas Färber
Cc: netdev, Marcel Holtmann, linux-kernel, Stefan Schmidt,
Jian-Hong Pan, Dollar Chen, Ken Yu, linux-wpan, David S . Miller,
linux-arm-kernel
In-Reply-To: <fc737f3940bbe91341fb15d85ac11931eb56d1fc.1535039998.git.starnight@g.ncu.edu.tw>
Implement the crypto for encryption/decryption and message
integrity code (MIC) according to LoRaWAN(TM) Specification Ver. 1.0.2.
Signed-off-by: Jian-Hong Pan <starnight@g.ncu.edu.tw>
---
V2:
- Split the LoRaWAN class module patch in V1 into LoRaWAN socket and
LoRaWAN Soft MAC modules
- Rename the lrwsec files to crypto files
- Modify for Big/Little-Endian
- Use SPDX license identifiers
net/maclorawan/crypto.c | 209 ++++++++++++++++++++++++++++++++++++++++
net/maclorawan/crypto.h | 27 ++++++
2 files changed, 236 insertions(+)
create mode 100644 net/maclorawan/crypto.c
create mode 100644 net/maclorawan/crypto.h
diff --git a/net/maclorawan/crypto.c b/net/maclorawan/crypto.c
new file mode 100644
index 000000000000..a839fd074ad8
--- /dev/null
+++ b/net/maclorawan/crypto.c
@@ -0,0 +1,209 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later OR BSD-3-Clause */
+/*-
+ * LoRaWAN soft MAC
+ *
+ * Copyright (c) 2018 Jian-Hong, Pan <starnight@g.ncu.edu.tw>
+ *
+ */
+
+#include <linux/scatterlist.h>
+#include <crypto/hash.h>
+#include <crypto/skcipher.h>
+#include "crypto.h"
+
+struct crypto_shash *
+lrw_mic_key_setup(u8 *k, size_t k_len)
+{
+ char *algo = "cmac(aes)";
+ struct crypto_shash *tfm;
+ int err;
+
+ tfm = crypto_alloc_shash(algo, 0, 0);
+ if (!IS_ERR(tfm)) {
+ err = crypto_shash_setkey(tfm, k, k_len);
+ if (err) {
+ crypto_free_shash(tfm);
+ tfm = NULL;
+ }
+ }
+
+ return tfm;
+}
+
+int
+lrw_aes_cmac(struct crypto_shash *tfm, u8 *bz, u8 *data, size_t len, u8 *out)
+{
+ SHASH_DESC_ON_STACK(desc, tfm);
+ int err;
+
+ desc->tfm = tfm;
+
+ err = crypto_shash_init(desc);
+ if (err)
+ goto lrw_aes_cmac_end;
+
+ err = crypto_shash_update(desc, bz, 16);
+ if (err)
+ goto lrw_aes_cmac_end;
+
+ err = crypto_shash_update(desc, data, len);
+ if (err)
+ goto lrw_aes_cmac_end;
+
+ err = crypto_shash_final(desc, out);
+
+lrw_aes_cmac_end:
+ return err;
+}
+
+int
+lrw_set_bzero(u8 dir, u32 devaddr, u32 fcnt, u8 len, u8 *bz)
+{
+ __le32 le_devaddr = cpu_to_le32(devaddr);
+ __le32 _fcnt = cpu_to_le32(fcnt);
+
+ bz[0] = 0x49;
+ memset(bz + 1, 0x00, 4);
+ bz[5] = dir;
+ memcpy(bz + 6, &le_devaddr, 4);
+ memcpy(bz + 10, &_fcnt, 4);
+ bz[14] = 0x00;
+ bz[15] = len;
+
+ return 0;
+}
+
+int
+lrw_calc_mic(struct crypto_shash *tfm,
+ u8 dir, u32 devaddr, u32 fcnt, u8 *buf, size_t len, u8 *mic4)
+{
+ u8 mic[16];
+ u8 bz[16];
+ int err;
+
+ /* According to LoRaWAN Specification Version 1.0.2
+ * - 4.4 Message Integrity Code (MIC) */
+ lrw_set_bzero(dir, devaddr, fcnt, len, bz);
+ err = lrw_aes_cmac(tfm, bz, buf, len, mic);
+ if (!err)
+ memcpy(mic4, mic, 4);
+
+ return err;
+}
+
+void
+lrw_mic_key_free(struct crypto_shash *tfm)
+{
+ crypto_free_shash(tfm);
+}
+
+struct crypto_skcipher *
+lrw_aes_enc_key_setup(char *algo, u8 *k, size_t k_len)
+{
+ struct crypto_skcipher *tfm;
+ int err;
+
+ tfm = crypto_alloc_skcipher(algo, 0, CRYPTO_ALG_ASYNC);
+ if (!IS_ERR(tfm)) {
+ err = crypto_skcipher_setkey(tfm, k, k_len);
+ if (err) {
+ crypto_free_skcipher(tfm);
+ tfm = NULL;
+ }
+ }
+
+ return tfm;
+}
+
+struct crypto_skcipher *
+lrw_encrypt_key_setup(u8 *k, size_t k_len)
+{
+ return lrw_aes_enc_key_setup("cbc(aes)", k, k_len);
+}
+
+int
+lrw_aes_enc(struct crypto_skcipher *tfm, u8 *in, size_t len, u8 *out)
+{
+ u8 iv[16];
+ struct scatterlist src, dst;
+ SKCIPHER_REQUEST_ON_STACK(req, tfm);
+ int err;
+
+ memset(iv, 0, 16);
+ /* Buffers passed to sg_init_one() must be in the linear mapping: not
+ * global/const data, and not on a CONFIG_VMAP_STACK stack */
+ sg_init_one(&src, in, len);
+ sg_init_one(&dst, out, len);
+
+ skcipher_request_set_tfm(req, tfm);
+ skcipher_request_set_callback(req, 0, NULL, NULL);
+ skcipher_request_set_crypt(req, &src, &dst, len, iv);
+ err = crypto_skcipher_encrypt(req);
+ skcipher_request_zero(req);
+
+ return err;
+}
+
+#define LRW_SEQUENCE_OF_BLOCK_LEN 16
+
+int
+lrw_set_sob(u8 dir, u32 devaddr, u32 fcnt, u8 index, u8 *sob)
+{
+ __le32 le_devaddr = cpu_to_le32(devaddr);
+ __le32 _fcnt = cpu_to_le32(fcnt);
+
+ sob[0] = 0x01;
+ memset(sob + 1, 0x00, 4);
+ sob[5] = dir;
+ memcpy(sob + 6, &le_devaddr, 4);
+ memcpy(sob + 10, &_fcnt, 4);
+ sob[14] = 0x00;
+ sob[15] = index;
+
+ return 0;
+}
+
+int
+lrw_encrypt_sob(struct crypto_skcipher *tfm, u8 *sob)
+{
+ return lrw_aes_enc(tfm, sob, LRW_SEQUENCE_OF_BLOCK_LEN, sob);
+}
+
+int
+lrw_encrypt_buf(struct crypto_skcipher *tfm,
+ u8 dir, u32 devaddr, u32 fcnt, u8 *buf, size_t len)
+{
+ u8 *sob = kmalloc(LRW_SEQUENCE_OF_BLOCK_LEN, GFP_ATOMIC);
+ int err = sob ? 0 : -ENOMEM;
+ size_t i, j;
+
+ /* LoRaWAN Spec. 1.0.2 - 4.3.3: XOR with E(A_1) | E(A_2) | ... */
+ for (i = 0; !err && i < len; i += LRW_SEQUENCE_OF_BLOCK_LEN) {
+ lrw_set_sob(dir, devaddr, fcnt, i / LRW_SEQUENCE_OF_BLOCK_LEN + 1, sob);
+ err = lrw_encrypt_sob(tfm, sob);
+ for (j = 0; !err && j < LRW_SEQUENCE_OF_BLOCK_LEN && i + j < len; j++)
+ buf[i + j] ^= sob[j];
+ }
+ kfree(sob);
+ return err;
+}
+
+int
+lrw_decrypt_buf(struct crypto_skcipher *tfm,
+ u8 dir, u32 devaddr, u32 fcnt, u8 *buf, size_t len)
+{
+ /* XORing with the same keystream twice restores the input, so decryption reuses encryption */
+ return lrw_encrypt_buf(tfm, dir, devaddr, fcnt, buf, len);
+}
+
+void
+lrw_aes_enc_key_free(struct crypto_skcipher *tfm)
+{
+ crypto_free_skcipher(tfm);
+}
+
+void
+lrw_encrypt_key_free(struct crypto_skcipher *tfm)
+{
+ lrw_aes_enc_key_free(tfm);
+}
diff --git a/net/maclorawan/crypto.h b/net/maclorawan/crypto.h
new file mode 100644
index 000000000000..2ede02efb8c6
--- /dev/null
+++ b/net/maclorawan/crypto.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later OR BSD-3-Clause */
+/*-
+ * LoRaWAN soft MAC
+ *
+ * Copyright (c) 2018 Jian-Hong, Pan <starnight@g.ncu.edu.tw>
+ *
+ */
+
+#ifndef __LORAWAN_CRYPTO_H__
+#define __LORAWAN_CRYPTO_H__
+
+#include <crypto/hash.h>
+#include <crypto/skcipher.h>
+
+struct crypto_shash *lrw_mic_key_setup(u8 *k, size_t k_len);
+int lrw_calc_mic(struct crypto_shash *tfm,
+ u8 dir, u32 devaddr, u32 fcnt, u8 *buf, size_t len, u8 *mic4);
+void lrw_mic_key_free(struct crypto_shash *tfm);
+
+struct crypto_skcipher *lrw_encrypt_key_setup(u8 *k, size_t k_len);
+int lrw_encrypt_buf(struct crypto_skcipher *tfm,
+ u8 dir, u32 devaddr, u32 fcnt, u8 *buf, size_t len);
+int lrw_decrypt_buf(struct crypto_skcipher *tfm,
+ u8 dir, u32 devaddr, u32 fcnt, u8 *buf, size_t len);
+void lrw_encrypt_key_free(struct crypto_skcipher *tfm);
+
+#endif
--
2.19.1
^ permalink raw reply related
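A userspace sketch of what lrw_set_sob() and lrw_encrypt_buf() compute. Each 16-byte block A_i (0x01, four zero bytes, Dir, DevAddr and FCnt little-endian, 0x00, i) is encrypted to form a keystream block S_i, which is XORed into the payload. Because XOR with the same keystream is self-inverse, one routine both encrypts and decrypts. Running "cbc(aes)" with an all-zero IV over a single block, as lrw_aes_enc() does, is equivalent to one raw AES block encryption. Assumptions: toy_block_encrypt() is a hypothetical placeholder, not AES and not secure, so that the sketch runs without a crypto library; the block index starts at 1, following the i = 1..k numbering in LoRaWAN 1.0.2 section 4.3.3.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LRW_BLOCK_LEN 16

/* HYPOTHETICAL stand-in for one AES-128 block encryption E(K, in).
 * It is NOT a cipher; it only lets this sketch run without a crypto
 * library. Substitute real AES-128 in any actual use. */
static void toy_block_encrypt(const uint8_t key[LRW_BLOCK_LEN],
			      const uint8_t in[LRW_BLOCK_LEN],
			      uint8_t out[LRW_BLOCK_LEN])
{
	for (size_t n = 0; n < LRW_BLOCK_LEN; n++)
		out[n] = (uint8_t)((in[n] ^ key[n]) * 7u + n);
}

/* A_i = 0x01 | 4 x 0x00 | Dir | DevAddr (LE) | FCnt (LE) | 0x00 | i */
static void build_a_block(uint8_t dir, uint32_t devaddr, uint32_t fcnt,
			  uint8_t i, uint8_t a[LRW_BLOCK_LEN])
{
	memset(a, 0, LRW_BLOCK_LEN);
	a[0] = 0x01;
	a[5] = dir;
	for (int b = 0; b < 4; b++) {
		a[6 + b] = (uint8_t)(devaddr >> (8 * b));
		a[10 + b] = (uint8_t)(fcnt >> (8 * b));
	}
	a[15] = i;
}

/* XOR buf with S_1, S_2, ... where S_i = E(K, A_i). Calling it twice
 * with the same parameters restores the original bytes. */
static void frm_payload_crypt(const uint8_t key[LRW_BLOCK_LEN], uint8_t dir,
			      uint32_t devaddr, uint32_t fcnt,
			      uint8_t *buf, size_t len)
{
	uint8_t a[LRW_BLOCK_LEN], s[LRW_BLOCK_LEN];
	size_t off;
	unsigned int i;

	for (off = 0, i = 1; off < len; off += LRW_BLOCK_LEN, i++) {
		build_a_block(dir, devaddr, fcnt, (uint8_t)i, a);
		toy_block_encrypt(key, a, s);	/* S_i = E(K, A_i) */
		for (size_t j = 0; j < LRW_BLOCK_LEN && off + j < len; j++)
			buf[off + j] ^= s[j];	/* last block may be partial */
	}
}
```

Call frm_payload_crypt() once to encrypt and again with identical Dir, DevAddr and FCnt to decrypt; any DevAddr/FCnt values used with it here are arbitrary illustrations.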
* [PATCH V2 7/7] net: lorawan: List LORAWAN in menuconfig
From: Jian-Hong Pan @ 2018-11-05 16:55 UTC (permalink / raw)
To: Andreas Färber
Cc: netdev, linux-arm-kernel, linux-kernel, Marcel Holtmann,
David S . Miller, Dollar Chen, Ken Yu, linux-wpan, Stefan Schmidt,
Jian-Hong Pan
In-Reply-To: <fc737f3940bbe91341fb15d85ac11931eb56d1fc.1535039998.git.starnight@g.ncu.edu.tw>
List LORAWAN and MACLORAWAN in menuconfig and enable them to be built.
Signed-off-by: Jian-Hong Pan <starnight@g.ncu.edu.tw>
---
V2:
- Split the LoRaWAN class module patch in V1 into LoRaWAN socket and
LoRaWAN Soft MAC modules
net/Kconfig | 2 ++
net/Makefile | 2 ++
2 files changed, 4 insertions(+)
diff --git a/net/Kconfig b/net/Kconfig
index 053b36998c18..b12b8bed6abb 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -224,6 +224,8 @@ source "net/6lowpan/Kconfig"
source "net/ieee802154/Kconfig"
source "net/mac802154/Kconfig"
source "net/lora/Kconfig"
+source "net/lorawan/Kconfig"
+source "net/maclorawan/Kconfig"
source "net/sched/Kconfig"
source "net/dcb/Kconfig"
source "net/dns_resolver/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index e80b84313851..9d5515965a8f 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -63,6 +63,8 @@ obj-$(CONFIG_6LOWPAN) += 6lowpan/
obj-$(CONFIG_IEEE802154) += ieee802154/
obj-$(CONFIG_MAC802154) += mac802154/
obj-$(CONFIG_LORA) += lora/
+obj-$(CONFIG_LORAWAN) += lorawan/
+obj-$(CONFIG_MACLORAWAN) += maclorawan/
ifeq ($(CONFIG_NET),y)
obj-$(CONFIG_SYSCTL) += sysctl_net.o
--
2.19.1
^ permalink raw reply related
* [PATCH 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: jiangyiwen @ 2018-11-05 7:43 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
Currently vsock only supports sending/receiving small packets, so it can't
achieve high performance. As previously discussed with Jason Wang, I
revisited the vhost-net idea of mergeable rx buffers and implemented
mergeable rx buffers in vhost-vsock. This allows a big packet to be
scattered across multiple buffers and improves performance significantly.
I wrote a tool to test vhost-vsock performance, mainly sending big
packets (64K) in both directions, guest->host and host->guest. The
results are as follows:
Before:
               Single socket    Multiple sockets (max bandwidth)
Guest->Host    ~400MB/s         ~480MB/s
Host->Guest    ~1450MB/s        ~1600MB/s

After:
               Single socket    Multiple sockets (max bandwidth)
Guest->Host    ~1700MB/s        ~2900MB/s
Host->Guest    ~1700MB/s        ~2900MB/s
From the test results, performance improves significantly, and guest
memory is not wasted.
---
Yiwen Jiang (5):
VSOCK: support fill mergeable rx buffer in guest
VSOCK: support fill data to mergeable rx buffer in host
VSOCK: support receive mergeable rx buffer in guest
VSOCK: modify default rx buf size to improve performance
VSOCK: batch sending rx buffer to increase bandwidth
drivers/vhost/vsock.c | 135 +++++++++++++++++++++++------
include/linux/virtio_vsock.h | 15 +++-
include/uapi/linux/virtio_vsock.h | 5 ++
net/vmw_vsock/virtio_transport.c | 147 ++++++++++++++++++++++++++------
net/vmw_vsock/virtio_transport_common.c | 59 +++++++++++--
5 files changed, 300 insertions(+), 61 deletions(-)
--
1.8.3.1
^ permalink raw reply
* [PATCH 1/5] VSOCK: support fill mergeable rx buffer in guest
From: jiangyiwen @ 2018-11-05 7:45 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
During driver probing, if the virtio device has the VIRTIO_VSOCK_F_MRG_RXBUF
feature, the guest fills the rx virtqueue with mergeable rx buffers so
that the host can send packets spread across them. Each buffer is one
page, which works for both small and big packets.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
include/linux/virtio_vsock.h | 3 ++
net/vmw_vsock/virtio_transport.c | 72 +++++++++++++++++++++++++++++-----------
2 files changed, 56 insertions(+), 19 deletions(-)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index e223e26..bf84418 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -14,6 +14,9 @@
#define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
+/* Virtio-vsock feature */
+#define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
+
enum {
VSOCK_VQ_RX = 0, /* for host to guest data */
VSOCK_VQ_TX = 1, /* for guest to host data */
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 5d3cce9..2040a9e 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -64,6 +64,7 @@ struct virtio_vsock {
struct virtio_vsock_event event_list[8];
u32 guest_cid;
+ bool mergeable;
};
static struct virtio_vsock *virtio_vsock_get(void)
@@ -256,6 +257,25 @@ static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
return 0;
}
+static int fill_mergeable_rx_buff(struct virtqueue *vq)
+{
+ void *page = NULL;
+ struct scatterlist sg;
+ int err;
+
+ page = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!page)
+ return -ENOMEM;
+
+ sg_init_one(&sg, page, PAGE_SIZE);
+
+ err = virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
+ if (err < 0)
+ free_page((unsigned long) page);
+
+ return err;
+}
+
static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
{
int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
@@ -267,27 +287,33 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
vq = vsock->vqs[VSOCK_VQ_RX];
do {
- pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
- if (!pkt)
- break;
+ if (vsock->mergeable) {
+ ret = fill_mergeable_rx_buff(vq);
+ if (ret)
+ break;
+ } else {
+ pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+ if (!pkt)
+ break;
- pkt->buf = kmalloc(buf_len, GFP_KERNEL);
- if (!pkt->buf) {
- virtio_transport_free_pkt(pkt);
- break;
- }
+ pkt->buf = kmalloc(buf_len, GFP_KERNEL);
+ if (!pkt->buf) {
+ virtio_transport_free_pkt(pkt);
+ break;
+ }
- pkt->len = buf_len;
+ pkt->len = buf_len;
- sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
- sgs[0] = &hdr;
+ sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
+ sgs[0] = &hdr;
- sg_init_one(&buf, pkt->buf, buf_len);
- sgs[1] = &buf;
- ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
- if (ret) {
- virtio_transport_free_pkt(pkt);
- break;
+ sg_init_one(&buf, pkt->buf, buf_len);
+ sgs[1] = &buf;
+ ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
+ if (ret) {
+ virtio_transport_free_pkt(pkt);
+ break;
+ }
}
vsock->rx_buf_nr++;
} while (vq->num_free);
@@ -588,6 +614,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
if (ret < 0)
goto out_vqs;
+ if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_MRG_RXBUF))
+ vsock->mergeable = true;
+
vsock->rx_buf_nr = 0;
vsock->rx_buf_max_nr = 0;
atomic_set(&vsock->queued_replies, 0);
@@ -640,8 +669,12 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
vdev->config->reset(vdev);
mutex_lock(&vsock->rx_lock);
- while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
- virtio_transport_free_pkt(pkt);
+ while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX]))) {
+ if (vsock->mergeable)
+ free_page((unsigned long)(void *)pkt);
+ else
+ virtio_transport_free_pkt(pkt);
+ }
mutex_unlock(&vsock->rx_lock);
mutex_lock(&vsock->tx_lock);
@@ -683,6 +716,7 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
};
static unsigned int features[] = {
+ VIRTIO_VSOCK_F_MRG_RXBUF,
};
static struct virtio_driver virtio_vsock_driver = {
--
1.8.3.1
^ permalink raw reply related
* [PATCH 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-11-05 7:45 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
When vhost supports the VIRTIO_VSOCK_F_MRG_RXBUF feature, it scatters
a big packet across multiple mergeable buffers in the rx vq.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
drivers/vhost/vsock.c | 117 +++++++++++++++++++++++++++++++-------
include/linux/virtio_vsock.h | 1 +
include/uapi/linux/virtio_vsock.h | 5 ++
3 files changed, 102 insertions(+), 21 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 34bc3ab..648be39 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -22,7 +22,8 @@
#define VHOST_VSOCK_DEFAULT_HOST_CID 2
enum {
- VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+ VHOST_VSOCK_FEATURES = VHOST_FEATURES |
+ (1ULL << VIRTIO_VSOCK_F_MRG_RXBUF),
};
/* Used to track all the vhost_vsock instances on the system. */
@@ -80,6 +81,68 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
return vsock;
}
+static int get_rx_bufs(struct vhost_virtqueue *vq,
+ struct vring_used_elem *heads, int datalen,
+ unsigned *iovcount, unsigned int quota)
+{
+ unsigned int out, in;
+ int seg = 0;
+ int headcount = 0;
+ unsigned d;
+ int ret;
+ /*
+ * len is always initialized before use since we are always called with
+ * datalen > 0.
+ */
+ u32 uninitialized_var(len);
+
+ while (datalen > 0 && headcount < quota) {
+ if (unlikely(seg >= UIO_MAXIOV)) {
+ ret = -ENOBUFS;
+ goto err;
+ }
+
+ ret = vhost_get_vq_desc(vq, vq->iov + seg,
+ ARRAY_SIZE(vq->iov) - seg, &out,
+ &in, NULL, NULL);
+ if (unlikely(ret < 0))
+ goto err;
+
+ d = ret;
+ if (d == vq->num) {
+ ret = 0;
+ goto err;
+ }
+
+ if (unlikely(out || in <= 0)) {
+ vq_err(vq, "unexpected descriptor format for RX: "
+ "out %d, in %d\n", out, in);
+ ret = -EINVAL;
+ goto err;
+ }
+
+ heads[headcount].id = cpu_to_vhost32(vq, d);
+ len = iov_length(vq->iov + seg, in);
+ heads[headcount].len = cpu_to_vhost32(vq, len);
+ datalen -= len;
+ ++headcount;
+ seg += in;
+ }
+
+ heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
+ *iovcount = seg;
+
+ /* Detect overrun */
+ if (unlikely(datalen > 0)) {
+ ret = UIO_MAXIOV + 1;
+ goto err;
+ }
+ return headcount;
+err:
+ vhost_discard_vq_desc(vq, headcount);
+ return ret;
+}
+
static void
vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct vhost_virtqueue *vq)
@@ -87,22 +150,34 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
bool added = false;
bool restart_tx = false;
+ int mergeable;
+ size_t vsock_hlen;
mutex_lock(&vq->mutex);
if (!vq->private_data)
goto out;
+ mergeable = vhost_has_feature(vq, VIRTIO_VSOCK_F_MRG_RXBUF);
+ /*
+ * The guest fills the rx vq with bare pages in the mergeable case and
+ * allocates no pkt structure, so reserve room for the pkt in advance.
+ */
+ if (likely(mergeable))
+ vsock_hlen = sizeof(struct virtio_vsock_pkt);
+ else
+ vsock_hlen = sizeof(struct virtio_vsock_hdr);
+
/* Avoid further vmexits, we're already processing the virtqueue */
vhost_disable_notify(&vsock->dev, vq);
for (;;) {
struct virtio_vsock_pkt *pkt;
struct iov_iter iov_iter;
- unsigned out, in;
+ unsigned out = 0, in = 0;
size_t nbytes;
size_t len;
- int head;
+ s16 headcount;
spin_lock_bh(&vsock->send_pkt_list_lock);
if (list_empty(&vsock->send_pkt_list)) {
@@ -116,16 +191,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
list_del_init(&pkt->list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
- head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- &out, &in, NULL, NULL);
- if (head < 0) {
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
- break;
- }
-
- if (head == vq->num) {
+ headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
+ &in, likely(mergeable) ? UIO_MAXIOV : 1);
+ if (headcount <= 0) {
spin_lock_bh(&vsock->send_pkt_list_lock);
list_add(&pkt->list, &vsock->send_pkt_list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
@@ -133,19 +201,13 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
/* We cannot finish yet if more buffers snuck in while
* re-enabling notify.
*/
- if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+ if (!headcount && unlikely(vhost_enable_notify(&vsock->dev, vq))) {
vhost_disable_notify(&vsock->dev, vq);
continue;
}
break;
}
- if (out) {
- virtio_transport_free_pkt(pkt);
- vq_err(vq, "Expected 0 output buffers, got %u\n", out);
- break;
- }
-
len = iov_length(&vq->iov[out], in);
iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
@@ -156,6 +218,19 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
break;
}
+ if (likely(mergeable)) {
+ pkt->mrg_rxbuf_hdr.num_buffers = cpu_to_le16(headcount);
+ nbytes = copy_to_iter(&pkt->mrg_rxbuf_hdr,
+ sizeof(pkt->mrg_rxbuf_hdr), &iov_iter);
+ if (nbytes != sizeof(pkt->mrg_rxbuf_hdr)) {
+ virtio_transport_free_pkt(pkt);
+ vq_err(vq, "Faulted on copying rxbuf hdr\n");
+ break;
+ }
+ iov_iter_advance(&iov_iter, (vsock_hlen -
+ sizeof(pkt->mrg_rxbuf_hdr) - sizeof(pkt->hdr)));
+ }
+
nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
if (nbytes != pkt->len) {
virtio_transport_free_pkt(pkt);
@@ -163,7 +238,7 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
break;
}
- vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+ vhost_add_used_n(vq, vq->heads, headcount);
added = true;
if (pkt->reply) {
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index bf84418..da9e1fe 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -50,6 +50,7 @@ struct virtio_vsock_sock {
struct virtio_vsock_pkt {
struct virtio_vsock_hdr hdr;
+ struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
struct work_struct work;
struct list_head list;
/* socket refcnt not held, only use for cancellation */
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 1d57ed3..2292f30 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
__le32 fwd_cnt;
} __attribute__((packed));
+/* Header used by the mergeable rx buffers feature */
+struct virtio_vsock_mrg_rxbuf_hdr {
+ __le16 num_buffers; /* number of mergeable rx buffers */
+} __attribute__((packed));
+
enum virtio_vsock_type {
VIRTIO_VSOCK_TYPE_STREAM = 1,
};
--
1.8.3.1
^ permalink raw reply related
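get_rx_bufs() walks available descriptors until the packet (header plus payload) fits, then trims the last used element so that its length covers only the bytes actually written. Below is a userspace model of that bookkeeping. The avail_ring and used_elem types and model_get_rx_bufs() are hypothetical simplifications in which a descriptor is just a byte length; on failure the model rewinds the claimed descriptors the way vhost_discard_vq_desc() does. For clarity it returns -1 on overrun, whereas the patch returns UIO_MAXIOV + 1.

```c
#include <stddef.h>

/* Hypothetical simplification of struct vring_used_elem. */
struct used_elem {
	unsigned int id;
	unsigned int len;
};

/* Hypothetical avail ring: desc_len[k] is the byte size of the k-th
 * buffer posted by the guest (one page each in the mergeable case). */
struct avail_ring {
	const unsigned int *desc_len;
	size_t ndesc;
	size_t next;	/* next descriptor to claim */
};

/* Claim buffers until datalen bytes fit or quota heads are used.
 * Returns the head count, 0 if the ring ran dry, -1 on overrun. */
static int model_get_rx_bufs(struct avail_ring *r, struct used_elem *heads,
			     int datalen, int quota)
{
	size_t start = r->next;
	int headcount = 0;
	unsigned int len = 0;

	if (datalen <= 0)	/* the patch is always called with datalen > 0 */
		return 0;

	while (datalen > 0 && headcount < quota) {
		if (r->next == r->ndesc) {	/* no buffer available */
			r->next = start;	/* rewind, like vhost_discard_vq_desc() */
			return 0;
		}
		len = r->desc_len[r->next];
		heads[headcount].id = (unsigned int)r->next++;
		heads[headcount].len = len;
		datalen -= (int)len;
		headcount++;
	}

	if (datalen > 0) {	/* quota reached before the packet fit */
		r->next = start;
		return -1;
	}

	/* datalen <= 0: shrink the last element to the bytes really used */
	heads[headcount - 1].len = (unsigned int)((int)len + datalen);
	return headcount;
}
```

With one-page buffers, a 5000-byte packet occupies two heads whose lengths are 4096 and 904; in the patch, that head count is what ends up in mrg_rxbuf_hdr.num_buffers.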
* [PATCH 4/5] VSOCK: modify default rx buf size to improve performance
From: jiangyiwen @ 2018-11-05 7:47 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
Since VSOCK now supports mergeable rx buffers, which balance
performance against guest memory usage, we can increase the
default rx buffer size to improve performance.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
include/linux/virtio_vsock.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 6be3cd7..594e720 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -10,7 +10,7 @@
#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE 128
#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE (1024 * 256)
#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE (1024 * 256)
-#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
+#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 64)
#define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
/* virtio_vsock_pkt + max_pkt_len(default MAX_PKT_BUF_SIZE) */
--
1.8.3.1
^ permalink raw reply related
* [PATCH 5/5] VSOCK: batch sending rx buffer to increase bandwidth
From: jiangyiwen @ 2018-11-05 7:48 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
Sending rx buffers in batches can improve total bandwidth.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
drivers/vhost/vsock.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 648be39..a587ddc 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -148,10 +148,12 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
struct vhost_virtqueue *vq)
{
struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
- bool added = false;
bool restart_tx = false;
int mergeable;
size_t vsock_hlen;
+ int batch_count = 0;
+
+#define VHOST_VSOCK_BATCH 16
mutex_lock(&vq->mutex);
@@ -191,8 +193,9 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
list_del_init(&pkt->list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
- headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
- &in, likely(mergeable) ? UIO_MAXIOV : 1);
+ headcount = get_rx_bufs(vq, vq->heads + batch_count,
+ vsock_hlen + pkt->len, &in,
+ likely(mergeable) ? UIO_MAXIOV : 1);
if (headcount <= 0) {
spin_lock_bh(&vsock->send_pkt_list_lock);
list_add(&pkt->list, &vsock->send_pkt_list);
@@ -238,8 +241,12 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
break;
}
- vhost_add_used_n(vq, vq->heads, headcount);
- added = true;
+ batch_count += headcount;
+ if (batch_count > VHOST_VSOCK_BATCH) {
+ vhost_add_used_and_signal_n(&vsock->dev, vq,
+ vq->heads, batch_count);
+ batch_count = 0;
+ }
if (pkt->reply) {
int val;
@@ -258,8 +265,11 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
virtio_transport_free_pkt(pkt);
}
- if (added)
- vhost_signal(&vsock->dev, vq);
+
+ if (batch_count) {
+ vhost_add_used_and_signal_n(&vsock->dev, vq,
+ vq->heads, batch_count);
+ }
out:
mutex_unlock(&vq->mutex);
--
1.8.3.1
^ permalink raw reply related
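This patch replaces the per-packet vhost_add_used_n() call with an accumulated batch. Heads pile up in vq->heads and are published with vhost_add_used_and_signal_n() once batch_count exceeds VHOST_VSOCK_BATCH (16), with a final flush after the loop. Below is a userspace model of only that control flow. used_model and add_used_and_signal_n() are hypothetical stand-ins that count published heads and notifications. Because the threshold test is `>`, a mid-loop flush carries at least 17 heads.

```c
#include <stddef.h>

#define BATCH 16	/* mirrors VHOST_VSOCK_BATCH in the patch */

/* Hypothetical counters standing in for the used ring and guest IRQs. */
struct used_model {
	unsigned int used_total;	/* heads published to the used ring */
	unsigned int signals;		/* notifications sent to the guest */
};

/* Stand-in for vhost_add_used_and_signal_n(): publish n heads, notify once. */
static void add_used_and_signal_n(struct used_model *m, unsigned int n)
{
	m->used_total += n;
	m->signals++;
}

/* headcounts[k] is the number of heads consumed by packet k. */
static void send_pkts(struct used_model *m, const unsigned int *headcounts,
		      size_t npkts)
{
	unsigned int batch_count = 0;

	for (size_t k = 0; k < npkts; k++) {
		batch_count += headcounts[k];
		if (batch_count > BATCH) {	/* same test as the patch */
			add_used_and_signal_n(m, batch_count);
			batch_count = 0;
		}
	}
	if (batch_count)	/* flush the partial batch on loop exit */
		add_used_and_signal_n(m, batch_count);
}
```

For example, 100 single-head packets are published in five batches of 17 plus a final batch of 15.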
* [PATCH] xfrm: Fix bucket count reported to userspace
From: Benjamin Poirier @ 2018-11-05 8:00 UTC (permalink / raw)
To: Steffen Klassert, Jamal Hadi Salim; +Cc: Herbert Xu, David S. Miller, netdev
sadhcnt is reported by `ip -s xfrm state count` as "buckets count", not the
hash mask.
Fixes: 28d8909bc790 ("[XFRM]: Export SAD info.")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
---
net/xfrm/xfrm_state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index b669262682c9..12cdb350c456 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -788,7 +788,7 @@ void xfrm_sad_getinfo(struct net *net, struct xfrmk_sadinfo *si)
{
spin_lock_bh(&net->xfrm.xfrm_state_lock);
si->sadcnt = net->xfrm.state_num;
- si->sadhcnt = net->xfrm.state_hmask;
+ si->sadhcnt = net->xfrm.state_hmask + 1;
si->sadhmcnt = xfrm_state_hashmax;
spin_unlock_bh(&net->xfrm.xfrm_state_lock);
}
--
2.19.0
^ permalink raw reply related
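The xfrm state hash tables are power-of-two sized, and state_hmask holds size - 1. For that reason the bucket index of a hash is hash & mask, and the number of buckets is mask + 1, which is what the fix reports. A minimal sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Table of 2^k buckets: mask = 2^k - 1, so hash & mask is always a
 * valid bucket index. */
static unsigned int bucket_index(uint32_t hash, unsigned int hmask)
{
	return hash & hmask;
}

/* The bucket count userspace should see: mask + 1. */
static unsigned int bucket_count(unsigned int hmask)
{
	return hmask + 1;
}
```

For an 8-bucket table, the old code reported 7 (the mask) instead of 8.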
* Re: [PATCH 0/5] Use common cordic algorithm for b43
From: Arend van Spriel @ 2018-11-05 8:17 UTC (permalink / raw)
To: Kalle Valo, Priit Laes
Cc: linux-wireless, b43-dev, netdev, linux-kernel,
brcm80211-dev-list.pdl, brcm80211-dev-list
In-Reply-To: <87muqoar5i.fsf@purkki.adurom.net>
On 11/5/2018 9:02 AM, Kalle Valo wrote:
> Also I don't see MAINTAINERS entry for cordic.[c|h], that would be good
> to have as well.
We added the cordic library functions during brcm80211 staging cleanup.
We can add it to MAINTAINERS file.
Regards,
Arend
^ permalink raw reply
* [PATCH] net: alx: make alx_drv_name static
From: Rasmus Villemoes @ 2018-11-05 17:52 UTC (permalink / raw)
To: Jay Cliburn, Chris Snook, David S. Miller
Cc: Rasmus Villemoes, netdev, linux-kernel
alx_drv_name is not used outside main.c, so there's no reason for it to
have external linkage.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
drivers/net/ethernet/atheros/alx/alx.h | 1 -
drivers/net/ethernet/atheros/alx/main.c | 2 +-
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/atheros/alx/alx.h b/drivers/net/ethernet/atheros/alx/alx.h
index 78c5de467426..9d0e74f6b089 100644
--- a/drivers/net/ethernet/atheros/alx/alx.h
+++ b/drivers/net/ethernet/atheros/alx/alx.h
@@ -140,6 +140,5 @@ struct alx_priv {
};
extern const struct ethtool_ops alx_ethtool_ops;
-extern const char alx_drv_name[];
#endif
diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c
index 7968c644ad86..c131cfc1b79d 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -49,7 +49,7 @@
#include "hw.h"
#include "reg.h"
-const char alx_drv_name[] = "alx";
+static const char alx_drv_name[] = "alx";
static void alx_free_txbuf(struct alx_tx_queue *txq, int entry)
{
--
2.19.1.6.gbde171bbf5
^ permalink raw reply related
* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Tariq Toukan @ 2018-11-05 8:42 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Aaron Lu
Cc: Saeed Mahameed, pstaszewski@itcare.pl, eric.dumazet@gmail.com,
netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
yoel@kviknet.dk, mgorman@techsingularity.net
In-Reply-To: <20181103135325.01a7b5d6@redhat.com>
On 03/11/2018 2:53 PM, Jesper Dangaard Brouer wrote:
>
> On Fri, 2 Nov 2018 22:20:24 +0800 Aaron Lu <aaron.lu@intel.com> wrote:
>
>> On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:
>>> On Fri, 2 Nov 2018 13:23:56 +0800
>>> Aaron Lu <aaron.lu@intel.com> wrote:
>>>
>>>> On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
>>>>> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
>>>>>> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote:
>>>>>> ... ...
>>>>>>> Section copied out:
>>>>>>>
>>>>>>>   mlx5e_poll_tx_cq
>>>>>>>   |
>>>>>>>    --16.34%--napi_consume_skb
>>>>>>>              |
>>>>>>>              |--12.65%--__free_pages_ok
>>>>>>>              |          |
>>>>>>>              |           --11.86%--free_one_page
>>>>>>>              |                     |
>>>>>>>              |                     |--10.10%--queued_spin_lock_slowpath
>>>>>>>              |                     |
>>>>>>>              |                      --0.65%--_raw_spin_lock
>>>>>>
>>>>>> This callchain looks like it is freeing pages of higher order than
>>>>>> order 0: __free_pages_ok is only called for pages whose order is
>>>>>> bigger than 0.
>>>>>
>>>>> mlx5 rx uses only order 0 pages, so i don't know where these high order
>>>>> tx SKBs are coming from..
>>>>
>>>> Perhaps here:
>>>> __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
>>>> __napi_alloc_frag() will all call page_frag_alloc(), which will use
>>>> __page_frag_cache_refill() to get an order 3 page if possible, or fall
>>>> back to an order 0 page if order 3 page is not available.
>>>>
>>>> I'm not sure if your workload will use the above code path though.
>>>
>>> TL;DR: these are order-0 pages (code walk-through proof below)
>>>
>>> To Aaron, the network stack *can* call __free_pages_ok() with order-0
>>> pages, via:
>>>
>>> static void skb_free_head(struct sk_buff *skb)
>>> {
>>> unsigned char *head = skb->head;
>>>
>>> if (skb->head_frag)
>>> skb_free_frag(head);
>>> else
>>> kfree(head);
>>> }
>>>
>>> static inline void skb_free_frag(void *addr)
>>> {
>>> page_frag_free(addr);
>>> }
>>>
>>> /*
>>> * Frees a page fragment allocated out of either a compound or order 0 page.
>>> */
>>> void page_frag_free(void *addr)
>>> {
>>> struct page *page = virt_to_head_page(addr);
>>>
>>> if (unlikely(put_page_testzero(page)))
>>> __free_pages_ok(page, compound_order(page));
>>> }
>>> EXPORT_SYMBOL(page_frag_free);
>>
>> I think here is a problem - order 0 pages are freed directly to buddy,
>> bypassing per-cpu-pages. This might be the reason lock contention
>> appeared on free path.
>
> OMG - you just found a significant issue with the network stack's
> interaction with the page allocator! This explains why I could not get
> the PCP (Per-Cpu-Pages) system to perform well in my
> performance networking benchmarks: we are basically only using the
> alloc side of PCP, not the free side.
> We have spent years adding different driver-level recycle tricks to
> avoid activating this code path, precisely because it is rather
> slow and hitting this zone->lock is problematic.
>
Oh! It has been behaving this way for too long.
Good catch!
>> Can someone apply below diff and see if lock contention is gone?
>
> I have also applied and tested this patch, and yes, the lock contention
> is gone. As mentioned, it is rather difficult to hit this code path, as
> the driver page recycle mechanism tries to hide/avoid it, but mlx5 +
> page_pool + CPU-map recycling has a known weakness that bypasses the
> driver page recycle scheme (which I've not fixed yet). I observed a 7%
> speedup for this micro benchmark.
>
Great news. I also have a benchmark that uses order-0 pages and stresses
the zone lock. I'll test your patch this week.
>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e2ef1c17942f..65c0ae13215a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
>> {
>> struct page *page = virt_to_head_page(addr);
>>
>> - if (unlikely(put_page_testzero(page)))
>> - __free_pages_ok(page, compound_order(page));
>> + if (unlikely(put_page_testzero(page))) {
>> + unsigned int order = compound_order(page);
>> +
>> + if (order == 0)
>> + free_unref_page(page);
>> + else
>> + __free_pages_ok(page, order);
>> + }
>> }
>> EXPORT_SYMBOL(page_frag_free);
>
> Thank you Aaron for spotting this!!!
>
Thanks Aaron :) !!
Does it conflict with your recent work that optimizes order-0 allocation?
^ permalink raw reply
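The following is a toy model of why the fix matters; it is not the real allocator. In the trace above, __free_pages_ok() reaches free_one_page() and takes the zone lock for every freed page. Aaron's diff instead sends order-0 pages to free_unref_page(), the per-cpu-pages (PCP) path, which caches frees locally and returns them to the zone in batches. The structs, the high = 64 / batch = 32 watermarks and the drain rule below are illustrative assumptions, not kernel defaults.

```c
/* Toy model: count zone-lock acquisitions for two free strategies. */
struct zone_model {
	unsigned long lock_acquisitions;
	unsigned long free_pages;	/* pages returned to the zone */
};

struct pcp_model {
	unsigned int count;	/* pages cached on this CPU */
	unsigned int high;	/* drain once count reaches this */
	unsigned int batch;	/* pages handed back per drain (<= high) */
};

/* Like the old page_frag_free() path: every free takes the zone lock. */
static void free_direct(struct zone_model *z)
{
	z->lock_acquisitions++;
	z->free_pages++;
}

/* PCP-style path: cache locally, take the zone lock once per batch. */
static void free_via_pcp(struct zone_model *z, struct pcp_model *p)
{
	p->count++;	/* per-CPU list, no zone lock */
	if (p->count >= p->high) {
		z->lock_acquisitions++;
		z->free_pages += p->batch;
		p->count -= p->batch;
	}
}
```

Under these assumed watermarks, 1000 frees take the zone lock 1000 times on the direct path but only 30 times on the PCP path.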
* RE: [PATCH net-next 5/6] net/ncsi: Reset channel state in ncsi_start_dev()
From: Justin.Lee1 @ 2018-11-05 18:01 UTC (permalink / raw)
To: sam, netdev; +Cc: davem, linux-kernel, openbmc
In-Reply-To: <c4f0fdcc971ca258539899a8b15755b96b2353f5.camel@mendozajonas.com>
> On Tue, 2018-10-30 at 21:26 +0000, Justin.Lee1@Dell.com wrote:
> > > +int ncsi_reset_dev(struct ncsi_dev *nd)
> > > +{
> > > + struct ncsi_dev_priv *ndp = TO_NCSI_DEV_PRIV(nd);
> > > + struct ncsi_channel *nc, *active;
> > > + struct ncsi_package *np;
> > > + unsigned long flags;
> > > + bool enabled;
> > > + int state;
> > > +
> > > + active = NULL;
> > > + NCSI_FOR_EACH_PACKAGE(ndp, np) {
> > > + NCSI_FOR_EACH_CHANNEL(np, nc) {
> > > + spin_lock_irqsave(&nc->lock, flags);
> > > + enabled = nc->monitor.enabled;
> > > + state = nc->state;
> > > + spin_unlock_irqrestore(&nc->lock, flags);
> > > +
> > > + if (enabled)
> > > + ncsi_stop_channel_monitor(nc);
> > > + if (state == NCSI_CHANNEL_ACTIVE) {
> > > + active = nc;
> > > + break;
> >
> > Is the original intention to process the channels one by one?
> > If so, there are two loops and we might need to use
> > "goto found" instead.
>
> Yes we'll need to break out of the package loop here as well.
>
> >
> > > + }
> > > + }
> > > + }
> > > +
> >
> > found: ?
> >
> > > + if (!active) {
> > > + /* Done */
> > > + spin_lock_irqsave(&ndp->lock, flags);
> > > + ndp->flags &= ~NCSI_DEV_RESET;
> > > + spin_unlock_irqrestore(&ndp->lock, flags);
> > > + return ncsi_choose_active_channel(ndp);
> > > + }
> > > +
> > > + spin_lock_irqsave(&ndp->lock, flags);
> > > + ndp->flags |= NCSI_DEV_RESET;
> > > + ndp->active_channel = active;
> > > + ndp->active_package = active->package;
> > > + spin_unlock_irqrestore(&ndp->lock, flags);
> > > +
> > > + nd->state = ncsi_dev_state_suspend;
> > > + schedule_work(&ndp->work);
> > > + return 0;
> > > +}
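The review point above is that `break` leaves only the inner NCSI_FOR_EACH_CHANNEL loop, so the package loop keeps running. In the sketch below, plain arrays stand in for the package/channel lists. struct chan, its visited flag (which models the ncsi_stop_channel_monitor() side effect) and both functions are hypothetical.

```c
#include <stddef.h>

#define NPKG 2
#define NCHAN 2

struct chan {
	int active;
	int visited;	/* models side effects such as stopping the monitor */
};

/* 'break' exits only the channel loop: later packages are still
 * visited and a later active channel overwrites the first match. */
static struct chan *find_active_break(struct chan ch[NPKG][NCHAN])
{
	struct chan *active = NULL;

	for (size_t p = 0; p < NPKG; p++) {
		for (size_t c = 0; c < NCHAN; c++) {
			ch[p][c].visited = 1;
			if (ch[p][c].active) {
				active = &ch[p][c];
				break;
			}
		}
	}
	return active;
}

/* 'goto found' leaves both loops at the first active channel. */
static struct chan *find_active_goto(struct chan ch[NPKG][NCHAN])
{
	struct chan *active = NULL;

	for (size_t p = 0; p < NPKG; p++) {
		for (size_t c = 0; c < NCHAN; c++) {
			ch[p][c].visited = 1;
			if (ch[p][c].active) {
				active = &ch[p][c];
				goto found;
			}
		}
	}
found:
	return active;
}
```

With an active channel in each package, the `break` version returns the one in package 1 and still touches package 1's channels, while the `goto` version stops at package 0.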
> >
> > Also similar issue in ncsi_choose_active_channel() function below.
> >
> > > @@ -916,32 +1045,49 @@ static int ncsi_choose_active_channel(struct ncsi_dev_priv *ndp)
> > >
> > > ncm = &nc->modes[NCSI_MODE_LINK];
> > > if (ncm->data[2] & 0x1) {
> > > - spin_unlock_irqrestore(&nc->lock, flags);
> > > found = nc;
> > > - goto out;
> > > + with_link = true;
> > > }
> > >
> > > - spin_unlock_irqrestore(&nc->lock, flags);
> > > + /* If multi_channel is enabled configure all valid
> > > + * channels whether or not they currently have link
> > > + * so they will have AENs enabled.
> > > + */
> > > + if (with_link || np->multi_channel) {
> >
> > I noticed a case where we misconfigure the interface.
> > In the example below, multi-channel is not enabled for package 1,
> > but we enable the channel for ncsi2 (package 1 channel 0) because that interface is the first
> > channel with link in that package.
>
> I don't think I see the issue here; multi-channel is not set on package
> 1, but both channels are in the channel whitelist. Channel 0 is
> configured since it's the first found on package 1, and channel 1 is not
> since channel 0 is already found. Are you expecting something different?
>
The setting is that multi-package is enabled for both package 0 and 1.
Multi-channel is only enabled for package 0.
> >
> > cat /sys/kernel/debug/ncsi_protocol/ncsi_device_
> > IFIDX IFNAME NAME PID CID RX TX MP MC WP WC PC CS PS LS RU CR NQ HA
> > =====================================================================
> > 2 eth2 ncsi0 000 000 1 1 1 1 1 1 0 2 1 1 1 1 0 1
> > 2 eth2 ncsi1 000 001 1 0 1 1 1 1 0 2 1 1 1 1 0 1
> > 2 eth2 ncsi2 001 000 1 0 1 0 1 1 0 2 1 1 1 1 0 1
I was replying to the wrong (older) email, which may have caused some confusion.
The first 1 means the channel is enabled for package 1 channel 0 (ncsi2).
For eth2, we already have ncsi0 as the active channel with TX enabled.
I would think that, since that package doesn't have multi-channel enabled,
we should not enable the channel for ncsi2. The problem is that package 1 doesn't
have multi-channel enabled, so it believes it needs to enable one channel for its package,
but it isn't aware that the other package already has one active channel.
> > 2 eth2 ncsi3 001 001 0 0 1 0 1 1 0 1 0 1 1 1 0 1
> > =====================================================================
> > MP: Multi-mode Package WP: Whitelist Package
> > MC: Multi-mode Channel WC: Whitelist Channel
> > PC: Primary Channel CS: Channel State IA/A/IV 1/2/3
> > PS: Poll Status LS: Link Status
> > RU: Running CR: Carrier OK
> > NQ: Queue Stopped HA: Hardware Arbitration
> >
> > I temporarily changed the condition to the following to avoid that.
> > if ((with_link &&
> > !np->multi_channel &&
> > list_empty(&ndp->channel_queue)) || np->multi_channel) {
> >
> > > + spin_lock_irqsave(&ndp->lock, flags);
> > > + list_add_tail_rcu(&nc->link,
> > > + &ndp->channel_queue);
> > > + spin_unlock_irqrestore(&ndp->lock, flags);
> > > +
> > > + netdev_dbg(ndp->ndev.dev,
> > > + "NCSI: Channel %u added to queue (link %s)\n",
> > > + nc->id,
> > > + ncm->data[2] & 0x1 ? "up" : "down");
> > > + }
> > > +
> > > + spin_unlock_irqrestore(&nc->lock, cflags);
> > > +
> > > + if (with_link && !np->multi_channel)
> > > + break;
> >
> > Similar issue here. Because we use break, each package will configure its own active TX channel.
> >
>
> I believe this is handled properly in ncsi_channel_is_tx() in the most
> recent revision.
I saw this issue with the last revision. I was using the wrong email to reply.
>
> > > }
> > > + if (with_link && !ndp->multi_package)
> > > + break;
> > > }
> > >
> > > - if (!found) {
> > > + if (list_empty(&ndp->channel_queue) && found) {
> > > + netdev_info(ndp->ndev.dev,
> > > + "NCSI: No channel with link found, configuring channel %u\n",
> > > + found->id);
> > > + spin_lock_irqsave(&ndp->lock, flags);
> > > + list_add_tail_rcu(&found->link, &ndp->channel_queue);
> > > + spin_unlock_irqrestore(&ndp->lock, flags);
> > > + } else if (!found) {
> > > netdev_warn(ndp->ndev.dev,
> > > - "NCSI: No channel found with link\n");
> > > + "NCSI: No channel found to configure!\n");
> > > ncsi_report_link(ndp, true);
> > > return -ENODEV;
> > > }
> >
> > Also, in the deselect package response handler, do we want to set the channels to inactive here?
> > If we just change the state, the cached data still keeps the old value. If the new
> > ncsi_reset_dev() function handles the channels one by one, can we skip this part?
>
> Technically yes we could skip the state change here since
> ncsi_reset_dev() will have already done it. However if we send a DP
> command via some other means then it is probably best to ensure we treat
> all channels on that package as inactive.
When I tested, unless I commented out the state change in the response handler,
ncsi_reset_dev() did not work properly: some channels got into an
invisible state, and in the end we lost those selectable channels.
>
> >
> > static int ncsi_rsp_handler_dp(struct ncsi_request *nr)
> > {
> > struct ncsi_rsp_pkt *rsp;
> > struct ncsi_dev_priv *ndp = nr->ndp;
> > struct ncsi_package *np;
> > struct ncsi_channel *nc;
> > unsigned long flags;
> >
> > /* Find the package */
> > rsp = (struct ncsi_rsp_pkt *)skb_network_header(nr->rsp);
> > ncsi_find_package_and_channel(ndp, rsp->rsp.common.channel,
> > &np, NULL);
> > if (!np)
> > return -ENODEV;
> >
> > /* Change state of all channels attached to the package */
> > NCSI_FOR_EACH_CHANNEL(np, nc) {
> > spin_lock_irqsave(&nc->lock, flags);
> > nc->state = NCSI_CHANNEL_INACTIVE;
> >
> > spin_unlock_irqrestore(&nc->lock, flags);
> > }
> >
> > return 0;
> > }
> >
> >
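To compare the original queueing condition with the temporary one proposed above, here is a minimal user-space sketch of the selection loop. It is a hypothetical model, not kernel code: the model_* names are invented, whitelists, channel states, locking and the hot channel are left out, `queued == 0` stands in for list_empty(&ndp->channel_queue), and it assumes (as the `if (with_link && !ndp->multi_package) break;` test after the inner loop suggests) that with_link is not reset between packages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one NCSI channel (not struct ncsi_channel). */
struct model_channel {
	bool link;   /* LS column: link is up */
	bool queued; /* set when the channel would join ndp->channel_queue */
};

/* Hypothetical model of one package with two channels. */
struct model_package {
	bool multi_channel; /* MC column */
	struct model_channel ch[2];
};

/*
 * Walk packages and channels like the patched loop and return the number
 * of queued channels. @proposed selects the reviewer's temporary condition
 * instead of "with_link || np->multi_channel".
 * Assumption: with_link is not reset between packages.
 */
static size_t model_select(struct model_package *pkgs, size_t npkgs,
			   bool multi_package, bool proposed)
{
	bool with_link = false;
	size_t queued = 0;

	assert(pkgs != NULL || npkgs == 0);
	for (size_t p = 0; p < npkgs; p++) {
		struct model_package *np = &pkgs[p];

		for (size_t c = 0; c < 2; c++) {
			struct model_channel *nc = &np->ch[c];
			bool add;

			if (nc->link)
				with_link = true;

			if (proposed)
				/* queued == 0 stands in for list_empty() */
				add = (with_link && !np->multi_channel &&
				       queued == 0) || np->multi_channel;
			else
				add = with_link || np->multi_channel;

			if (add) {
				nc->queued = true;
				queued++;
			}

			if (with_link && !np->multi_channel)
				break;
		}
		if (with_link && !multi_package)
			break;
	}
	return queued;
}
```

With the configuration from the table (multi-package on both packages, multi-channel only on package 0, link up on every channel), the original condition queues ncsi2 as a third channel, while the proposed condition queues only ncsi0 and ncsi1.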
* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Aaron Lu @ 2018-11-05 8:48 UTC (permalink / raw)
To: Tariq Toukan
Cc: Jesper Dangaard Brouer, Saeed Mahameed, pstaszewski@itcare.pl,
eric.dumazet@gmail.com, netdev@vger.kernel.org,
ilias.apalodimas@linaro.org, yoel@kviknet.dk,
mgorman@techsingularity.net
In-Reply-To: <a01c44c2-bb52-e575-62c0-e990b38bda53@mellanox.com>
On Mon, Nov 05, 2018 at 08:42:33AM +0000, Tariq Toukan wrote:
>
> On 03/11/2018 2:53 PM, Jesper Dangaard Brouer wrote:
> >
> > On Fri, 2 Nov 2018 22:20:24 +0800 Aaron Lu <aaron.lu@intel.com> wrote:
> >>
> >> I think there is a problem here - order 0 pages are freed directly to buddy,
> >> bypassing per-cpu-pages. This might be the reason lock contention
> >> appeared on free path.
> >
> > OMG - you just found a significant issue with the network stack's
> > interaction with the page allocator! This explains why I could not get
> > the PCP (Per-Cpu-Pages) system to have good performance in my
> > performance networking benchmarks, as we are basically only using the
> > alloc side of PCP, and not the free side.
> > We have spent years adding different driver-level recycle tricks to
> > avoid this code path getting activated, exactly because hitting this
> > zone->lock is rather slow and problematic.
> >
>
> Oh! It has been behaving this way for too long.
> Good catch!
Thanks.
> >> Can someone apply the diff below and see if the lock contention is gone?
> >
> > I have also applied and tested this patch, and yes the lock contention
> > is gone. As mentioned, it is rather difficult to hit this code path, as
> > the driver page recycle mechanism tries to hide/avoid it, but mlx5 +
> > page_pool + CPU-map recycling have a known weakness that bypasses the
> > driver page recycle scheme (that I've not fixed yet). I observed a 7%
> > speedup for this micro benchmark.
> >
>
> Great news. I also have a benchmark that uses order-0 pages and stresses
> the zone-lock. I'll test your patch during this week.
Note this patch only helps when order-0 pages are freed through
page_frag_free().
I'll send a formal patch later.
> >
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index e2ef1c17942f..65c0ae13215a 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
> >> {
> >> struct page *page = virt_to_head_page(addr);
> >>
> >> - if (unlikely(put_page_testzero(page)))
> >> - __free_pages_ok(page, compound_order(page));
> >> + if (unlikely(put_page_testzero(page))) {
> >> + unsigned int order = compound_order(page);
> >> +
> >> + if (order == 0)
> >> + free_unref_page(page);
> >> + else
> >> + __free_pages_ok(page, order);
> >> + }
> >> }
> >> EXPORT_SYMBOL(page_frag_free);
> >
> > Thank you Aaron for spotting this!!!
> >
> Thanks Aaron :) !!
>
> Does it conflict with your recent work that optimizes order-0 allocation?
No, it doesn't. This patch optimizes code outside of the zone lock (by reducing
the need to take the zone lock), while my recent work optimizes code inside
the zone lock :-)
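A toy counter model may help illustrate why sending order-0 frees through the per-CPU lists reduces how often the zone lock is taken. It is a loose simplification under stated assumptions, not the allocator's code: the pcp_model type and both functions are hypothetical, `high` and `batch` are made-up values, and a drain is assumed to return `batch` pages to buddy under a single lock acquisition.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: counts zone lock acquisitions for order-0 frees. */
struct pcp_model {
	size_t count;      /* pages cached on the per-CPU list */
	size_t high;       /* drain threshold (made-up value) */
	size_t batch;      /* pages handed back to buddy per drain */
	size_t zone_locks; /* how often the zone lock was taken */
};

/* Old page_frag_free() behavior for order-0: straight to buddy,
 * one zone lock acquisition per page. */
static void free_direct(struct pcp_model *m)
{
	m->zone_locks++;
}

/* Per-CPU path: no zone lock until the list reaches @high, then one
 * acquisition returns @batch pages at once. */
static void free_via_pcp(struct pcp_model *m)
{
	assert(m->batch > 0 && m->batch <= m->high);
	m->count++;
	if (m->count >= m->high) {
		m->zone_locks++;
		m->count -= m->batch;
	}
}
```

With high = 8 and batch = 4, 100 frees take the lock 100 times on the direct path but 24 times through the per-CPU list; the real ratio depends on the actual pcp high/batch settings.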
* Re: [PATCH] net: skbuff.h: remove unnecessary unlikely()
From: David Miller @ 2018-11-05 18:09 UTC (permalink / raw)
To: tiny.windzz
Cc: edumazet, willemb, dja, ast, sbrivio, pabeni, linux-kernel,
netdev
In-Reply-To: <CAEExFWuGN_R=7B4ueAVA8hVoix8ko8zQSXzHxZB5gjwP7jOjpg@mail.gmail.com>
From: Frank Lee <tiny.windzz@gmail.com>
Date: Mon, 5 Nov 2018 21:21:50 +0800
> add netdev@vger.kernel.org
> -- Yangtao
Sorry, you can't do it like that.
You have to make a formal, fresh, posting to netdev with your patch.
Thank you.
* Re: linux-next: Tree for Nov 5 (net/ipv6/af_inet6)
From: David Miller @ 2018-11-05 18:12 UTC (permalink / raw)
To: rdunlap; +Cc: sfr, linux-next, linux-kernel, netdev, 0xeffeff
In-Reply-To: <2ad190b3-0b0d-972b-2a6e-16abf4a81c5b@infradead.org>
From: Randy Dunlap <rdunlap@infradead.org>
Date: Mon, 5 Nov 2018 08:21:28 -0800
> On 11/4/18 9:51 PM, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20181102:
>>
>> Non-merge commits (relative to Linus' tree): 418
>
> on i386 or x86_64:
>
> ld: net/ipv6/af_inet6.o: in function `inet6_init':
> af_inet6.c:(.init.text+0x285): undefined reference to `ipv6_anycast_init'
> ld: af_inet6.c:(.init.text+0x376): undefined reference to `ipv6_anycast_cleanup'
>
>
> Full randconfig file is attached (for i386).
Jeff, please fix this.
* Re: [PATCH V2 2/7] net: lorawan: Add LoRaWAN socket module
From: David Miller @ 2018-11-05 18:16 UTC (permalink / raw)
To: starnight
Cc: afaerber, netdev, linux-arm-kernel, linux-kernel, marcel,
dollar.chen, ken.yu, linux-wpan, stefan
In-Reply-To: <20181105165544.5215-3-starnight@g.ncu.edu.tw>
From: Jian-Hong Pan <starnight@g.ncu.edu.tw>
Date: Tue, 6 Nov 2018 00:55:40 +0800
> +static inline struct lrw_mac_cb * mac_cb(struct sk_buff *skb)
"mac_cb()" is pretty generic for a name, and leads to namespace pollution,
please use lrw_mac_cb() or similar.
> +static inline struct dgram_sock *
> +dgram_sk(const struct sock *sk)
> +{
> + return container_of(sk, struct dgram_sock, sk);
> +}
> +
> +static inline struct net_device *
> +lrw_get_dev_by_addr(struct net *net, u32 devaddr)
Never use inline for functions in a foo.c file, let the compiler decide.
> +{
> + struct net_device *ndev = NULL;
> + __be32 be_addr = cpu_to_be32(devaddr);
Always order local variables from longest to shortest line.
> +static int
> +dgram_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> + int noblock, int flags, int *addr_len)
> +{
> + struct sk_buff *skb;
> + size_t copied = 0;
> + DECLARE_SOCKADDR(struct sockaddr_lorawan *, saddr, msg->msg_name);
> + int err;
Likewise.
I'm not going to point out every single place where you have made these
two errors.
Please audit your entire submission and fix the problems wherever they
occur.
Thank you.
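The local-variable ordering rule above (often called "reverse Christmas tree" on netdev) is mechanical enough to check automatically. The helper below is a hypothetical illustration, not part of checkpatch.pl or any kernel tool: it only compares line lengths, allows equal lengths, and, in line with the review's advice, is a plain static function rather than static inline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if the local variable declaration
 * lines are ordered from longest to shortest (ties allowed).
 */
static bool decls_longest_first(const char *const *decls, size_t n)
{
	assert(decls != NULL || n == 0);
	for (size_t i = 1; i < n; i++) {
		/* a line longer than the one above it breaks the order */
		if (strlen(decls[i]) > strlen(decls[i - 1]))
			return false;
	}
	return true;
}
```

Applied to the two locals quoted from lrw_get_dev_by_addr(), it reports a violation because the __be32 line is longer than the line above it; moving the DECLARE_SOCKADDR() line to the top of dgram_recvmsg()'s locals satisfies it.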
* [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
From: Aaron Lu @ 2018-11-05 8:58 UTC (permalink / raw)
To: linux-mm, linux-kernel, netdev
Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
Dave Hansen
page_frag_free() calls __free_pages_ok() to free the page back to
Buddy. This is OK for high order pages, but for order-0 pages it
misses the optimization opportunity of using Per-Cpu-Pages and can
cause zone lock contention when called frequently.
Paweł Staszewski recently shared his results of 'how Linux kernel
handles normal traffic'[1], and from the perf data Jesper Dangaard Brouer
found that the lock contention comes from the page allocator:
mlx5e_poll_tx_cq
|
--16.34%--napi_consume_skb
|
|--12.65%--__free_pages_ok
| |
| --11.86%--free_one_page
| |
| |--10.10%--queued_spin_lock_slowpath
| |
| --0.65%--_raw_spin_lock
|
|--1.55%--page_frag_free
|
--1.44%--skb_release_data
Jesper explained how it happened: the mlx5 driver's RX page recycle
mechanism is not effective in this workload, so pages have to go
through the page allocator. The lock contention happens during the
mlx5 DMA TX completion cycle, and the page allocator cannot keep
up at these speeds.[2]
I thought that __free_pages_ok() was mostly freeing high order
pages and that this was lock contention for high order pages,
but Jesper explained in detail that __free_pages_ok() here is
actually freeing order-0 pages, because mlx5 uses order-0 pages
to satisfy its page pool allocation requests.[3]
The free path as pointed out by Jesper is:
skb_free_head()
  -> skb_free_frag()
    -> page_frag_free()
And the pages being freed on this path are order-0 pages.
Fix this by doing what __page_frag_cache_drain() does: send the
page being freed to PCP if it is an order-0 page, or directly to
Buddy if it is a high order page.
With this change, Paweł has not yet noticed lock contention in
his workload, and Jesper observed a 7% performance improvement
in a micro benchmark, with the lock contention gone.
[1]: https://www.spinics.net/lists/netdev/msg531362.html
[2]: https://www.spinics.net/lists/netdev/msg531421.html
[3]: https://www.spinics.net/lists/netdev/msg531556.html
Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
mm/page_alloc.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..91a9a6af41a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);
- if (unlikely(put_page_testzero(page)))
- __free_pages_ok(page, compound_order(page));
+ if (unlikely(put_page_testzero(page))) {
+ unsigned int order = compound_order(page);
+
+ if (order == 0)
+ free_unref_page(page);
+ else
+ __free_pages_ok(page, order);
+ }
}
EXPORT_SYMBOL(page_frag_free);
--
2.17.2
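A user-space model of the patched page_frag_free() shows the intended dispatch: only the final reference frees the page, and the page's order decides where it goes. The fake_page type, put_testzero() and the two counters are hypothetical stand-ins for struct page, put_page_testzero(), free_unref_page() and __free_pages_ok(); nothing here is the allocator's real code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page. */
struct fake_page {
	int refcount;
	unsigned int order; /* what compound_order() would report */
};

static int pcp_frees;   /* would be free_unref_page() */
static int buddy_frees; /* would be __free_pages_ok() */

/* Analogue of put_page_testzero(): drop one ref, true if it was the last. */
static bool put_testzero(struct fake_page *page)
{
	assert(page->refcount > 0);
	return --page->refcount == 0;
}

/* Mirrors the patched page_frag_free(): order-0 pages take the
 * per-CPU path, higher orders go straight to buddy. */
static void model_page_frag_free(struct fake_page *page)
{
	if (put_testzero(page)) {
		if (page->order == 0)
			pcp_frees++;
		else
			buddy_frees++;
	}
}
```

An order-0 page with two references is freed to the per-CPU path only on the second call; an order-3 page with one reference goes to buddy immediately.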
* [PATCH 2/2] mm/page_alloc: use a single function to free page
From: Aaron Lu @ 2018-11-05 8:58 UTC (permalink / raw)
To: linux-mm, linux-kernel, netdev
Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
Dave Hansen
In-Reply-To: <20181105085820.6341-1-aaron.lu@intel.com>
We have multiple places that free a page, most of them doing similar
things, so a common function can be used to reduce code duplication.
It also avoids a bug being fixed in one function but left in another.
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
mm/page_alloc.c | 37 +++++++++++++++++--------------------
1 file changed, 17 insertions(+), 20 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91a9a6af41a2..2b330296e92a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
}
EXPORT_SYMBOL(get_zeroed_page);
-void __free_pages(struct page *page, unsigned int order)
+/*
+ * Free a page by reducing its ref count by @nr.
+ * If its refcount reaches 0, then according to its order:
+ * order0: send to PCP;
+ * high order: directly send to Buddy.
+ */
+static inline void free_the_page(struct page *page, unsigned int order, int nr)
{
- if (put_page_testzero(page)) {
+ VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+ if (page_ref_sub_and_test(page, nr)) {
if (order == 0)
free_unref_page(page);
else
@@ -4435,6 +4443,11 @@ void __free_pages(struct page *page, unsigned int order)
}
}
+void __free_pages(struct page *page, unsigned int order)
+{
+ free_the_page(page, order, 1);
+}
+
EXPORT_SYMBOL(__free_pages);
void free_pages(unsigned long addr, unsigned int order)
@@ -4481,16 +4494,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
void __page_frag_cache_drain(struct page *page, unsigned int count)
{
- VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-
- if (page_ref_sub_and_test(page, count)) {
- unsigned int order = compound_order(page);
-
- if (order == 0)
- free_unref_page(page);
- else
- __free_pages_ok(page, order);
- }
+ free_the_page(page, compound_order(page), count);
}
EXPORT_SYMBOL(__page_frag_cache_drain);
@@ -4555,14 +4559,7 @@ void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);
- if (unlikely(put_page_testzero(page))) {
- unsigned int order = compound_order(page);
-
- if (order == 0)
- free_unref_page(page);
- else
- __free_pages_ok(page, order);
- }
+ free_the_page(page, compound_order(page), 1);
}
EXPORT_SYMBOL(page_frag_free);
--
2.17.2
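The new helper differs from the code it replaces mainly in taking a reference count `nr`, which lets __page_frag_cache_drain() drop several references at once while the other callers pass 1. The sketch below models that with hypothetical toy_* names; the assert() plays the role of VM_BUG_ON_PAGE(), and the two counters stand in for the PCP and buddy free paths.

```c
#include <assert.h>

/* Hypothetical stand-in for struct page. */
struct toy_page {
	int refcount;
	unsigned int order; /* what compound_order() would return */
};

static int toy_pcp;   /* frees that would go to free_unref_page() */
static int toy_buddy; /* frees that would go to __free_pages_ok() */

/* Same shape as free_the_page(): drop @nr references; if that was the
 * last one, dispatch on @order. */
static void toy_free_the_page(struct toy_page *page, unsigned int order,
			      int nr)
{
	assert(page->refcount > 0); /* VM_BUG_ON_PAGE() analogue */
	page->refcount -= nr;
	if (page->refcount == 0) {
		if (order == 0)
			toy_pcp++;
		else
			toy_buddy++;
	}
}

/* The three callers from the patch, reduced to the arguments they pass. */
static void toy_free_pages(struct toy_page *page, unsigned int order)
{
	toy_free_the_page(page, order, 1);
}

static void toy_page_frag_cache_drain(struct toy_page *page,
				      unsigned int count)
{
	toy_free_the_page(page, page->order, (int)count);
}

static void toy_page_frag_free(struct toy_page *page)
{
	toy_free_the_page(page, page->order, 1);
}
```

Draining an order-3 page that holds five references with count = 5 frees it to buddy in one call, whereas page_frag_free() on a page with two references only drops one.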
* Re: [PATCH] staging: net: ipv4: tcp_westwood: fixed warnings and checks
From: David Miller @ 2018-11-05 18:20 UTC (permalink / raw)
To: suraj1998; +Cc: edumazet, kuznet, yoshfuji, netdev, linux-kernel
In-Reply-To: <1541425985-31869-1-git-send-email-suraj1998@gmail.com>
From: Suraj Singh <suraj1998@gmail.com>
Date: Mon, 5 Nov 2018 19:23:05 +0530
> Fixed warnings and checks for TCP Westwood
>
> Signed-off-by: Suraj Singh <suraj1998@gmail.com>
Why 'staging' in the subject line?
* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Jesper Dangaard Brouer @ 2018-11-05 9:10 UTC (permalink / raw)
To: Aaron Lu
Cc: Saeed Mahameed, pstaszewski@itcare.pl, eric.dumazet@gmail.com,
netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
yoel@kviknet.dk, mgorman@techsingularity.net, brouer,
Jérôme Glisse
In-Reply-To: <20181105062836.GB4502@intel.com>
On Mon, 5 Nov 2018 14:28:36 +0800
Aaron Lu <aaron.lu@intel.com> wrote:
> On Sat, Nov 03, 2018 at 01:53:25PM +0100, Jesper Dangaard Brouer wrote:
> >
> > On Fri, 2 Nov 2018 22:20:24 +0800 Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > > On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:
> > > > On Fri, 2 Nov 2018 13:23:56 +0800
> > > > Aaron Lu <aaron.lu@intel.com> wrote:
> > > >
> > > > > On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
> > > > > > On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
> > > > > > > On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> > > > > > > wrote:
> > > > > > > ... ...
> > > > > > > > Section copied out:
> > > > > > > >
> > > > > > > > mlx5e_poll_tx_cq
> > > > > > > > |
> > > > > > > > --16.34%--napi_consume_skb
> > > > > > > > |
> > > > > > > > |--12.65%--__free_pages_ok
> > > > > > > > | |
> > > > > > > > | --11.86%--free_one_page
> > > > > > > > | |
> > > > > > > > | |--10.10%
> > > > > > > > --queued_spin_lock_slowpath
> > > > > > > > | |
> > > > > > > > | --0.65%--_raw_spin_lock
> > > > > > >
> > > > > > > This callchain looks like it is freeing higher order pages than order
> > > > > > > 0:
> > > > > > > __free_pages_ok is only called for pages whose order is greater than
> > > > > > > 0.
> > > > > >
> > > > > > mlx5 rx uses only order 0 pages, so I don't know where these high order
> > > > > > tx SKBs are coming from..
> > > > >
> > > > > Perhaps here:
> > > > > __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
> > > > > __napi_alloc_frag() will all call page_frag_alloc(), which will use
> > > > > __page_frag_cache_refill() to get an order 3 page if possible, or fall
> > > > > back to an order 0 page if order 3 page is not available.
> > > > >
> > > > > I'm not sure if your workload will use the above code path though.
> > > >
> > > > TL;DR: these are order-0 pages (code walk-through proof below)
> > > >
> > > > To Aaron, the network stack *can* call __free_pages_ok() with order-0
> > > > pages, via:
> > > >
> > > > static void skb_free_head(struct sk_buff *skb)
> > > > {
> > > > unsigned char *head = skb->head;
> > > >
> > > > if (skb->head_frag)
> > > > skb_free_frag(head);
> > > > else
> > > > kfree(head);
> > > > }
> > > >
> > > > static inline void skb_free_frag(void *addr)
> > > > {
> > > > page_frag_free(addr);
> > > > }
> > > >
> > > > /*
> > > > * Frees a page fragment allocated out of either a compound or order 0 page.
> > > > */
> > > > void page_frag_free(void *addr)
> > > > {
> > > > struct page *page = virt_to_head_page(addr);
> > > >
> > > > if (unlikely(put_page_testzero(page)))
> > > > __free_pages_ok(page, compound_order(page));
> > > > }
> > > > EXPORT_SYMBOL(page_frag_free);
> > >
> > > I think there is a problem here - order 0 pages are freed directly to buddy,
> > > bypassing per-cpu-pages. This might be the reason lock contention
> > > appeared on free path.
> >
> > OMG - you just found a significant issue with the network stack's
> > interaction with the page allocator! This explains why I could not get
> > the PCP (Per-Cpu-Pages) system to have good performance in my
> > performance networking benchmarks, as we are basically only using the
> > alloc side of PCP, and not the free side.
>
> Exactly.
>
> > We have spent years adding different driver-level recycle tricks to
> > avoid this code path getting activated, exactly because hitting this
> > zone->lock is rather slow and problematic.
>
> I can see that when this code path is hit, it unnecessarily takes the
> zone lock for order-0 pages and causes lock contention.
>
> >
> > > Can someone apply the diff below and see if the lock contention is gone?
> >
> > I have also applied and tested this patch, and yes the lock contention
> > is gone. As mentioned, it is rather difficult to hit this code path, as
> > the driver page recycle mechanism tries to hide/avoid it, but mlx5 +
> > page_pool + CPU-map recycling have a known weakness that bypasses the
> > driver page recycle scheme (that I've not fixed yet). I observed a 7%
> > speedup for this micro benchmark.
>
> Good to know this, I will prepare a formal patch.
I wonder if this code is still missing something. I was looking at
using the put_devmap_managed_page() infrastructure, but I realized that
page_frag_free() also skips this code path. I guess I can add
it later, when I show/prove (performance-wise) that this is a good idea
(as we currently don't have any users).
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index e2ef1c17942f..65c0ae13215a 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
> > > {
> > > struct page *page = virt_to_head_page(addr);
> > >
> > > - if (unlikely(put_page_testzero(page)))
> > > - __free_pages_ok(page, compound_order(page));
> > > + if (unlikely(put_page_testzero(page))) {
> > > + unsigned int order = compound_order(page);
> > > +
> > > + if (order == 0)
> > > + free_unref_page(page);
> > > + else
> > > + __free_pages_ok(page, order);
> > > + }
> > > }
> > > EXPORT_SYMBOL(page_frag_free);
> >
> > Thank you Aaron for spotting this!!!
>
> Which would have been impossible without your analysis :-)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 5/5] b43: Drop internal cordic algorithm implementation
From: Arend van Spriel @ 2018-11-05 9:11 UTC (permalink / raw)
To: Kalle Valo, Priit Laes
Cc: linux-kernel, David S. Miller, linux-wireless, b43-dev, netdev
In-Reply-To: <87y3a7gaag.fsf@codeaurora.org>
On 11/5/2018 10:09 AM, Kalle Valo wrote:
> Priit Laes <plaes@plaes.org> writes:
>
>> Signed-off-by: Priit Laes <plaes@plaes.org>
>
> No empty commit logs, please.
>
> And IMHO you could fold patch 5 into patch 4.
Similarly 2 and 3.
Regards,
Arend