* [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission
@ 2018-03-07 1:12 Jesus Sanchez-Palencia
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
This series is the v3 of the Time based packet transmission RFC, which was
originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
and further developed by us with the addition of the tbs qdisc
(v2: https://lwn.net/Articles/744797/ ).
It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
implements support for hw offloading on the igb driver for the Intel
i210 NIC. The tbs qdisc also supports SW best effort that can be used
as a fallback.
The main changes since v2 can be found below.
Fixes since v2:
- skb->tstamp is only cleared on the forwarding path;
- ktime_t is no longer the type used for timestamps (s64 is);
- get_unaligned() is now used for copying data from the cmsg header;
- added getsockopt() support for SO_TXTIME;
- restricted SO_TXTIME input range to [0,1];
- removed ns_capable() check from __sock_cmsg_send();
- the qdisc control struct now uses a 32-bit bitmap for config flags;
- fixed qdisc backlog decrement bug;
- 'overlimits' is now incremented on dequeue() drops in addition to the
'dropped' counter;
Interface changes since v2:
* CMSG interface:
- added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
- added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
* tc-tbs:
- clockid now receives a string;
e.g.: CLOCK_REALTIME or /dev/ptp0
- offload is now a standalone argument (i.e. no more offload 1);
- sorting is now an argument that enables the txtime-based sorting provided
by the qdisc;
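To illustrate the two kinds of clockid mentioned above (a static clock
such as CLOCK_REALTIME versus a PTP device such as /dev/ptp0), here is a
minimal sketch of the conventional user-space mapping between an open
PTP character device and a dynamic clockid_t, in the style of the
kernel's testptp tool. Whether tc resolves the string this way is an
assumption, and the helper names fd_to_clockid() and clockid_to_fd() are
made up for this example.

```c
#include <assert.h>
#include <sys/types.h>

#define CLOCKFD 3 /* low bits that mark an fd-based (dynamic) clock */

/* Encode an open file descriptor as a dynamic POSIX clockid. */
static clockid_t fd_to_clockid(int fd)
{
	assert(fd >= 0);
	/* Shift as unsigned to avoid left-shifting a negative value. */
	return (clockid_t)((~(unsigned int)fd << 3) | CLOCKFD);
}

/* Recover the file descriptor from a dynamic clockid. */
static int clockid_to_fd(clockid_t clkid)
{
	/* Mask off the bits that the logical right shift filled with zeros. */
	return (int)(~((unsigned int)clkid >> 3) & 0x1fffffffu);
}
```

A caller would typically open("/dev/ptp0", O_RDWR), pass the resulting
descriptor to fd_to_clockid() and then use the returned value with
clock_gettime().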
Design changes since v2:
- Now on the dequeue() path, tbs only drops an expired packet if it has the
skb->tc_drop_if_late flag set. In practical terms, this defines whether
the semantics of txtime on a system are "not earlier than" or "not later
than" a given timestamp;
- Now on the enqueue() path, the qdisc will drop a packet if its clockid
doesn't match the qdisc's one;
- Sorting the packets based on their txtime is now an option for the qdisc.
Effectively, this means it can be configured in 4 modes: HW offload or
SW best-effort, sorting enabled or disabled;
The tbs qdisc is designed so it buffers packets until a configurable time before
their deadline (tx times). If sorting is enabled, regardless of HW offload or SW
fallback modes, the qdisc uses a rbtree internally so the buffered packets are
always 'ordered' by the earliest deadline.
If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW best-effort,
it will use a 'scheduled' FIFO.
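The mapping from the two flags to the internal queue described above can
be sketched as follows. This is only a model of the text in this cover
letter, not code from sch_tbs.c; the enum and the function name
tbs_pick_queue() are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical labels for the three internal queue flavours. */
enum tbs_queue_kind {
	TBS_QUEUE_RBTREE,     /* sorted by earliest txtime */
	TBS_QUEUE_RAW_FIFO,   /* plain enqueue_tail / dequeue_head */
	TBS_QUEUE_SCHED_FIFO, /* FIFO released by the qdisc watchdog */
};

/* Pick the queue flavour for a given (offload, sorting) configuration. */
static enum tbs_queue_kind tbs_pick_queue(bool offload, bool sorting)
{
	if (sorting)
		return TBS_QUEUE_RBTREE; /* same for HW offload and SW fallback */
	return offload ? TBS_QUEUE_RAW_FIFO : TBS_QUEUE_SCHED_FIFO;
}
```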
The other configurable parameter of the tbs qdisc is the clockid to be used.
In order to provide that, this series adds a new API to pkt_sched.h (i.e.
qdisc_watchdog_init_clockid()).
The tbs qdisc will drop any packets with a transmission time in the past or
when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
advance plus configuring the delta parameter for the system correctly makes
all the difference in reducing the number of drops. Moreover, note that the
delta parameter ends up defining the Tx time when SW best-effort is used
given that the timestamps won't be used by the NIC in this case.
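As a rough illustration of how delta and the drop_if_late flag interact
on the dequeue() path, consider the sketch below. It only models the
behaviour described in this cover letter; the names are hypothetical and
the handling of the exact boundary (now == txtime) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum tbs_verdict { TBS_HOLD, TBS_SEND, TBS_DROP };

/* Earliest time the qdisc would release a packet: delta ns before txtime. */
static int64_t tbs_release_time(int64_t txtime_ns, int64_t delta_ns)
{
	assert(delta_ns >= 0);
	return txtime_ns - delta_ns;
}

/* Decide what to do with the head packet at time 'now_ns'. */
static enum tbs_verdict tbs_dequeue_verdict(int64_t now_ns, int64_t txtime_ns,
					    int64_t delta_ns, bool drop_if_late)
{
	if (now_ns > txtime_ns)	/* deadline already missed */
		return drop_if_late ? TBS_DROP : TBS_SEND;
	if (now_ns < tbs_release_time(txtime_ns, delta_ns))
		return TBS_HOLD;	/* keep buffering */
	return TBS_SEND;
}
```

With delta set to 100000, a packet whose txtime is 1000000 is held until
900000, sent anywhere in [900000, 1000000], and after that either
dropped or sent late depending on the flag.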
Examples:
# SW best-effort with sorting #
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 100000 \
clockid CLOCK_REALTIME sorting
In this example, first the mqprio qdisc is set up, then the tbs qdisc is
configured onto the first hw Tx queue using SW best-effort with sorting
enabled. Also, it is configured so the timestamps on each packet are in
reference to the clockid CLOCK_REALTIME and so packets are dequeued from
the qdisc 100000 nanoseconds before their transmission time.
# HW offload without sorting #
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 tbs offload
In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. It is implicitly assumed that
the timestamps in skbuffs are in reference to the interface's PHC and
setting any other valid clockid would be treated as an error. Because
there is no scheduling being performed in the qdisc, setting a delta != 0
would also be considered an error.
# HW offload with sorting #
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
clockid CLOCK_REALTIME sorting
Here, the Qdisc will use HW offload for the txtime control again,
but now sorting will be enabled, and thus there will be scheduling being
performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
and packets leave the Qdisc "delta" (100000) nanoseconds before
their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as expected.
For testing, we've followed an approach similar to the v1 and v2 testing and
no significant changes in the results were observed. An updated version of
udp_tai.c is attached to this cover letter.
Lastly, most of the to-dos we still have before a final patchset are related
to further testing the igb support:
- testing with L2 only talkers + AF_PACKET sockets;
- testing tbs in conjunction with cbs;
Thanks for all the feedback so far,
Jesus
Jesus Sanchez-Palencia (12):
sock: Fix SO_ZEROCOPY switch case
net: Clear skb->tstamp only on the forwarding path
posix-timers: Add CLOCKID_INVALID mask
net: SO_TXTIME: Add clockid and drop_if_late params
net: ipv4: raw: Handle remaining txtime parameters
net: ipv4: udp: Handle remaining txtime parameters
net: packet: Handle remaining txtime parameters
net/sched: Add HW offloading capability to TBS
igb: Refactor igb_configure_cbs()
igb: Only change Tx arbitration when CBS is on
igb: Refactor igb_offload_cbs()
igb: Add support for TBS offload
Richard Cochran (4):
net: Add a new socket option for a future transmit time.
net: ipv4: raw: Hook into time based transmission.
net: ipv4: udp: Hook into time based transmission.
net: packet: Hook into time based transmission.
Vinicius Costa Gomes (2):
net/sched: Allow creating a Qdisc watchdog with other clocks
net/sched: Introduce the TBS Qdisc
arch/alpha/include/uapi/asm/socket.h | 5 +
arch/frv/include/uapi/asm/socket.h | 5 +
arch/ia64/include/uapi/asm/socket.h | 5 +
arch/m32r/include/uapi/asm/socket.h | 5 +
arch/mips/include/uapi/asm/socket.h | 5 +
arch/mn10300/include/uapi/asm/socket.h | 5 +
arch/parisc/include/uapi/asm/socket.h | 5 +
arch/s390/include/uapi/asm/socket.h | 5 +
arch/sparc/include/uapi/asm/socket.h | 5 +
arch/xtensa/include/uapi/asm/socket.h | 5 +
drivers/net/ethernet/intel/igb/e1000_defines.h | 16 +
drivers/net/ethernet/intel/igb/igb.h | 1 +
drivers/net/ethernet/intel/igb/igb_main.c | 239 +++++++---
include/linux/netdevice.h | 2 +
include/linux/posix-timers.h | 1 +
include/linux/skbuff.h | 3 +
include/net/pkt_sched.h | 7 +
include/net/sock.h | 4 +
include/uapi/asm-generic/socket.h | 5 +
include/uapi/linux/pkt_sched.h | 18 +
net/core/skbuff.c | 1 -
net/core/sock.c | 44 +-
net/ipv4/raw.c | 7 +
net/ipv4/udp.c | 10 +-
net/packet/af_packet.c | 19 +
net/sched/Kconfig | 11 +
net/sched/Makefile | 1 +
net/sched/sch_api.c | 11 +-
net/sched/sch_tbs.c | 591 +++++++++++++++++++++++++
29 files changed, 978 insertions(+), 63 deletions(-)
create mode 100644 net/sched/sch_tbs.c
--
2.16.2
---8<---
/*
* This program demonstrates transmission of UDP packets using the
* system TAI timer.
*
* Copyright (C) 2017 linutronix GmbH
*
* Large portions taken from the linuxptp stack.
* Copyright (C) 2011, 2012 Richard Cochran <richardcochran@gmail.com>
*
* Some portions taken from the sgd test program.
* Copyright (C) 2015 linutronix GmbH
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/
#define _GNU_SOURCE /*for CPU_SET*/
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <ifaddrs.h>
#include <linux/ethtool.h>
#include <linux/net_tstamp.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <netinet/in.h>
#include <poll.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>
#define DEFAULT_PERIOD 1000000
#define DEFAULT_DELAY 500000
#define MCAST_IPADDR "239.1.1.1"
#define UDP_PORT 7788
#ifndef SO_TXTIME
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
#define SCM_DROP_IF_LATE 62
#define SCM_CLOCKID 63
#endif
#define pr_err(s) fprintf(stderr, s "\n")
#define pr_info(s) fprintf(stdout, s "\n")
static int running = 1, use_so_txtime = 1;
static int period_nsec = DEFAULT_PERIOD;
static int waketx_delay = DEFAULT_DELAY;
static struct in_addr mcast_addr;
static int mcast_bind(int fd, int index)
{
int err;
struct ip_mreqn req;
memset(&req, 0, sizeof(req));
req.imr_ifindex = index;
err = setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, &req, sizeof(req));
if (err) {
pr_err("setsockopt IP_MULTICAST_IF failed: %m");
return -1;
}
return 0;
}
static int mcast_join(int fd, int index, const struct sockaddr *grp,
socklen_t grplen)
{
int err, off = 0;
struct ip_mreqn req;
struct sockaddr_in *sa = (struct sockaddr_in *) grp;
memset(&req, 0, sizeof(req));
memcpy(&req.imr_multiaddr, &sa->sin_addr, sizeof(struct in_addr));
req.imr_ifindex = index;
err = setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &req, sizeof(req));
if (err) {
pr_err("setsockopt IP_ADD_MEMBERSHIP failed: %m");
return -1;
}
err = setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &off, sizeof(off));
if (err) {
pr_err("setsockopt IP_MULTICAST_LOOP failed: %m");
return -1;
}
return 0;
}
static void normalize(struct timespec *ts)
{
while (ts->tv_nsec > 999999999) {
ts->tv_sec += 1;
ts->tv_nsec -= 1000000000;
}
}
static int sk_interface_index(int fd, const char *name)
{
struct ifreq ifreq;
int err;
memset(&ifreq, 0, sizeof(ifreq));
strncpy(ifreq.ifr_name, name, sizeof(ifreq.ifr_name) - 1);
err = ioctl(fd, SIOCGIFINDEX, &ifreq);
if (err < 0) {
pr_err("ioctl SIOCGIFINDEX failed: %m");
return err;
}
return ifreq.ifr_ifindex;
}
static int open_socket(const char *name, struct in_addr mc_addr, short port)
{
struct sockaddr_in addr;
int fd, index, on = 1;
int priority = 3;
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons(port);
fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
if (fd < 0) {
pr_err("socket failed: %m");
goto no_socket;
}
index = sk_interface_index(fd, name);
if (index < 0)
goto no_option;
if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &priority, sizeof(priority))) {
pr_err("Couldn't set priority");
goto no_option;
}
if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on))) {
pr_err("setsockopt SO_REUSEADDR failed: %m");
goto no_option;
}
if (bind(fd, (struct sockaddr *) &addr, sizeof(addr))) {
pr_err("bind failed: %m");
goto no_option;
}
if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, name, strlen(name))) {
pr_err("setsockopt SO_BINDTODEVICE failed: %m");
goto no_option;
}
addr.sin_addr = mc_addr;
if (mcast_join(fd, index, (struct sockaddr *) &addr, sizeof(addr))) {
pr_err("mcast_join failed");
goto no_option;
}
if (mcast_bind(fd, index)) {
goto no_option;
}
if (use_so_txtime && setsockopt(fd, SOL_SOCKET, SO_TXTIME, &on, sizeof(on))) {
pr_err("setsockopt SO_TXTIME failed: %m");
goto no_option;
}
return fd;
no_option:
close(fd);
no_socket:
return -1;
}
static int udp_open(const char *name)
{
int fd;
if (!inet_aton(MCAST_IPADDR, &mcast_addr))
return -1;
fd = open_socket(name, mcast_addr, UDP_PORT);
return fd;
}
static int udp_send(int fd, void *buf, int len, __u64 txtime, clockid_t clkid)
{
char control[CMSG_SPACE(sizeof(txtime)) + CMSG_SPACE(sizeof(clkid)) + CMSG_SPACE(sizeof(uint8_t))] = {};
struct sockaddr_in sin;
struct cmsghdr *cmsg;
struct msghdr msg;
struct iovec iov;
ssize_t cnt;
uint8_t drop_if_late = 1;
memset(&sin, 0, sizeof(sin));
sin.sin_family = AF_INET;
sin.sin_addr = mcast_addr;
sin.sin_port = htons(UDP_PORT);
iov.iov_base = buf;
iov.iov_len = len;
memset(&msg, 0, sizeof(msg));
msg.msg_name = &sin;
msg.msg_namelen = sizeof(sin);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
/*
* We specify the transmission time in the CMSG.
*/
if (use_so_txtime) {
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_TXTIME;
cmsg->cmsg_len = CMSG_LEN(sizeof(__u64));
*((__u64 *) CMSG_DATA(cmsg)) = txtime;
cmsg = CMSG_NXTHDR(&msg, cmsg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_CLOCKID;
cmsg->cmsg_len = CMSG_LEN(sizeof(clockid_t));
*((clockid_t *) CMSG_DATA(cmsg)) = clkid;
cmsg = CMSG_NXTHDR(&msg, cmsg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_DROP_IF_LATE;
cmsg->cmsg_len = CMSG_LEN(sizeof(uint8_t));
*((uint8_t *) CMSG_DATA(cmsg)) = drop_if_late;
}
cnt = sendmsg(fd, &msg, 0);
if (cnt < 1) {
pr_err("sendmsg failed: %m");
return cnt;
}
return cnt;
}
static unsigned char tx_buffer[256];
static int marker;
static int run_nanosleep(clockid_t clkid, int fd)
{
struct timespec ts;
int cnt, err;
__u64 txtime;
clock_gettime(clkid, &ts);
/* Start one to two seconds in the future. */
ts.tv_sec += 1;
ts.tv_nsec = 1000000000 - waketx_delay;
normalize(&ts);
txtime = ts.tv_sec * 1000000000ULL + ts.tv_nsec;
txtime += waketx_delay;
while (running) {
err = clock_nanosleep(clkid, TIMER_ABSTIME, &ts, NULL);
switch (err) {
case 0:
cnt = udp_send(fd, tx_buffer, sizeof(tx_buffer), txtime, clkid);
if (cnt != sizeof(tx_buffer)) {
pr_err("udp_send failed");
}
memset(tx_buffer, marker++, sizeof(tx_buffer));
ts.tv_nsec += period_nsec;
normalize(&ts);
txtime += period_nsec;
break;
case EINTR:
continue;
default:
fprintf(stderr, "clock_nanosleep returned %d: %s\n",
err, strerror(err));
return err;
}
}
return 0;
}
static int set_realtime(pthread_t thread, int priority, int cpu)
{
cpu_set_t cpuset;
struct sched_param sp;
int err, policy;
int min = sched_get_priority_min(SCHED_FIFO);
int max = sched_get_priority_max(SCHED_FIFO);
fprintf(stderr, "min %d max %d\n", min, max);
if (priority < 0) {
return 0;
}
err = pthread_getschedparam(thread, &policy, &sp);
if (err) {
fprintf(stderr, "pthread_getschedparam: %s\n", strerror(err));
return -1;
}
sp.sched_priority = priority;
err = pthread_setschedparam(thread, SCHED_FIFO, &sp);
if (err) {
fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
return -1;
}
if (cpu < 0) {
return 0;
}
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
err = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
if (err) {
fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
return -1;
}
return 0;
}
static void usage(char *progname)
{
fprintf(stderr,
"\n"
"usage: %s [options]\n"
"\n"
" -c [num] run on CPU 'num'\n"
" -d [num] delay from wake up to transmission in nanoseconds (default %d)\n"
" -h prints this message and exits\n"
" -i [name] use network interface 'name'\n"
" -p [num] run with RT priority 'num'\n"
" -P [num] period in nanoseconds (default %d)\n"
" -u do not use SO_TXTIME\n"
"\n",
progname, DEFAULT_DELAY, DEFAULT_PERIOD);
}
int main(int argc, char *argv[])
{
int c, cpu = -1, err, fd, priority = -1;
clockid_t clkid = CLOCK_REALTIME;
char *iface = NULL, *progname;
/* Process the command line arguments. */
progname = strrchr(argv[0], '/');
progname = progname ? 1 + progname : argv[0];
while (EOF != (c = getopt(argc, argv, "c:d:hi:p:P:u"))) {
switch (c) {
case 'c':
cpu = atoi(optarg);
break;
case 'd':
waketx_delay = atoi(optarg);
break;
case 'h':
usage(progname);
return 0;
case 'i':
iface = optarg;
break;
case 'p':
priority = atoi(optarg);
break;
case 'P':
period_nsec = atoi(optarg);
break;
case 'u':
use_so_txtime = 0;
break;
case '?':
usage(progname);
return -1;
}
}
if (waketx_delay > 999999999 || waketx_delay < 0) {
pr_err("Bad wake up to transmission delay.");
usage(progname);
return -1;
}
if (period_nsec < 1000) {
pr_err("Bad period.");
usage(progname);
return -1;
}
if (!iface) {
pr_err("Need a network interface.");
usage(progname);
return -1;
}
if (set_realtime(pthread_self(), priority, cpu)) {
return -1;
}
fd = udp_open(iface);
if (fd < 0) {
return -1;
}
err = run_nanosleep(clkid, fd);
close(fd);
return err;
}
* [Intel-wired-lan] [RFC v3 net-next 01/18] sock: Fix SO_ZEROCOPY switch case
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Fix the SO_ZEROCOPY switch case in sock_setsockopt() so that ret values
are no longer overwritten by the one set in the default case.
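The effect of the misplaced break can be reproduced in user space with
the simplified stand-in below. The condition guarding the inner block is
an assumption, since the hunk does not show it; the function names, the
OPT_ZEROCOPY constant and the use of EOPNOTSUPP in place of the
kernel-internal error code are likewise made up for illustration.

```c
#include <errno.h>

#define OPT_ZEROCOPY 60 /* stand-in for SO_ZEROCOPY */

/* Before: the break lives inside the inner block, so an earlier error
 * falls through into default and gets replaced by -ENOPROTOOPT. */
static int setopt_before(int optname, int family_ok, int val)
{
	int ret = 0;

	switch (optname) {
	case OPT_ZEROCOPY:
		if (!family_ok)
			ret = -EOPNOTSUPP;
		if (!ret) {
			if (val < 0 || val > 1)
				ret = -EINVAL;
			break;
		}
		/* falls through when ret != 0 */
	default:
		ret = -ENOPROTOOPT;
		break;
	}
	return ret;
}

/* After: the break follows the inner block, so ret is preserved. */
static int setopt_after(int optname, int family_ok, int val)
{
	int ret = 0;

	switch (optname) {
	case OPT_ZEROCOPY:
		if (!family_ok)
			ret = -EOPNOTSUPP;
		if (!ret) {
			if (val < 0 || val > 1)
				ret = -EINVAL;
		}
		break;

	default:
		ret = -ENOPROTOOPT;
		break;
	}
	return ret;
}
```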
Fixes: 28190752c7092 ("sock: permit SO_ZEROCOPY on PF_RDS socket")
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
net/core/sock.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 507d8c6c4319..27f218bba43f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1062,8 +1062,9 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
ret = -EINVAL;
else
sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
- break;
}
+ break;
+
default:
ret = -ENOPROTOOPT;
break;
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
This is done in preparation for the upcoming time based transmission
patchset. Now that skb->tstamp will be used to hold a packet's txtime,
we must ensure that it is being cleared when traversing namespaces.
Also, doing that from skb_scrub_packet() would break our feature when
tunnels are used.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
include/linux/netdevice.h | 1 +
net/core/skbuff.c | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dbe6344b727a..7104de2bc957 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3379,6 +3379,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
skb_scrub_packet(skb, true);
skb->priority = 0;
+ skb->tstamp = 0;
return 0;
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..678fc5416ae1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
*/
void skb_scrub_packet(struct sk_buff *skb, bool xnet)
{
- skb->tstamp = 0;
skb->pkt_type = PACKET_HOST;
skb->skb_iif = 0;
skb->ignore_df = 0;
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 03/18] posix-timers: Add CLOCKID_INVALID mask
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
posix-timers.h states that a clockid_t value is invalid if bits 0, 1 and
2 are all set. Add a mask that can be safely used elsewhere even if this
implicit rule's implementation is changed.
This is done in preparation for the upcoming time based transmission
patchset.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
include/linux/posix-timers.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index c85704fcdbd2..0ba677cc8da6 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -28,6 +28,7 @@ struct cpu_timer_list {
*
* A clockid is invalid if bits 2, 1, and 0 are all set.
*/
+#define CLOCKID_INVALID GENMASK(2, 0)
#define CPUCLOCK_PID(clock) ((pid_t) ~((clock) >> 3))
#define CPUCLOCK_PERTHREAD(clock) \
(((clock) & (clockid_t) CPUCLOCK_PERTHREAD_MASK) != 0)
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 04/18] net: Add a new socket option for a future transmit time.
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
From: Richard Cochran <rcochran@linutronix.de>
This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2).
A new field is added to struct sockcm_cookie, and the tstamp from
skbuffs will be used later on.
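From the user-space side, attaching the transmit time to a message could
look roughly like the sketch below. It builds a single SCM_TXTIME
control message and copies the value with memcpy() so that no alignment
assumption is made about the CMSG payload. The helper name
txtime_cmsg_fill() is hypothetical, and the fallback value 61 is the one
proposed by this patch for asm-generic.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_TXTIME
#define SO_TXTIME 61 /* value proposed in this patch */
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

/* Attach one SCM_TXTIME cmsg carrying 'txtime' to 'msg'.
 * Returns 0 on success, -1 if the control buffer is too small. */
static int txtime_cmsg_fill(struct msghdr *msg, void *control,
			    size_t control_len, uint64_t txtime)
{
	struct cmsghdr *cmsg;

	if (control_len < CMSG_SPACE(sizeof(txtime)))
		return -1;

	memset(control, 0, control_len);
	msg->msg_control = control;
	msg->msg_controllen = CMSG_SPACE(sizeof(txtime));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_TXTIME;
	cmsg->cmsg_len = CMSG_LEN(sizeof(txtime));
	memcpy(CMSG_DATA(cmsg), &txtime, sizeof(txtime));
	return 0;
}
```

The socket must have SO_TXTIME enabled via setsockopt() beforehand,
otherwise the kernel side added by this patch rejects the cmsg with
EINVAL.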
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
arch/alpha/include/uapi/asm/socket.h | 3 +++
arch/frv/include/uapi/asm/socket.h | 3 +++
arch/ia64/include/uapi/asm/socket.h | 3 +++
arch/m32r/include/uapi/asm/socket.h | 3 +++
arch/mips/include/uapi/asm/socket.h | 3 +++
arch/mn10300/include/uapi/asm/socket.h | 3 +++
arch/parisc/include/uapi/asm/socket.h | 3 +++
arch/s390/include/uapi/asm/socket.h | 3 +++
arch/sparc/include/uapi/asm/socket.h | 3 +++
arch/xtensa/include/uapi/asm/socket.h | 3 +++
include/net/sock.h | 2 ++
include/uapi/asm-generic/socket.h | 3 +++
net/core/sock.c | 21 +++++++++++++++++++++
13 files changed, 56 insertions(+)
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index be14f16149d5..065fb372e355 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -112,4 +112,7 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 9168e78fa32a..0e95f45cd058 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -105,5 +105,8 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 3efba40adc54..c872c4e6bafb 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -114,4 +114,7 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index cf5018e82c3d..65276c95b8df 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -105,4 +105,7 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 49c3d4795963..71370fb3ceef 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -123,4 +123,7 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index b35eee132142..d029a40b1b55 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -105,4 +105,7 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 1d0fdc3b5d22..061b9cf2a779 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -104,4 +104,7 @@
#define SO_ZEROCOPY 0x4035
+#define SO_TXTIME 0x4036
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 3510c0fd06f4..39d901476ee5 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -111,4 +111,7 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index d58520c2e6ff..7ea35e5601b6 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -101,6 +101,9 @@
#define SO_ZEROCOPY 0x003e
+#define SO_TXTIME 0x003f
+#define SCM_TXTIME SO_TXTIME
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
#define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 75a07b8119a9..1de07a7f7680 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -116,4 +116,7 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index b9624581d639..16a90a69c9b3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -778,6 +778,7 @@ enum sock_flags {
SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
+ SOCK_TXTIME,
};
#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1568,6 +1569,7 @@ void sock_kzfree_s(struct sock *sk, void *mem, int size);
void sk_send_sigurg(struct sock *sk);
struct sockcm_cookie {
+ u64 transmit_time;
u32 mark;
u16 tsflags;
};
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 0ae758c90e54..a12692e5f7a8 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -107,4 +107,7 @@
#define SO_ZEROCOPY 60
+#define SO_TXTIME 61
+#define SCM_TXTIME SO_TXTIME
+
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 27f218bba43f..2ba09f311e71 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -91,6 +91,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <asm/unaligned.h>
#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/errqueue.h>
@@ -1065,6 +1066,15 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
}
break;
+ case SO_TXTIME:
+ if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+ ret = -EPERM;
+ else if (val < 0 || val > 1)
+ ret = -EINVAL;
+ else
+ sock_valbool_flag(sk, SOCK_TXTIME, valbool);
+ break;
+
default:
ret = -ENOPROTOOPT;
break;
@@ -1398,6 +1408,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
v.val = sock_flag(sk, SOCK_ZEROCOPY);
break;
+ case SO_TXTIME:
+ v.val = sock_flag(sk, SOCK_TXTIME);
+ break;
+
default:
/* We implement the SO_SNDLOWAT etc to not be settable
* (1003.1g 7).
@@ -2132,6 +2146,13 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
sockc->tsflags |= tsflags;
break;
+ case SO_TXTIME:
+ if (!sock_flag(sk, SOCK_TXTIME))
+ return -EINVAL;
+ if (cmsg->cmsg_len != CMSG_LEN(sizeof(u64)))
+ return -EINVAL;
+ sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg));
+ break;
/* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */
case SCM_RIGHTS:
case SCM_CREDENTIALS:
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission.
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
From: Richard Cochran <rcochran@linutronix.de>
For raw packets, copy the desired future transmit time from the CMSG
cookie into the skb.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
net/ipv4/raw.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 54648d20bf0f..8e05970ba7c4 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+ skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
@@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
ipc.sockc.tsflags = sk->sk_tsflags;
+ ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 06/18] net: ipv4: udp: Hook into time based transmission.
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (4 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 17:00 ` Willem de Bruijn
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 07/18] net: packet: " Jesus Sanchez-Palencia
` (13 subsequent siblings)
19 siblings, 1 reply; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
From: Richard Cochran <rcochran@linutronix.de>
For udp packets, copy the desired future transmit time from the CMSG
cookie into the skb.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
net/ipv4/udp.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404d0935..d683bbde526b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -926,6 +926,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
ipc.sockc.tsflags = sk->sk_tsflags;
+ ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
@@ -1040,8 +1041,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
sizeof(struct udphdr), &ipc, &rt,
msg->msg_flags);
err = PTR_ERR(skb);
- if (!IS_ERR_OR_NULL(skb))
+ if (!IS_ERR_OR_NULL(skb)) {
+ skb->tstamp = ipc.sockc.transmit_time;
err = udp_send_skb(skb, fl4);
+ }
goto out;
}
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 07/18] net: packet: Hook into time based transmission.
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (5 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 06/18] net: ipv4: udp: " Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params Jesus Sanchez-Palencia
` (12 subsequent siblings)
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
From: Richard Cochran <rcochran@linutronix.de>
For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
net/packet/af_packet.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2c5a6fe5d749..b2115fac2a8d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1976,6 +1976,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
goto out_unlock;
}
+ sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1987,6 +1988,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+ skb->tstamp = sockc.transmit_time;
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
@@ -2484,6 +2486,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
+ skb->tstamp = sockc->transmit_time;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
@@ -2660,6 +2663,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
+ sockc.transmit_time = 0;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2856,6 +2860,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
+ sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2928,6 +2933,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
+ skb->tstamp = sockc.transmit_time;
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (6 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 07/18] net: packet: " Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 2:53 ` Eric Dumazet
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 09/18] net: ipv4: raw: Handle remaining txtime parameters Jesus Sanchez-Palencia
` (11 subsequent siblings)
19 siblings, 1 reply; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
a drop_if_late flag. With this commit the API becomes:
- use SO_TXTIME to enable the feature on a socket;
- pass the per-packet arguments through the cmsg header using:
* SCM_CLOCKID for the clockid to be used as the txtime clock source;
* SCM_TXTIME for the txtime timestamp;
* SCM_DROP_IF_LATE for the drop flag. This flag will be used by the
traffic control to decide if a delayed packet should be dropped.
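
As an illustration of the API described above, a user-space sender could
build the three control messages roughly as follows. This is only a sketch
under stated assumptions: the RFC_SCM_* numbers are the asm-generic values
proposed in this RFC (not a stable ABI), the payload sizes (u64, clockid_t,
u8) follow the length checks in __sock_cmsg_send() from this patch, and the
helper names put_cmsg_after() and txtime_fill_cmsgs() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Assumption: asm-generic values proposed by this RFC, not a stable ABI. */
#define RFC_SCM_TXTIME       61
#define RFC_SCM_DROP_IF_LATE 62
#define RFC_SCM_CLOCKID      63

/* Room for one u64, one clockid_t and one u8 control message. */
#define TXTIME_CTRL_LEN (CMSG_SPACE(sizeof(uint64_t)) + \
			 CMSG_SPACE(sizeof(clockid_t)) + \
			 CMSG_SPACE(sizeof(uint8_t)))

/* Hypothetical helper: write one SOL_SOCKET cmsg after 'prev'
 * (or as the first cmsg when 'prev' is NULL).
 */
static struct cmsghdr *put_cmsg_after(struct msghdr *msg, struct cmsghdr *prev,
				      int type, const void *data, size_t len)
{
	struct cmsghdr *cm = prev ? CMSG_NXTHDR(msg, prev) : CMSG_FIRSTHDR(msg);

	assert(cm != NULL);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = type;
	cm->cmsg_len = CMSG_LEN(len);
	memcpy(CMSG_DATA(cm), data, len);
	return cm;
}

/* Hypothetical helper: attach txtime, clockid and drop_if_late to 'msg'.
 * 'ctrl' must be at least TXTIME_CTRL_LEN bytes and aligned for cmsghdr.
 */
static void txtime_fill_cmsgs(struct msghdr *msg, void *ctrl,
			      uint64_t txtime_ns, clockid_t clkid,
			      uint8_t drop_if_late)
{
	struct cmsghdr *cm;

	/* Zero first: CMSG_NXTHDR() may inspect the next, unwritten header. */
	memset(ctrl, 0, TXTIME_CTRL_LEN);
	msg->msg_control = ctrl;
	msg->msg_controllen = TXTIME_CTRL_LEN;

	cm = put_cmsg_after(msg, NULL, RFC_SCM_TXTIME,
			    &txtime_ns, sizeof(txtime_ns));
	cm = put_cmsg_after(msg, cm, RFC_SCM_CLOCKID, &clkid, sizeof(clkid));
	put_cmsg_after(msg, cm, RFC_SCM_DROP_IF_LATE,
		       &drop_if_late, sizeof(drop_if_late));
}
```

The filled msghdr would then be passed to sendmsg() on a socket that has
SO_TXTIME enabled. According to the checks in this patch, each of the three
cmsg types returns -EINVAL if SOCK_TXTIME is not set on the socket.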
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
arch/alpha/include/uapi/asm/socket.h | 2 ++
arch/frv/include/uapi/asm/socket.h | 2 ++
arch/ia64/include/uapi/asm/socket.h | 2 ++
arch/m32r/include/uapi/asm/socket.h | 2 ++
arch/mips/include/uapi/asm/socket.h | 2 ++
arch/mn10300/include/uapi/asm/socket.h | 2 ++
arch/parisc/include/uapi/asm/socket.h | 2 ++
arch/s390/include/uapi/asm/socket.h | 2 ++
arch/sparc/include/uapi/asm/socket.h | 2 ++
arch/xtensa/include/uapi/asm/socket.h | 2 ++
include/linux/skbuff.h | 3 +++
include/net/sock.h | 2 ++
include/uapi/asm-generic/socket.h | 2 ++
net/core/sock.c | 22 +++++++++++++++++++++-
14 files changed, 48 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 065fb372e355..3399dfefa579 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -114,5 +114,7 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 0e95f45cd058..43b636836722 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -107,6 +107,8 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index c872c4e6bafb..1f06d07aadbe 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -116,5 +116,7 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 65276c95b8df..69ab380d8d48 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -107,5 +107,7 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 71370fb3ceef..97da79f58538 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -125,5 +125,7 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index d029a40b1b55..7c7a174fdfae 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -107,5 +107,7 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 061b9cf2a779..7fe86b5cd593 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -106,5 +106,7 @@
#define SO_TXTIME 0x4036
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 0x4037
+#define SCM_CLOCKID 0x4038
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 39d901476ee5..97f90c4a9b8c 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -113,5 +113,7 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 7ea35e5601b6..6397c366dd2d 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -103,6 +103,8 @@
#define SO_TXTIME 0x003f
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 0x0040
+#define SCM_CLOCKID 0x0041
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 1de07a7f7680..bc81b02a1f5f 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -118,5 +118,7 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d8340e6e8814..951969ceaf65 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -788,6 +788,9 @@ struct sk_buff {
__u8 tc_redirected:1;
__u8 tc_from_ingress:1;
#endif
+ __u8 tc_drop_if_late:1;
+
+ clockid_t txtime_clockid;
#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
diff --git a/include/net/sock.h b/include/net/sock.h
index 16a90a69c9b3..50e36e0f62f6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1571,7 +1571,9 @@ void sk_send_sigurg(struct sock *sk);
struct sockcm_cookie {
u64 transmit_time;
u32 mark;
+ clockid_t clockid;
u16 tsflags;
+ u8 drop_if_late;
};
int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index a12692e5f7a8..c9e1ea0097e1 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -109,5 +109,7 @@
#define SO_TXTIME 61
#define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE 62
+#define SCM_CLOCKID 63
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 2ba09f311e71..51cfade342ec 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2126,6 +2126,7 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
struct sockcm_cookie *sockc)
{
u32 tsflags;
+ u8 drop;
switch (cmsg->cmsg_type) {
case SO_MARK:
@@ -2146,13 +2147,32 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
sockc->tsflags |= tsflags;
break;
- case SO_TXTIME:
+ case SCM_TXTIME:
if (!sock_flag(sk, SOCK_TXTIME))
return -EINVAL;
if (cmsg->cmsg_len != CMSG_LEN(sizeof(u64)))
return -EINVAL;
sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg));
break;
+ case SCM_DROP_IF_LATE:
+ if (!sock_flag(sk, SOCK_TXTIME))
+ return -EINVAL;
+ if (cmsg->cmsg_len != CMSG_LEN(sizeof(u8)))
+ return -EINVAL;
+
+ drop = get_unaligned((u8 *)CMSG_DATA(cmsg));
+ if (drop > 1)
+ return -EINVAL;
+
+ sockc->drop_if_late = drop;
+ break;
+ case SCM_CLOCKID:
+ if (!sock_flag(sk, SOCK_TXTIME))
+ return -EINVAL;
+ if (cmsg->cmsg_len != CMSG_LEN(sizeof(clockid_t)))
+ return -EINVAL;
+ sockc->clockid = get_unaligned((clockid_t *)CMSG_DATA(cmsg));
+ break;
/* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */
case SCM_RIGHTS:
case SCM_CREDENTIALS:
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 09/18] net: ipv4: raw: Handle remaining txtime parameters
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (7 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 10/18] net: ipv4: udp: " Jesus Sanchez-Palencia
` (10 subsequent siblings)
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Initialize clockid to CLOCKID_INVALID instead of 0 (i.e.
CLOCK_REALTIME), and copy both drop_if_late and clockid from CMSG cookie
into skb.
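
To illustrate why a sentinel is needed here: CLOCK_REALTIME has the value 0,
so a zero-initialized cookie would be indistinguishable from one where the
sender explicitly asked for CLOCK_REALTIME. The user-space sketch below
shows the pattern. EXAMPLE_CLOCKID_INVALID and struct txtime_params are
hypothetical stand-ins; the real CLOCKID_INVALID comes from
<linux/posix-timers.h> and its value is not assumed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for the kernel's CLOCKID_INVALID; only the
 * "never equal to a valid static clock" property matters.
 */
#define EXAMPLE_CLOCKID_INVALID ((clockid_t)-1)

/* Hypothetical mirror of the txtime fields in struct sockcm_cookie. */
struct txtime_params {
	unsigned long long transmit_time;
	clockid_t clockid;
	unsigned char drop_if_late;
};

/* Reset the per-packet parameters before parsing control messages. */
static void txtime_params_init(struct txtime_params *p)
{
	assert(p != NULL);
	memset(p, 0, sizeof(*p));
	/* Without this line, clockid would read as CLOCK_REALTIME (0). */
	p->clockid = EXAMPLE_CLOCKID_INVALID;
}

/* True only if some cmsg actually supplied a clockid. */
static int txtime_clockid_was_set(const struct txtime_params *p)
{
	return p->clockid != EXAMPLE_CLOCKID_INVALID;
}
```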
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
net/ipv4/raw.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8e05970ba7c4..61b6acccc72b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -79,6 +79,7 @@
#include <linux/netfilter_ipv4.h>
#include <linux/compat.h>
#include <linux/uio.h>
+#include <linux/posix-timers.h>
struct raw_frag_vec {
struct msghdr *msg;
@@ -382,6 +383,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
skb->tstamp = sockc->transmit_time;
+ skb->txtime_clockid = sockc->clockid;
+ skb->tc_drop_if_late = sockc->drop_if_late;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
@@ -564,6 +567,8 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
ipc.sockc.tsflags = sk->sk_tsflags;
ipc.sockc.transmit_time = 0;
+ ipc.sockc.drop_if_late = 0;
+ ipc.sockc.clockid = CLOCKID_INVALID;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 10/18] net: ipv4: udp: Handle remaining txtime parameters
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (8 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 09/18] net: ipv4: raw: Handle remaining txtime parameters Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 11/18] net: packet: " Jesus Sanchez-Palencia
` (9 subsequent siblings)
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Initialize clockid to CLOCKID_INVALID instead of 0 (i.e.
CLOCK_REALTIME), and copy both drop_if_late and clockid from CMSG cookie
into skb.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
net/ipv4/udp.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d683bbde526b..4bea8d5ab968 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -115,6 +115,7 @@
#include "udp_impl.h"
#include <net/sock_reuseport.h>
#include <net/addrconf.h>
+#include <linux/posix-timers.h>
struct udp_table udp_table __read_mostly;
EXPORT_SYMBOL(udp_table);
@@ -927,6 +928,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
ipc.sockc.tsflags = sk->sk_tsflags;
ipc.sockc.transmit_time = 0;
+ ipc.sockc.drop_if_late = 0;
+ ipc.sockc.clockid = CLOCKID_INVALID;
ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
@@ -1043,6 +1046,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb)) {
skb->tstamp = ipc.sockc.transmit_time;
+ skb->txtime_clockid = ipc.sockc.clockid;
+ skb->tc_drop_if_late = ipc.sockc.drop_if_late;
err = udp_send_skb(skb, fl4);
}
goto out;
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 11/18] net: packet: Handle remaining txtime parameters
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (9 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 10/18] net: ipv4: udp: " Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 12/18] net/sched: Allow creating a Qdisc watchdog with other clocks Jesus Sanchez-Palencia
` (8 subsequent siblings)
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Initialize clockid to CLOCKID_INVALID instead of 0 (i.e.
CLOCK_REALTIME), and copy both drop_if_late and clockid from CMSG cookie
into skb.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
net/packet/af_packet.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b2115fac2a8d..e455fbf5a356 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -94,6 +94,7 @@
#endif
#include <linux/bpf.h>
#include <net/compat.h>
+#include <linux/posix-timers.h>
#include "internal.h"
@@ -1977,6 +1978,8 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
}
sockc.transmit_time = 0;
+ sockc.drop_if_late = 0;
+ sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1989,6 +1992,8 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
skb->tstamp = sockc.transmit_time;
+ skb->tc_drop_if_late = sockc.drop_if_late;
+ skb->txtime_clockid = sockc.clockid;
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
@@ -2487,6 +2492,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
skb->tstamp = sockc->transmit_time;
+ skb->tc_drop_if_late = sockc->drop_if_late;
+ skb->txtime_clockid = sockc->clockid;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
@@ -2664,6 +2671,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
goto out_put;
sockc.transmit_time = 0;
+ sockc.drop_if_late = 0;
+ sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2861,6 +2870,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
goto out_unlock;
sockc.transmit_time = 0;
+ sockc.drop_if_late = 0;
+ sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2934,6 +2945,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
skb->tstamp = sockc.transmit_time;
+ skb->tc_drop_if_late = sockc.drop_if_late;
+ skb->txtime_clockid = sockc.clockid;
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 12/18] net/sched: Allow creating a Qdisc watchdog with other clocks
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (10 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 11/18] net: packet: " Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc Jesus Sanchez-Palencia
` (7 subsequent siblings)
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
This adds 'qdisc_watchdog_init_clockid()', which takes a clockid so that
time references other than CLOCK_MONOTONIC can be used when scheduling
the Qdisc to run.
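
The shape of this change (a parameterized init plus a thin wrapper that
keeps the old default) can be sketched in user space as below. This is an
analogy only: the example_watchdog names are hypothetical, and
clock_gettime() stands in for the hrtimer that the real
qdisc_watchdog_init_clockid() sets up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical user-space analogue of struct qdisc_watchdog. */
struct example_watchdog {
	clockid_t clockid;
};

/* New entry point: the caller chooses the time reference. */
static void example_watchdog_init_clockid(struct example_watchdog *wd,
					  clockid_t clockid)
{
	assert(wd != NULL);
	wd->clockid = clockid;
}

/* Old entry point: kept as a wrapper so existing callers still get
 * CLOCK_MONOTONIC without any change.
 */
static void example_watchdog_init(struct example_watchdog *wd)
{
	example_watchdog_init_clockid(wd, CLOCK_MONOTONIC);
}

/* Read "now" on the watchdog's clock, in nanoseconds; -1 on error. */
static int64_t example_watchdog_now_ns(const struct example_watchdog *wd)
{
	struct timespec ts;

	if (clock_gettime(wd->clockid, &ts) != 0)
		return -1;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```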
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
include/net/pkt_sched.h | 2 ++
net/sched/sch_api.c | 11 +++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 815b92a23936..2466ea143d01 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,8 @@ struct qdisc_watchdog {
struct Qdisc *qdisc;
};
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+ clockid_t clockid);
void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 68f9d942bed4..beb1dc296bfb 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+ clockid_t clockid)
{
- hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+ hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED);
wd->timer.function = qdisc_watchdog;
wd->qdisc = qdisc;
}
+EXPORT_SYMBOL(qdisc_watchdog_init_clockid);
+
+void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+{
+ qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC);
+}
EXPORT_SYMBOL(qdisc_watchdog_init);
void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires)
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (11 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 12/18] net/sched: Allow creating a Qdisc watchdog with other clocks Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-21 13:46 ` Thomas Gleixner
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS Jesus Sanchez-Palencia
` (6 subsequent siblings)
19 siblings, 1 reply; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
TBS (Time Based Scheduler) uses the information added earlier in this
series (the socket option SO_TXTIME and the new role of
sk_buff->tstamp) to schedule traffic transmission based on absolute
time.
For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.
Example:
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 100000 \
clockid CLOCK_REALTIME sorting
In this example, the Qdisc will provide SW best-effort control of the
transmission time to the network adapter. The timestamps set on the socket
are in reference to the clockid CLOCK_REALTIME, and packets leave the
Qdisc "delta" (100000) nanoseconds before their transmission time. It will
also enable sorting of the buffered packets based on their txtime.
The qdisc will drop packets on enqueue() if their skbuff clockid does not
match the clock reference of the Qdisc. Moreover, the tc_drop_if_late
flag from skbuffs will be used on dequeue() to determine whether a packet
that has expired while sitting in the queue should be dropped.
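
The enqueue() and dequeue() rules described above can be summarized with
the following user-space sketch. It mirrors the comparisons made in
is_packet_valid() and tbs_dequeue_scheduledfifo() in this patch, but the
function and enum names are hypothetical, times are plain nanosecond
integers, and the SOCK_TXTIME socket-flag check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum example_dequeue_action {
	EXAMPLE_WAIT,	/* too early: leave queued, re-arm the watchdog */
	EXAMPLE_SEND,	/* hand the packet to the netdevice */
	EXAMPLE_DROP,	/* expired and drop_if_late was requested */
};

/* Enqueue side: reject packets whose clock does not match the qdisc,
 * whose txtime is already in the past, or whose txtime is earlier than
 * the last packet sent.
 */
static bool example_enqueue_ok(int pkt_clockid, int qdisc_clockid,
			       int64_t txtime, int64_t now, int64_t last)
{
	if (pkt_clockid != qdisc_clockid)
		return false;
	return txtime >= now && txtime >= last;
}

/* Dequeue side: a packet leaves once now is past (txtime - delta). An
 * expired packet is dropped only if it carries drop_if_late; otherwise
 * it is still sent.
 */
static enum example_dequeue_action
example_dequeue_decide(int64_t txtime, int64_t now, int64_t delta,
		       bool drop_if_late)
{
	assert(delta >= 0);

	if (drop_if_late && txtime < now)
		return EXAMPLE_DROP;
	if (now > txtime - delta)
		return EXAMPLE_SEND;
	return EXAMPLE_WAIT;
}
```

With delta set to 100000 ns as in the tc example, a packet with txtime T
becomes eligible once the qdisc clock passes T - 100000.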
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
include/linux/netdevice.h | 1 +
include/uapi/linux/pkt_sched.h | 17 ++
net/sched/Kconfig | 11 +
net/sched/Makefile | 1 +
net/sched/sch_tbs.c | 474 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 504 insertions(+)
create mode 100644 net/sched/sch_tbs.c
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7104de2bc957..09b5b2e08f04 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -781,6 +781,7 @@ enum tc_setup_type {
TC_SETUP_QDISC_CBS,
TC_SETUP_QDISC_RED,
TC_SETUP_QDISC_PRIO,
+ TC_SETUP_QDISC_TBS,
};
/* These structures hold the attributes of bpf state that are being passed
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..a33b5b9da81a 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,21 @@ enum {
#define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
+
+/* TBS */
+struct tc_tbs_qopt {
+ __s32 delta;
+ __s32 clockid;
+ __u32 flags;
+#define TC_TBS_SORTING_ON BIT(0)
+};
+
+enum {
+ TCA_TBS_UNSPEC,
+ TCA_TBS_PARMS,
+ __TCA_TBS_MAX,
+};
+
+#define TCA_TBS_MAX (__TCA_TBS_MAX - 1)
+
#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..9e68fef78d50 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -183,6 +183,17 @@ config NET_SCH_CBS
To compile this code as a module, choose M here: the
module will be called sch_cbs.
+config NET_SCH_TBS
+ tristate "Time Based Scheduler (TBS)"
+ ---help---
+ Say Y here if you want to use the Time Based Scheduler (TBS) packet
+ scheduling algorithm.
+
+ See the top of <file:net/sched/sch_tbs.c> for more details.
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_tbs.
+
config NET_SCH_GRED
tristate "Generic Random Early Detection (GRED)"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..f02378a0a8f2 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ) += sch_fq.o
obj-$(CONFIG_NET_SCH_HHF) += sch_hhf.o
obj-$(CONFIG_NET_SCH_PIE) += sch_pie.o
obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
+obj-$(CONFIG_NET_SCH_TBS) += sch_tbs.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
new file mode 100644
index 000000000000..c19eedda9bc5
--- /dev/null
+++ b/net/sched/sch_tbs.c
@@ -0,0 +1,474 @@
+/*
+ * net/sched/sch_tbs.c Time Based Scheduler
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
+ * Vinicius Costa Gomes <vinicius.gomes@intel.com>
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/rbtree.h>
+#include <linux/skbuff.h>
+#include <linux/posix-timers.h>
+#include <net/netlink.h>
+#include <net/sch_generic.h>
+#include <net/pkt_sched.h>
+#include <net/sock.h>
+
+#define SORTING_IS_ON(x) (x->flags & TC_TBS_SORTING_ON)
+
+struct tbs_sched_data {
+ bool sorting;
+ int clockid;
+ int queue;
+ s32 delta; /* in ns */
+ ktime_t last; /* The txtime of the last skb sent to the netdevice. */
+ struct rb_root head;
+ struct qdisc_watchdog watchdog;
+ struct Qdisc *qdisc;
+ int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
+ struct sk_buff **to_free);
+ struct sk_buff *(*dequeue)(struct Qdisc *sch);
+ struct sk_buff *(*peek)(struct Qdisc *sch);
+};
+
+static const struct nla_policy tbs_policy[TCA_TBS_MAX + 1] = {
+ [TCA_TBS_PARMS] = { .len = sizeof(struct tc_tbs_qopt) },
+};
+
+typedef ktime_t (*get_time_func_t)(void);
+
+static const get_time_func_t clockid_to_get_time[MAX_CLOCKS] = {
+ [CLOCK_MONOTONIC] = ktime_get,
+ [CLOCK_REALTIME] = ktime_get_real,
+ [CLOCK_BOOTTIME] = ktime_get_boottime,
+ [CLOCK_TAI] = ktime_get_clocktai,
+};
+
+static ktime_t get_time_by_clockid(clockid_t clockid)
+{
+ get_time_func_t func = clockid_to_get_time[clockid];
+
+ if (!func)
+ return 0;
+
+ return func();
+}
+
+static inline int validate_input_params(struct tc_tbs_qopt *qopt,
+ struct netlink_ext_ack *extack)
+{
+ /* Check if params comply to the following rules:
+ * * If SW best-effort, then clockid and delta must be valid
+ * regardless of sorting enabled or not.
+ *
+ * * Dynamic clockids are not supported.
+ * * Delta must be a positive integer.
+ */
+ if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
+ qopt->clockid >= MAX_CLOCKS) {
+ NL_SET_ERR_MSG(extack, "Invalid clockid");
+ return -EINVAL;
+ } else if (qopt->clockid < 0 ||
+ !clockid_to_get_time[qopt->clockid]) {
+ NL_SET_ERR_MSG(extack, "Clockid is not supported");
+ return -ENOTSUPP;
+ }
+
+ if (qopt->delta < 0) {
+ NL_SET_ERR_MSG(extack, "Delta must be positive");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ ktime_t txtime = nskb->tstamp;
+ struct sock *sk = nskb->sk;
+ ktime_t now;
+
+ if (sk && !sock_flag(sk, SOCK_TXTIME))
+ return false;
+
+ /* We don't perform crosstimestamping.
+ * Drop if packet's clockid differs from qdisc's.
+ */
+ if (nskb->txtime_clockid != q->clockid)
+ return false;
+
+ now = get_time_by_clockid(q->clockid);
+ if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
+ return false;
+
+ return true;
+}
+
+static struct sk_buff *tbs_peek(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+
+ return q->peek(sch);
+}
+
+static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct rb_node *p;
+
+ p = rb_first(&q->head);
+ if (!p)
+ return NULL;
+
+ return rb_to_skb(p);
+}
+
+static void reset_watchdog(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb = tbs_peek(sch);
+ ktime_t next;
+
+ if (!skb)
+ return;
+
+ next = ktime_sub_ns(skb->tstamp, q->delta);
+ qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
+}
+
+static int tbs_enqueue(struct sk_buff *nskb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+
+ return q->enqueue(nskb, sch, to_free);
+}
+
+static int tbs_enqueue_scheduledfifo(struct sk_buff *nskb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ int err;
+
+ if (!is_packet_valid(sch, nskb))
+ return qdisc_drop(nskb, sch, to_free);
+
+ err = qdisc_enqueue_tail(nskb, sch);
+
+ /* If there is only 1 packet, then we must reset the watchdog. */
+ if (err >= 0 && sch->q.qlen == 1)
+ reset_watchdog(sch);
+
+ return err;
+}
+
+static int tbs_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct rb_node **p = &q->head.rb_node, *parent = NULL;
+ ktime_t txtime = nskb->tstamp;
+
+ if (!is_packet_valid(sch, nskb))
+ return qdisc_drop(nskb, sch, to_free);
+
+ while (*p) {
+ struct sk_buff *skb;
+
+ parent = *p;
+ skb = rb_to_skb(parent);
+ if (ktime_after(txtime, skb->tstamp))
+ p = &parent->rb_right;
+ else
+ p = &parent->rb_left;
+ }
+ rb_link_node(&nskb->rbnode, parent, p);
+ rb_insert_color(&nskb->rbnode, &q->head);
+
+ qdisc_qstats_backlog_inc(sch, nskb);
+ sch->q.qlen++;
+
+ /* Now we may need to re-arm the qdisc watchdog for the next packet. */
+ reset_watchdog(sch);
+
+ return NET_XMIT_SUCCESS;
+}
+
+static void timesortedlist_erase(struct Qdisc *sch, struct sk_buff *skb,
+ bool drop)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+
+ rb_erase(&skb->rbnode, &q->head);
+
+ qdisc_qstats_backlog_dec(sch, skb);
+
+ if (drop) {
+ struct sk_buff *to_free = NULL;
+
+ qdisc_drop(skb, sch, &to_free);
+ kfree_skb_list(to_free);
+ qdisc_qstats_overlimit(sch);
+ } else {
+ qdisc_bstats_update(sch, skb);
+
+ q->last = skb->tstamp;
+ }
+
+ sch->q.qlen--;
+
+ /* The rbnode field in the skb re-uses these fields, now that
+ * we are done with the rbnode, reset them.
+ */
+ skb->next = NULL;
+ skb->prev = NULL;
+ skb->dev = qdisc_dev(sch);
+}
+
+static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+
+ return q->dequeue(sch);
+}
+
+static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb = tbs_peek(sch);
+ ktime_t now, next;
+
+ if (!skb)
+ return NULL;
+
+ now = get_time_by_clockid(q->clockid);
+
+ /* Drop if packet has expired while in queue and the drop_if_late
+ * flag is set.
+ */
+ if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
+ struct sk_buff *to_free = NULL;
+
+ qdisc_queue_drop_head(sch, &to_free);
+ kfree_skb_list(to_free);
+ qdisc_qstats_overlimit(sch);
+
+ skb = NULL;
+ goto out;
+ }
+
+ next = ktime_sub_ns(skb->tstamp, q->delta);
+
+ /* Dequeue only if now is within the [txtime - delta, txtime] range. */
+ if (ktime_after(now, next))
+ skb = qdisc_dequeue_head(sch);
+ else
+ skb = NULL;
+
+out:
+ /* Now we may need to re-arm the qdisc watchdog for the next packet. */
+ reset_watchdog(sch);
+
+ return skb;
+}
+
+static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+ ktime_t now, next;
+
+ skb = tbs_peek(sch);
+ if (!skb)
+ return NULL;
+
+ now = get_time_by_clockid(q->clockid);
+
+ /* Drop if packet has expired while in queue and the drop_if_late
+ * flag is set.
+ */
+ if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
+ timesortedlist_erase(sch, skb, true);
+ skb = NULL;
+ goto out;
+ }
+
+ next = ktime_sub_ns(skb->tstamp, q->delta);
+
+ /* Dequeue only if now is within the [txtime - delta, txtime] range. */
+ if (ktime_after(now, next))
+ timesortedlist_erase(sch, skb, false);
+ else
+ skb = NULL;
+
+out:
+ /* Now we may need to re-arm the qdisc watchdog for the next packet. */
+ reset_watchdog(sch);
+
+ return skb;
+}
+
+static inline void setup_queueing_mode(struct tbs_sched_data *q)
+{
+ if (q->sorting) {
+ q->enqueue = tbs_enqueue_timesortedlist;
+ q->dequeue = tbs_dequeue_timesortedlist;
+ q->peek = tbs_peek_timesortedlist;
+ } else {
+ q->enqueue = tbs_enqueue_scheduledfifo;
+ q->dequeue = tbs_dequeue_scheduledfifo;
+ q->peek = qdisc_peek_head;
+ }
+}
+
+static int tbs_init(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct net_device *dev = qdisc_dev(sch);
+ struct nlattr *tb[TCA_TBS_MAX + 1];
+ struct tc_tbs_qopt *qopt;
+ int err;
+
+ if (!opt) {
+ NL_SET_ERR_MSG(extack, "Missing TBS qdisc options which are mandatory");
+ return -EINVAL;
+ }
+
+ err = nla_parse_nested(tb, TCA_TBS_MAX, opt, tbs_policy, extack);
+ if (err < 0)
+ return err;
+
+ if (!tb[TCA_TBS_PARMS]) {
+ NL_SET_ERR_MSG(extack, "Missing mandatory TBS parameters");
+ return -EINVAL;
+ }
+
+ qopt = nla_data(tb[TCA_TBS_PARMS]);
+
+ pr_debug("delta %d clockid %d sorting %s\n",
+ qopt->delta, qopt->clockid,
+ SORTING_IS_ON(qopt) ? "on" : "off");
+
+ err = validate_input_params(qopt, extack);
+ if (err < 0)
+ return err;
+
+ q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);
+
+ /* Everything went OK, save the parameters used. */
+ q->delta = qopt->delta;
+ q->clockid = qopt->clockid;
+ q->sorting = SORTING_IS_ON(qopt);
+
+ /* Select queueing mode based on parameters. */
+ setup_queueing_mode(q);
+
+ qdisc_watchdog_init_clockid(&q->watchdog, sch, q->clockid);
+
+ return 0;
+}
+
+static void timesortedlist_clear(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct rb_node *p = rb_first(&q->head);
+
+ while (p) {
+ struct sk_buff *skb = rb_to_skb(p);
+
+ p = rb_next(p);
+
+ rb_erase(&skb->rbnode, &q->head);
+ rtnl_kfree_skbs(skb, skb);
+ sch->q.qlen--;
+ }
+}
+
+static void tbs_reset(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+
+ /* Only cancel watchdog if it's been initialized. */
+ if (q->watchdog.qdisc == sch)
+ qdisc_watchdog_cancel(&q->watchdog);
+
+ /* No matter which mode we are on, it's safe to clear both lists. */
+ timesortedlist_clear(sch);
+ __qdisc_reset_queue(&sch->q);
+
+ sch->qstats.backlog = 0;
+ sch->q.qlen = 0;
+
+ q->last = 0;
+}
+
+static void tbs_destroy(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+
+ /* Only cancel watchdog if it's been initialized. */
+ if (q->watchdog.qdisc == sch)
+ qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static int tbs_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct tc_tbs_qopt opt = { };
+ struct nlattr *nest;
+
+ nest = nla_nest_start(skb, TCA_OPTIONS);
+ if (!nest)
+ goto nla_put_failure;
+
+ opt.delta = q->delta;
+ opt.clockid = q->clockid;
+ if (q->sorting)
+ opt.flags |= TC_TBS_SORTING_ON;
+
+ if (nla_put(skb, TCA_TBS_PARMS, sizeof(opt), &opt))
+ goto nla_put_failure;
+
+ return nla_nest_end(skb, nest);
+
+nla_put_failure:
+ nla_nest_cancel(skb, nest);
+ return -1;
+}
+
+static struct Qdisc_ops tbs_qdisc_ops __read_mostly = {
+ .id = "tbs",
+ .priv_size = sizeof(struct tbs_sched_data),
+ .enqueue = tbs_enqueue,
+ .dequeue = tbs_dequeue,
+ .peek = tbs_peek,
+ .init = tbs_init,
+ .reset = tbs_reset,
+ .destroy = tbs_destroy,
+ .dump = tbs_dump,
+ .owner = THIS_MODULE,
+};
+
+static int __init tbs_module_init(void)
+{
+ return register_qdisc(&tbs_qdisc_ops);
+}
+
+static void __exit tbs_module_exit(void)
+{
+ unregister_qdisc(&tbs_qdisc_ops);
+}
+module_init(tbs_module_init)
+module_exit(tbs_module_exit)
+MODULE_LICENSE("GPL");
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (12 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-21 14:22 ` Thomas Gleixner
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 15/18] igb: Refactor igb_configure_cbs() Jesus Sanchez-Palencia
` (5 subsequent siblings)
19 siblings, 1 reply; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Add new queueing modes to tbs qdisc so HW offload is supported.
For hw offload, if sorting is on, then the time sorted list will still
be used, but when sorting is disabled the enqueue / dequeue flow will
be based on a 'raw' FIFO through the usage of qdisc_enqueue_tail() and
qdisc_dequeue_head(). For the 'raw hw offload' mode, the drop_if_late
flag from skbuffs is not used by the Qdisc since this mode implicitly
assumes the PHC clock is being used by applications.
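The mode selection described above can be summarised in a small
stand-alone sketch. This is not the patch code: the patch stores function
pointers in struct tbs_sched_data, while the enum, the SKETCH_* flag names
and sketch_select_mode() below are hypothetical and only mirror the
offload/sorting decision made in setup_queueing_mode().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for TC_TBS_SORTING_ON and TC_TBS_OFFLOAD_ON. */
#define SKETCH_SORTING_ON	(1u << 0)
#define SKETCH_OFFLOAD_ON	(1u << 1)

/* Hypothetical labels for the three enqueue/dequeue flows. */
enum sketch_mode {
	SKETCH_MODE_TIMESORTEDLIST,	/* sorting on, with or without offload */
	SKETCH_MODE_SCHEDULEDFIFO,	/* SW best effort, sorting off */
	SKETCH_MODE_RAW_FIFO,		/* HW offload, sorting off */
};

static enum sketch_mode sketch_select_mode(uint32_t flags)
{
	/* Only the two flag bits defined by the series are expected. */
	assert((flags & ~(SKETCH_SORTING_ON | SKETCH_OFFLOAD_ON)) == 0);

	/* Sorting always uses the time sorted list, offloaded or not. */
	if (flags & SKETCH_SORTING_ON)
		return SKETCH_MODE_TIMESORTEDLIST;

	return (flags & SKETCH_OFFLOAD_ON) ? SKETCH_MODE_RAW_FIFO
					   : SKETCH_MODE_SCHEDULEDFIFO;
}
```

Only the raw FIFO flow skips the qdisc watchdog, which matches the
"!q->offload || q->sorting" condition used in tbs_init().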
Example 1:
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 tbs offload
In this example, the Qdisc uses HW offload to control the transmission
time through the network adapter. The timestamps in skbuffs are assumed
to be in reference to the interface's PHC, so setting a clockid is
treated as an error. Because no scheduling is performed in the qdisc,
setting a delta != 0 is also considered an error.
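The parameter rules for the two offload flavours can be sketched outside
the kernel as follows. This is a simplified illustration under stated
assumptions: struct sketch_tbs_opts, sketch_validate() and the -1 "unset"
sentinel are hypothetical, and the sketch returns -EINVAL everywhere,
whereas validate_input_params() in the patch compares against
CLOCKID_INVALID and distinguishes unsupported clockids.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: -1 stands in for "clockid was not set by the user". */
#define SKETCH_CLOCKID_UNSET	(-1)

struct sketch_tbs_opts {
	int32_t delta;		/* nanoseconds */
	int32_t clockid;
	bool offload;
	bool sorting;
};

/* Returns 0 when the combination is acceptable, -EINVAL otherwise. */
static int sketch_validate(const struct sketch_tbs_opts *o)
{
	assert(o);

	if (o->offload && !o->sorting) {
		/* Raw HW offload: the PHC is implied and nothing is scheduled. */
		if (o->delta != 0 || o->clockid != SKETCH_CLOCKID_UNSET)
			return -EINVAL;
		return 0;
	}

	/* SW best effort, or offload with sorting: both must be usable. */
	if (o->clockid < 0 || o->delta < 0)
		return -EINVAL;

	return 0;
}
```

With these rules, Example 1 (offload only) passes with neither parameter
set, and fails as soon as a delta or a clockid is supplied.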
Example 2:
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
clockid CLOCK_REALTIME sorting
Here, the Qdisc again uses HW offload for the txtime control, but now
sorting is enabled, so the qdisc itself performs scheduling. Scheduling
is based on the CLOCK_REALTIME clockid, and packets leave the Qdisc
"delta" (100000) nanoseconds before their transmission time. Because
this mode uses HW offload and dynamic clocks are not supported by the
hrtimer, the system clock and the PHC clock must be synchronized for
this mode to behave as expected.
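The "delta" window can be illustrated with a small user-space sketch of
the per-packet decision taken at dequeue time. It follows the comparisons
used by the sorted and scheduled-FIFO dequeue functions from the previous
patch (drop when late and flagged, send once now is past txtime - delta),
but the names sketch_verdict and sketch_dequeue_verdict() are hypothetical
and plain int64_t nanoseconds replace ktime_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sketch_verdict {
	SKETCH_HOLD,	/* too early, wait for the watchdog */
	SKETCH_SEND,	/* inside the (txtime - delta, txtime] window */
	SKETCH_DROP,	/* expired and drop_if_late was requested */
};

static enum sketch_verdict sketch_dequeue_verdict(int64_t now_ns,
						  int64_t txtime_ns,
						  int64_t delta_ns,
						  bool drop_if_late)
{
	assert(delta_ns >= 0);

	/* Expired packets are only dropped when the flag asks for it. */
	if (drop_if_late && txtime_ns < now_ns)
		return SKETCH_DROP;

	/* Release the packet once now is later than txtime - delta. */
	if (now_ns > txtime_ns - delta_ns)
		return SKETCH_SEND;

	return SKETCH_HOLD;
}
```

For a txtime of 1000000 ns and a delta of 100000 ns, a packet is held
until just after 900000 ns, and a late packet without drop_if_late is
still sent.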
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
include/net/pkt_sched.h | 5 ++
include/uapi/linux/pkt_sched.h | 1 +
net/sched/sch_tbs.c | 159 +++++++++++++++++++++++++++++++++++------
3 files changed, 144 insertions(+), 21 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2466ea143d01..d042ffda7f21 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload {
s32 sendslope;
};
+struct tc_tbs_qopt_offload {
+ u8 enable;
+ s32 queue;
+};
+
#endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index a33b5b9da81a..92af9fa4dee4 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -941,6 +941,7 @@ struct tc_tbs_qopt {
__s32 clockid;
__u32 flags;
#define TC_TBS_SORTING_ON BIT(0)
+#define TC_TBS_OFFLOAD_ON BIT(1)
};
enum {
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
index c19eedda9bc5..2aafa55de42c 100644
--- a/net/sched/sch_tbs.c
+++ b/net/sched/sch_tbs.c
@@ -25,8 +25,10 @@
#include <net/sock.h>
#define SORTING_IS_ON(x) (x->flags & TC_TBS_SORTING_ON)
+#define OFFLOAD_IS_ON(x) (x->flags & TC_TBS_OFFLOAD_ON)
struct tbs_sched_data {
+ bool offload;
bool sorting;
int clockid;
int queue;
@@ -68,25 +70,42 @@ static inline int validate_input_params(struct tc_tbs_qopt *qopt,
struct netlink_ext_ack *extack)
{
/* Check if params comply to the following rules:
- * * If SW best-effort, then clockid and delta must be valid
- * regardless of sorting enabled or not.
+ * * If SW best-effort, then clockid and delta must be valid.
+ *
+ * * If HW offload is ON and sorting is ON, then clockid and delta
+ * must be valid.
+ *
+ * * If HW offload is ON and sorting is OFF, then clockid and
+ * delta must not have been set. The netdevice PHC will be used
+ * implicitly.
*
* * Dynamic clockids are not supported.
* * Delta must be a positive integer.
*/
- if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
- qopt->clockid >= MAX_CLOCKS) {
- NL_SET_ERR_MSG(extack, "Invalid clockid");
- return -EINVAL;
- } else if (qopt->clockid < 0 ||
- !clockid_to_get_time[qopt->clockid]) {
- NL_SET_ERR_MSG(extack, "Clockid is not supported");
- return -ENOTSUPP;
- }
-
- if (qopt->delta < 0) {
- NL_SET_ERR_MSG(extack, "Delta must be positive");
- return -EINVAL;
+ if (!OFFLOAD_IS_ON(qopt) || SORTING_IS_ON(qopt)) {
+ if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
+ qopt->clockid >= MAX_CLOCKS) {
+ NL_SET_ERR_MSG(extack, "Invalid clockid");
+ return -EINVAL;
+ } else if (qopt->clockid < 0 ||
+ !clockid_to_get_time[qopt->clockid]) {
+ NL_SET_ERR_MSG(extack, "Clockid is not supported");
+ return -ENOTSUPP;
+ }
+
+ if (qopt->delta < 0) {
+ NL_SET_ERR_MSG(extack, "Delta must be positive");
+ return -EINVAL;
+ }
+ } else {
+ if (qopt->delta != 0) {
+ NL_SET_ERR_MSG(extack, "Cannot set delta for this mode");
+ return -EINVAL;
+ }
+ if ((qopt->clockid & CLOCKID_INVALID) != CLOCKID_INVALID) {
+ NL_SET_ERR_MSG(extack, "Cannot set clockid for this mode");
+ return -EINVAL;
+ }
}
return 0;
@@ -155,6 +174,15 @@ static int tbs_enqueue(struct sk_buff *nskb, struct Qdisc *sch,
return q->enqueue(nskb, sch, to_free);
}
+static int tbs_enqueue_fifo(struct sk_buff *nskb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ if (!is_packet_valid(sch, nskb))
+ return qdisc_drop(nskb, sch, to_free);
+
+ return qdisc_enqueue_tail(nskb, sch);
+}
+
static int tbs_enqueue_scheduledfifo(struct sk_buff *nskb, struct Qdisc *sch,
struct sk_buff **to_free)
{
@@ -242,6 +270,21 @@ static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
return q->dequeue(sch);
}
+static struct sk_buff *tbs_dequeue_fifo(struct Qdisc *sch)
+{
+ struct tbs_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb = qdisc_dequeue_head(sch);
+
+ /* XXX: The drop_if_late bit is not checked here because that would
+ * require the PHC time to be read directly.
+ */
+
+ if (skb)
+ q->last = skb->tstamp;
+
+ return skb;
+}
+
static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
{
struct tbs_sched_data *q = qdisc_priv(sch);
@@ -318,6 +361,56 @@ static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
return skb;
}
+static void tbs_disable_offload(struct net_device *dev,
+ struct tbs_sched_data *q)
+{
+ struct tc_tbs_qopt_offload tbs = { };
+ const struct net_device_ops *ops;
+ int err;
+
+ if (!q->offload)
+ return;
+
+ ops = dev->netdev_ops;
+ if (!ops->ndo_setup_tc)
+ return;
+
+ tbs.queue = q->queue;
+ tbs.enable = 0;
+
+ err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_TBS, &tbs);
+ if (err < 0)
+ pr_warn("Couldn't disable TBS offload for queue %d\n",
+ tbs.queue);
+}
+
+static int tbs_enable_offload(struct net_device *dev, struct tbs_sched_data *q,
+ struct netlink_ext_ack *extack)
+{
+ const struct net_device_ops *ops = dev->netdev_ops;
+ struct tc_tbs_qopt_offload tbs = { };
+ int err;
+
+ if (q->offload)
+ return 0;
+
+ if (!ops->ndo_setup_tc) {
+ NL_SET_ERR_MSG(extack, "Specified device does not support TBS offload");
+ return -EOPNOTSUPP;
+ }
+
+ tbs.queue = q->queue;
+ tbs.enable = 1;
+
+ err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_TBS, &tbs);
+ if (err < 0) {
+ NL_SET_ERR_MSG(extack, "Specified device failed to setup TBS hardware offload");
+ return err;
+ }
+
+ return 0;
+}
+
static inline void setup_queueing_mode(struct tbs_sched_data *q)
{
if (q->sorting) {
@@ -325,9 +418,15 @@ static inline void setup_queueing_mode(struct tbs_sched_data *q)
q->dequeue = tbs_dequeue_timesortedlist;
q->peek = tbs_peek_timesortedlist;
} else {
- q->enqueue = tbs_enqueue_scheduledfifo;
- q->dequeue = tbs_dequeue_scheduledfifo;
- q->peek = qdisc_peek_head;
+ if (q->offload) {
+ q->enqueue = tbs_enqueue_fifo;
+ q->dequeue = tbs_dequeue_fifo;
+ q->peek = qdisc_peek_head;
+ } else {
+ q->enqueue = tbs_enqueue_scheduledfifo;
+ q->dequeue = tbs_dequeue_scheduledfifo;
+ q->peek = qdisc_peek_head;
+ }
}
}
@@ -356,8 +455,9 @@ static int tbs_init(struct Qdisc *sch, struct nlattr *opt,
qopt = nla_data(tb[TCA_TBS_PARMS]);
- pr_debug("delta %d clockid %d sorting %s\n",
+ pr_debug("delta %d clockid %d offload %s sorting %s\n",
qopt->delta, qopt->clockid,
+ OFFLOAD_IS_ON(qopt) ? "on" : "off",
SORTING_IS_ON(qopt) ? "on" : "off");
err = validate_input_params(qopt, extack);
@@ -366,15 +466,26 @@ static int tbs_init(struct Qdisc *sch, struct nlattr *opt,
q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);
+ if (OFFLOAD_IS_ON(qopt)) {
+ err = tbs_enable_offload(dev, q, extack);
+ if (err < 0)
+ return err;
+ }
+
/* Everything went OK, save the parameters used. */
q->delta = qopt->delta;
q->clockid = qopt->clockid;
+ q->offload = OFFLOAD_IS_ON(qopt);
q->sorting = SORTING_IS_ON(qopt);
- /* Select queueing mode based on parameters. */
+ /* Select queueing mode based on offload and sorting parameters. */
setup_queueing_mode(q);
- qdisc_watchdog_init_clockid(&q->watchdog, sch, q->clockid);
+ /* The watchdog will be needed for SW best-effort or if TxTime
+ * based sorting is on.
+ */
+ if (!q->offload || q->sorting)
+ qdisc_watchdog_init_clockid(&q->watchdog, sch, q->clockid);
return 0;
}
@@ -416,10 +527,13 @@ static void tbs_reset(struct Qdisc *sch)
static void tbs_destroy(struct Qdisc *sch)
{
struct tbs_sched_data *q = qdisc_priv(sch);
+ struct net_device *dev = qdisc_dev(sch);
/* Only cancel watchdog if it's been initialized. */
if (q->watchdog.qdisc == sch)
qdisc_watchdog_cancel(&q->watchdog);
+
+ tbs_disable_offload(dev, q);
}
static int tbs_dump(struct Qdisc *sch, struct sk_buff *skb)
@@ -434,6 +548,9 @@ static int tbs_dump(struct Qdisc *sch, struct sk_buff *skb)
opt.delta = q->delta;
opt.clockid = q->clockid;
+ if (q->offload)
+ opt.flags |= TC_TBS_OFFLOAD_ON;
+
if (q->sorting)
opt.flags |= TC_TBS_SORTING_ON;
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 15/18] igb: Refactor igb_configure_cbs()
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (13 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 16/18] igb: Only change Tx arbitration when CBS is on Jesus Sanchez-Palencia
` (4 subsequent siblings)
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Make this function retrieve what it needs from the Tx ring being
addressed since it already relies on what had been saved on it before.
Also, since this function will be used by the upcoming Launchtime
patches, rename it to better reflect its intention. Note that
Launchtime is not part of what 802.1Qav specifies, but the i210
datasheet refers to this set of functionality as "Qav Transmission
Mode".
Here we also perform a tiny refactor of is_any_cbs_enabled(), and add
further documentation to igb_setup_tx_mode().
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
drivers/net/ethernet/intel/igb/igb_main.c | 54 ++++++++++++++-----------------
1 file changed, 25 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae785369..49cfbe4fd2b1 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1673,23 +1673,17 @@ static void set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
}
/**
- * igb_configure_cbs - Configure Credit-Based Shaper (CBS)
+ * igb_config_tx_modes - Configure "Qav Tx mode" features on igb
* @adapter: pointer to adapter struct
* @queue: queue number
- * @enable: true = enable CBS, false = disable CBS
- * @idleslope: idleSlope in kbps
- * @sendslope: sendSlope in kbps
- * @hicredit: hiCredit in bytes
- * @locredit: loCredit in bytes
*
- * Configure CBS for a given hardware queue. When disabling, idleslope,
- * sendslope, hicredit, locredit arguments are ignored. Returns 0 if
- * success. Negative otherwise.
+ * Configure CBS for a given hardware queue. Parameters are retrieved
+ * from the correct Tx ring, so igb_save_cbs_params() should be used
+ * for setting those correctly prior to this function being called.
**/
-static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
- bool enable, int idleslope, int sendslope,
- int hicredit, int locredit)
+static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
{
+ struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
u32 tqavcc;
@@ -1698,7 +1692,7 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
- if (enable) {
+ if (ring->cbs_enable) {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
@@ -1759,14 +1753,15 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
* calculated value, so the resulting bandwidth might
* be slightly higher for some configurations.
*/
- value = DIV_ROUND_UP_ULL(idleslope * 61034ULL, 1000000);
+ value = DIV_ROUND_UP_ULL(ring->idleslope * 61034ULL, 1000000);
tqavcc = rd32(E1000_I210_TQAVCC(queue));
tqavcc &= ~E1000_TQAVCC_IDLESLOPE_MASK;
tqavcc |= value;
wr32(E1000_I210_TQAVCC(queue), tqavcc);
- wr32(E1000_I210_TQAVHC(queue), 0x80000000 + hicredit * 0x7735);
+ wr32(E1000_I210_TQAVHC(queue),
+ 0x80000000 + ring->hicredit * 0x7735);
} else {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
@@ -1786,8 +1781,9 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
*/
netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit %d locredit %d\n",
- (enable) ? "enabled" : "disabled", queue,
- idleslope, sendslope, hicredit, locredit);
+ (ring->cbs_enable) ? "enabled" : "disabled", queue,
+ ring->idleslope, ring->sendslope, ring->hicredit,
+ ring->locredit);
}
static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
@@ -1812,19 +1808,25 @@ static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
static bool is_any_cbs_enabled(struct igb_adapter *adapter)
{
- struct igb_ring *ring;
int i;
for (i = 0; i < adapter->num_tx_queues; i++) {
- ring = adapter->tx_ring[i];
-
- if (ring->cbs_enable)
+ if (adapter->tx_ring[i]->cbs_enable)
return true;
}
return false;
}
+/**
+ * igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
+ * @adapter: pointer to adapter struct
+ *
+ * Configure TQAVCTRL register switching the controller's Tx mode
+ * if FQTSS mode is enabled or disabled. Additionally, will issue
+ * a call to igb_config_tx_modes() per queue so any previously saved
+ * Tx parameters are applied.
+ **/
static void igb_setup_tx_mode(struct igb_adapter *adapter)
{
struct net_device *netdev = adapter->netdev;
@@ -1884,11 +1886,7 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
adapter->num_tx_queues : I210_SR_QUEUES_NUM;
for (i = 0; i < max_queue; i++) {
- struct igb_ring *ring = adapter->tx_ring[i];
-
- igb_configure_cbs(adapter, i, ring->cbs_enable,
- ring->idleslope, ring->sendslope,
- ring->hicredit, ring->locredit);
+ igb_config_tx_modes(adapter, i);
}
} else {
wr32(E1000_RXPBS, I210_RXPBSIZE_DEFAULT);
@@ -2482,9 +2480,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
return err;
if (is_fqtss_enabled(adapter)) {
- igb_configure_cbs(adapter, qopt->queue, qopt->enable,
- qopt->idleslope, qopt->sendslope,
- qopt->hicredit, qopt->locredit);
+ igb_config_tx_modes(adapter, qopt->queue);
if (!is_any_cbs_enabled(adapter))
enable_fqtss(adapter, false);
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 16/18] igb: Only change Tx arbitration when CBS is on
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (14 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 15/18] igb: Refactor igb_configure_cbs() Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 17/18] igb: Refactor igb_offload_cbs() Jesus Sanchez-Palencia
` (3 subsequent siblings)
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Currently the data transmission arbitration algorithm - DataTranARB
field on TQAVCTRL reg - is always set to CBS when the Tx mode is
changed from legacy to 'Qav' mode.
Make that configuration a bit more granular in preparation for the
upcoming Launchtime enabling patches, since CBS and Launchtime can be
enabled separately. That is achieved by moving the DataTranARB setup
to igb_config_tx_modes() instead.
Similarly, when disabling CBS we must check if it has been disabled
for all queues, and clear the DataTranARB accordingly.
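The set/clear behaviour for DataTranARB can be modelled in user space as
below. This is only a sketch under stated assumptions: a plain uint32_t
stands in for the TQAVCTRL register (the driver uses rd32()/wr32()), the
queue count is arbitrary, and struct sketch_adapter, sketch_any_cbs() and
sketch_set_cbs() are hypothetical names. The bit position matches
E1000_TQAVCTRL_DATATRANARB, which is BIT(8).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_DATATRANARB	(UINT32_C(1) << 8)
#define SKETCH_NUM_QUEUES	4	/* arbitrary for the sketch */

struct sketch_adapter {
	uint32_t tqavctrl;	/* stands in for the TQAVCTRL register */
	bool cbs_enable[SKETCH_NUM_QUEUES];
};

static bool sketch_any_cbs(const struct sketch_adapter *a)
{
	for (int i = 0; i < SKETCH_NUM_QUEUES; i++) {
		if (a->cbs_enable[i])
			return true;
	}

	return false;
}

static void sketch_set_cbs(struct sketch_adapter *a, int queue, bool enable)
{
	assert(queue >= 0 && queue < SKETCH_NUM_QUEUES);

	a->cbs_enable[queue] = enable;

	if (enable)
		/* Any queue with CBS on requires CBS arbitration. */
		a->tqavctrl |= SKETCH_DATATRANARB;
	else if (!sketch_any_cbs(a))
		/* Last CBS user is gone, return to the default arbitration. */
		a->tqavctrl &= ~SKETCH_DATATRANARB;
}
```

Disabling CBS on one queue therefore leaves the bit set as long as
another queue still has CBS enabled.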
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
drivers/net/ethernet/intel/igb/igb_main.c | 49 +++++++++++++++++++++----------
1 file changed, 33 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 49cfbe4fd2b1..9c33f2d18d8c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1672,6 +1672,18 @@ static void set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
wr32(E1000_I210_TQAVCC(queue), val);
}
+static bool is_any_cbs_enabled(struct igb_adapter *adapter)
+{
+ int i;
+
+ for (i = 0; i < adapter->num_tx_queues; i++) {
+ if (adapter->tx_ring[i]->cbs_enable)
+ return true;
+ }
+
+ return false;
+}
+
/**
* igb_config_tx_modes - Configure "Qav Tx mode" features on igb
* @adapter: pointer to adapter struct
@@ -1686,7 +1698,7 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
- u32 tqavcc;
+ u32 tqavcc, tqavctrl;
u16 value;
WARN_ON(hw->mac.type != e1000_i210);
@@ -1696,6 +1708,14 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+ /* Always set data transfer arbitration to credit-based
+ * shaper algorithm on TQAVCTRL if CBS is enabled for any of
+ * the queues.
+ */
+ tqavctrl = rd32(E1000_I210_TQAVCTRL);
+ tqavctrl |= E1000_TQAVCTRL_DATATRANARB;
+ wr32(E1000_I210_TQAVCTRL, tqavctrl);
+
/* According to i210 datasheet section 7.2.7.7, we should set
* the 'idleSlope' field from TQAVCC register following the
* equation:
@@ -1773,6 +1793,16 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
/* Set hiCredit to zero. */
wr32(E1000_I210_TQAVHC(queue), 0);
+
+ /* If CBS is not enabled for any queues anymore, then return to
+ * the default state of Data Transmission Arbitration on
+ * TQAVCTRL.
+ */
+ if (!is_any_cbs_enabled(adapter)) {
+ tqavctrl = rd32(E1000_I210_TQAVCTRL);
+ tqavctrl &= ~E1000_TQAVCTRL_DATATRANARB;
+ wr32(E1000_I210_TQAVCTRL, tqavctrl);
+ }
}
/* XXX: In i210 controller the sendSlope and loCredit parameters from
@@ -1806,18 +1836,6 @@ static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
return 0;
}
-static bool is_any_cbs_enabled(struct igb_adapter *adapter)
-{
- int i;
-
- for (i = 0; i < adapter->num_tx_queues; i++) {
- if (adapter->tx_ring[i]->cbs_enable)
- return true;
- }
-
- return false;
-}
-
/**
* igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
* @adapter: pointer to adapter struct
@@ -1841,11 +1859,10 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
int i, max_queue;
/* Configure TQAVCTRL register: set transmit mode to 'Qav',
- * set data fetch arbitration to 'round robin' and set data
- * transfer arbitration to 'credit shaper algorithm.
+ * set data fetch arbitration to 'round robin'.
*/
val = rd32(E1000_I210_TQAVCTRL);
- val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_DATATRANARB;
+ val |= E1000_TQAVCTRL_XMIT_MODE;
val &= ~E1000_TQAVCTRL_DATAFETCHARB;
wr32(E1000_I210_TQAVCTRL, val);
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 17/18] igb: Refactor igb_offload_cbs()
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (15 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 16/18] igb: Only change Tx arbitration when CBS is on Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 18/18] igb: Add support for TBS offload Jesus Sanchez-Palencia
` (2 subsequent siblings)
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Split code into a separate function (igb_offload_apply()) that will be
used by TBS offload implementation.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
drivers/net/ethernet/intel/igb/igb_main.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 9c33f2d18d8c..10d7809a85d7 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2476,6 +2476,19 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
return features;
}
+static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
+{
+ if (!is_fqtss_enabled(adapter)) {
+ enable_fqtss(adapter, true);
+ return;
+ }
+
+ igb_config_tx_modes(adapter, queue);
+
+ if (!is_any_cbs_enabled(adapter))
+ enable_fqtss(adapter, false);
+}
+
static int igb_offload_cbs(struct igb_adapter *adapter,
struct tc_cbs_qopt_offload *qopt)
{
@@ -2496,15 +2509,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
if (err)
return err;
- if (is_fqtss_enabled(adapter)) {
- igb_config_tx_modes(adapter, qopt->queue);
-
- if (!is_any_cbs_enabled(adapter))
- enable_fqtss(adapter, false);
-
- } else {
- enable_fqtss(adapter, true);
- }
+ igb_offload_apply(adapter, qopt->queue);
return 0;
}
--
2.16.2
* [Intel-wired-lan] [RFC v3 net-next 18/18] igb: Add support for TBS offload
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (16 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 17/18] igb: Refactor igb_offload_cbs() Jesus Sanchez-Palencia
@ 2018-03-07 1:12 ` Jesus Sanchez-Palencia
2018-03-07 5:28 ` [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Richard Cochran
2018-03-08 14:09 ` Henrik Austad
19 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 1:12 UTC (permalink / raw)
To: intel-wired-lan
Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_TBS and configuring i210 so time based transmit
arbitration is enabled.
The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).
As in the previous flow, FQTSS transmission mode is enabled
automatically by the driver once Launchtime (or CBS, as before) is
enabled. Similarly, it's automatically disabled when the feature is
disabled for the last queue that had it enabled.
The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the TBS qdisc already.
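The automatic enable/disable rule for FQTSS can be read as a predicate
over the per-ring flags. The sketch below is an interpretation of the
description above rather than the driver code: struct sketch_ring and
sketch_fqtss_wanted() are hypothetical, and only the two field names
(cbs_enable, launchtime_enable) are taken from struct igb_ring.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_ring {
	bool cbs_enable;
	bool launchtime_enable;
};

/* FQTSS is wanted while at least one queue still uses CBS or Launchtime. */
static bool sketch_fqtss_wanted(const struct sketch_ring rings[], int n)
{
	assert(n >= 0);

	for (int i = 0; i < n; i++) {
		if (rings[i].cbs_enable || rings[i].launchtime_enable)
			return true;
	}

	return false;
}
```

Under this reading, turning Launchtime off on one queue keeps FQTSS
enabled while any other queue still has CBS or Launchtime on.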
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
drivers/net/ethernet/intel/igb/e1000_defines.h | 16 +++
drivers/net/ethernet/intel/igb/igb.h | 1 +
drivers/net/ethernet/intel/igb/igb_main.c | 135 ++++++++++++++++++++++---
3 files changed, 137 insertions(+), 15 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..9e357848c550 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -1066,6 +1066,22 @@
#define E1000_TQAVCTRL_XMIT_MODE BIT(0)
#define E1000_TQAVCTRL_DATAFETCHARB BIT(4)
#define E1000_TQAVCTRL_DATATRANARB BIT(8)
+#define E1000_TQAVCTRL_DATATRANTIM BIT(9)
+#define E1000_TQAVCTRL_SP_WAIT_SR BIT(10)
+/* Fetch Time Delta - bits 31:16
+ *
+ * This field holds the value to be reduced from the launch time for
+ * fetch time decision. The FetchTimeDelta value is defined in 32 ns
+ * granularity.
+ *
+ * This field is 16 bits wide, and so the maximum value is:
+ *
+ * 65535 * 32 = 2097120 ~= 2.1 msec
+ *
+ * XXX: We are configuring the max value here since we couldn't come up
+ * with a reason for not doing so.
+ */
+#define E1000_TQAVCTRL_FETCHTIME_DELTA (0xFFFF << 16)
/* TX Qav Credit Control fields */
#define E1000_TQAVCC_IDLESLOPE_MASK 0xFFFF
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 1c6b8d9176a8..4e1146efa399 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -281,6 +281,7 @@ struct igb_ring {
u16 count; /* number of desc. in the ring */
u8 queue_index; /* logical index of the ring*/
u8 reg_idx; /* physical index of the ring */
+ bool launchtime_enable; /* true if LaunchTime is enabled */
bool cbs_enable; /* indicates if CBS is enabled */
s32 idleslope; /* idleSlope in kbps */
s32 sendslope; /* sendSlope in kbps */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 10d7809a85d7..fa931f66a1f8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1684,13 +1684,26 @@ static bool is_any_cbs_enabled(struct igb_adapter *adapter)
return false;
}
+static bool is_any_txtime_enabled(struct igb_adapter *adapter)
+{
+ int i;
+
+ for (i = 0; i < adapter->num_tx_queues; i++) {
+ if (adapter->tx_ring[i]->launchtime_enable)
+ return true;
+ }
+
+ return false;
+}
+
/**
* igb_config_tx_modes - Configure "Qav Tx mode" features on igb
* @adapter: pointer to adapter struct
* @queue: queue number
*
- * Configure CBS for a given hardware queue. Parameters are retrieved
- * from the correct Tx ring, so igb_save_cbs_params() should be used
+ * Configure CBS and Launchtime for a given hardware queue.
+ * Parameters are retrieved from the correct Tx ring, so
+ * igb_save_cbs_params() and igb_save_txtime_params() should be used
* for setting those correctly prior to this function being called.
**/
static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
@@ -1704,10 +1717,20 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
- if (ring->cbs_enable) {
+ /* If any of the Qav features is enabled, configure queues as SR and
+ * with HIGH PRIO. If none is, then configure them with LOW PRIO and
+ * as SP.
+ */
+ if (ring->cbs_enable || ring->launchtime_enable) {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+ } else {
+ set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
+ set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
+ }
+ /* If CBS is enabled, set DataTranARB and config its parameters. */
+ if (ring->cbs_enable) {
/* Always set data transfer arbitration to credit-based
* shaper algorithm on TQAVCTRL if CBS is enabled for any of
* the queues.
@@ -1783,8 +1806,6 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
wr32(E1000_I210_TQAVHC(queue),
0x80000000 + ring->hicredit * 0x7735);
} else {
- set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
- set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
/* Set idleSlope to zero. */
tqavcc = rd32(E1000_I210_TQAVCC(queue));
@@ -1805,17 +1826,61 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
}
}
+ /* If LaunchTime is enabled, set DataTranTIM. */
+ if (ring->launchtime_enable) {
+ /* Always set DataTranTIM on TQAVCTRL if LaunchTime is enabled
+ * for any of the SR queues, and configure fetchtime delta.
+ * XXX NOTE:
+ * - LaunchTime will be enabled for all SR queues.
+ * - A fixed offset can be added relative to the launch
+ * time of all packets if configured at reg LAUNCH_OS0.
+ * We are keeping it as 0 for now (default value).
+ */
+ tqavctrl = rd32(E1000_I210_TQAVCTRL);
+ tqavctrl |= E1000_TQAVCTRL_DATATRANTIM |
+ E1000_TQAVCTRL_FETCHTIME_DELTA;
+ wr32(E1000_I210_TQAVCTRL, tqavctrl);
+ } else {
+ /* If Launchtime is not enabled for any SR queues anymore,
+ * then clear DataTranTIM on TQAVCTRL and clear fetchtime delta,
+ * effectively disabling Launchtime.
+ */
+ if (!is_any_txtime_enabled(adapter)) {
+ tqavctrl = rd32(E1000_I210_TQAVCTRL);
+ tqavctrl &= ~E1000_TQAVCTRL_DATATRANTIM;
+ tqavctrl &= ~E1000_TQAVCTRL_FETCHTIME_DELTA;
+ wr32(E1000_I210_TQAVCTRL, tqavctrl);
+ }
+ }
+
/* XXX: In i210 controller the sendSlope and loCredit parameters from
* CBS are not configurable by software so we don't do any 'controller
* configuration' in respect to these parameters.
*/
- netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit %d locredit %d\n",
- (ring->cbs_enable) ? "enabled" : "disabled", queue,
+ netdev_dbg(netdev, "Qav Tx mode: cbs %s, launchtime %s, queue %d "
+ "idleslope %d sendslope %d hiCredit %d "
+ "locredit %d\n",
+ (ring->cbs_enable) ? "enabled" : "disabled",
+ (ring->launchtime_enable) ? "enabled" : "disabled", queue,
ring->idleslope, ring->sendslope, ring->hicredit,
ring->locredit);
}
+static int igb_save_txtime_params(struct igb_adapter *adapter, int queue,
+ bool enable)
+{
+ struct igb_ring *ring;
+
+ if (queue < 0 || queue >= adapter->num_tx_queues)
+ return -EINVAL;
+
+ ring = adapter->tx_ring[queue];
+ ring->launchtime_enable = enable;
+
+ return 0;
+}
+
static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
bool enable, int idleslope, int sendslope,
int hicredit, int locredit)
@@ -1859,10 +1924,11 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
int i, max_queue;
/* Configure TQAVCTRL register: set transmit mode to 'Qav',
- * set data fetch arbitration to 'round robin'.
+ * set data fetch arbitration to 'round robin', set SP_WAIT_SR
+ * so SP queues wait for SR ones.
*/
val = rd32(E1000_I210_TQAVCTRL);
- val |= E1000_TQAVCTRL_XMIT_MODE;
+ val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_SP_WAIT_SR;
val &= ~E1000_TQAVCTRL_DATAFETCHARB;
wr32(E1000_I210_TQAVCTRL, val);
@@ -2485,7 +2551,7 @@ static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
igb_config_tx_modes(adapter, queue);
- if (!is_any_cbs_enabled(adapter))
+ if (!is_any_cbs_enabled(adapter) && !is_any_txtime_enabled(adapter))
enable_fqtss(adapter, false);
}
@@ -2514,6 +2580,30 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
return 0;
}
+static int igb_offload_txtime(struct igb_adapter *adapter,
+ struct tc_tbs_qopt_offload *qopt)
+{
+ struct e1000_hw *hw = &adapter->hw;
+ int err;
+
+ /* Launchtime offloading is only supported by i210 controller. */
+ if (hw->mac.type != e1000_i210)
+ return -EOPNOTSUPP;
+
+ /* Launchtime offloading is only supported by queues 0 and 1. */
+ if (qopt->queue < 0 || qopt->queue > 1)
+ return -EINVAL;
+
+ err = igb_save_txtime_params(adapter, qopt->queue, qopt->enable);
+
+ if (err)
+ return err;
+
+ igb_offload_apply(adapter, qopt->queue);
+
+ return 0;
+}
+
static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
void *type_data)
{
@@ -2522,6 +2612,8 @@ static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
switch (type) {
case TC_SETUP_QDISC_CBS:
return igb_offload_cbs(adapter, type_data);
+ case TC_SETUP_QDISC_TBS:
+ return igb_offload_txtime(adapter, type_data);
default:
return -EOPNOTSUPP;
@@ -5333,11 +5425,14 @@ static void igb_set_itr(struct igb_q_vector *q_vector)
}
}
-static void igb_tx_ctxtdesc(struct igb_ring *tx_ring, u32 vlan_macip_lens,
- u32 type_tucmd, u32 mss_l4len_idx)
+static void igb_tx_ctxtdesc(struct igb_ring *tx_ring,
+ struct igb_tx_buffer *first,
+ u32 vlan_macip_lens, u32 type_tucmd,
+ u32 mss_l4len_idx)
{
struct e1000_adv_tx_context_desc *context_desc;
u16 i = tx_ring->next_to_use;
+ struct timespec64 ts;
context_desc = IGB_TX_CTXTDESC(tx_ring, i);
@@ -5352,9 +5447,18 @@ static void igb_tx_ctxtdesc(struct igb_ring *tx_ring, u32 vlan_macip_lens,
mss_l4len_idx |= tx_ring->reg_idx << 4;
context_desc->vlan_macip_lens = cpu_to_le32(vlan_macip_lens);
- context_desc->seqnum_seed = 0;
context_desc->type_tucmd_mlhl = cpu_to_le32(type_tucmd);
context_desc->mss_l4len_idx = cpu_to_le32(mss_l4len_idx);
+
+ /* We assume there is always a valid tx time available. Invalid times
+ * should have been handled by the upper layers.
+ */
+ if (tx_ring->launchtime_enable) {
+ ts = ns_to_timespec64(first->skb->tstamp);
+ context_desc->seqnum_seed = cpu_to_le32(ts.tv_nsec / 32);
+ } else {
+ context_desc->seqnum_seed = 0;
+ }
}
static int igb_tso(struct igb_ring *tx_ring,
@@ -5437,7 +5541,8 @@ static int igb_tso(struct igb_ring *tx_ring,
vlan_macip_lens |= (ip.hdr - skb->data) << E1000_ADVTXD_MACLEN_SHIFT;
vlan_macip_lens |= first->tx_flags & IGB_TX_FLAGS_VLAN_MASK;
- igb_tx_ctxtdesc(tx_ring, vlan_macip_lens, type_tucmd, mss_l4len_idx);
+ igb_tx_ctxtdesc(tx_ring, first, vlan_macip_lens,
+ type_tucmd, mss_l4len_idx);
return 1;
}
@@ -5492,7 +5597,7 @@ static void igb_tx_csum(struct igb_ring *tx_ring, struct igb_tx_buffer *first)
vlan_macip_lens |= skb_network_offset(skb) << E1000_ADVTXD_MACLEN_SHIFT;
vlan_macip_lens |= first->tx_flags & IGB_TX_FLAGS_VLAN_MASK;
- igb_tx_ctxtdesc(tx_ring, vlan_macip_lens, type_tucmd, 0);
+ igb_tx_ctxtdesc(tx_ring, first, vlan_macip_lens, type_tucmd, 0);
}
#define IGB_SET_FLAG(_input, _flag, _result) \
--
2.16.2
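For readers following the arithmetic in the patch above, here is a minimal
userspace sketch (not driver code) of the two conversions it relies on: the
launch time written to the context descriptor (ts.tv_nsec / 32) and the
maximum FetchTimeDelta (65535 * 32 ns). The bit positions are copied from the
e1000_defines.h hunk; the helper names launchtime_field(),
fetch_delta_max_ns() and tqavctrl_set_launchtime() are made up for
illustration and do not exist in igb.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Bit layout copied from the e1000_defines.h hunk in the patch. */
#define TQAVCTRL_DATATRANTIM     (1u << 9)
#define TQAVCTRL_SP_WAIT_SR      (1u << 10)
#define TQAVCTRL_FETCHTIME_DELTA (0xFFFFu << 16)

/* Hypothetical helper: sub-second part of the txtime in 32 ns units,
 * mirroring "ts.tv_nsec / 32" in igb_tx_ctxtdesc().
 * Assumes txtime_ns is not negative.
 */
uint32_t launchtime_field(int64_t txtime_ns)
{
	return (uint32_t)((txtime_ns % NSEC_PER_SEC) / 32);
}

/* Hypothetical helper: largest fetch time delta, in nanoseconds, that the
 * 16-bit FetchTimeDelta field can express at 32 ns granularity.
 */
uint32_t fetch_delta_max_ns(void)
{
	return (TQAVCTRL_FETCHTIME_DELTA >> 16) * 32u;
}

/* Hypothetical helper: the read-modify-write applied to TQAVCTRL at the
 * end of igb_config_tx_modes(), without the register access itself.
 */
uint32_t tqavctrl_set_launchtime(uint32_t tqavctrl, int enable)
{
	uint32_t bits = TQAVCTRL_DATATRANTIM | TQAVCTRL_FETCHTIME_DELTA;

	return enable ? (tqavctrl | bits) : (tqavctrl & ~bits);
}
```

Since only the sub-second part is used, two txtimes exactly one second apart
produce the same field value in this sketch.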
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params Jesus Sanchez-Palencia
@ 2018-03-07 2:53 ` Eric Dumazet
2018-03-07 5:24 ` Richard Cochran
2018-03-07 21:52 ` Jesus Sanchez-Palencia
0 siblings, 2 replies; 52+ messages in thread
From: Eric Dumazet @ 2018-03-07 2:53 UTC (permalink / raw)
To: intel-wired-lan
On Tue, 2018-03-06 at 17:12 -0800, Jesus Sanchez-Palencia wrote:
> Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
> a drop_if_late flag. With this commit the API becomes:
>
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d8340e6e8814..951969ceaf65 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -788,6 +788,9 @@ struct sk_buff {
> __u8 tc_redirected:1;
> __u8 tc_from_ingress:1;
> #endif
> + __u8 tc_drop_if_late:1;
> +
> + clockid_t txtime_clockid;
>
> #ifdef CONFIG_NET_SCHED
> __u16 tc_index; /* traffic control index */
This is adding 32+1 bits to sk_buff, and possibly holes in this very
very hot (and already too fat) structure.
Do we really need 32 bits for a clockid_t ?
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 2:53 ` Eric Dumazet
@ 2018-03-07 5:24 ` Richard Cochran
2018-03-07 17:01 ` Willem de Bruijn
2018-03-21 12:58 ` Thomas Gleixner
2018-03-07 21:52 ` Jesus Sanchez-Palencia
1 sibling, 2 replies; 52+ messages in thread
From: Richard Cochran @ 2018-03-07 5:24 UTC (permalink / raw)
To: intel-wired-lan
On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
> This is adding 32+1 bits to sk_buff, and possibly holes in this very
> very hot (and already too fat) structure.
>
> Do we really need 32 bits for a clockid_t ?
Probably we can live with fewer bits.
For clock IDs with a positive sign, the max possible clock value is 16.
For clock IDs with a negative sign, IIRC, three bits are for the type
code (we have also posix timers packed like this) and the rest are for the
file descriptor. So maybe we could use 16 bits, allowing 12 bits or
so for encoding the FD.
The downside would be that this forces the application to make sure
and open the dynamic posix clock early enough before the FD count gets
too high.
Thanks,
Richard
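The bit budget described above can be checked with a small userspace sketch.
It assumes the packing used by the kernel's FD_TO_CLOCKID() macro,
((~fd << 3) | CLOCKFD) with CLOCKFD == 3, rewritten as plain arithmetic so
that no negative value is shifted. The function names are illustrative only.

```c
#include <stdint.h>

#define CLOCKFD 3 /* type code the kernel uses for fd-based (dynamic) clocks */

/* Same value as (~fd << 3) | CLOCKFD on two's complement machines:
 * ~fd == -(fd + 1), the shift multiplies by 8, and the low three bits
 * are free to hold the type code.
 */
int32_t fd_to_clockid(int32_t fd)
{
	return CLOCKFD - 8 * (fd + 1);
}

/* Inverse of fd_to_clockid(); only meaningful for dynamic clock IDs. */
int32_t clockid_to_fd(int32_t clk)
{
	return (CLOCKFD - clk) / 8 - 1;
}

/* Would this clock ID survive being stored in a signed 16-bit field? */
int clockid_fits_s16(int32_t clk)
{
	return clk >= INT16_MIN && clk <= INT16_MAX;
}
```

Under these assumptions, file descriptors 0 through 4095 still fit in 16
bits, which matches the "12 bits or so" estimate; fd 4096 does not.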
* [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (17 preceding siblings ...)
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 18/18] igb: Add support for TBS offload Jesus Sanchez-Palencia
@ 2018-03-07 5:28 ` Richard Cochran
2018-03-08 14:09 ` Henrik Austad
19 siblings, 0 replies; 52+ messages in thread
From: Richard Cochran @ 2018-03-07 5:28 UTC (permalink / raw)
To: intel-wired-lan
On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> Design changes since v2:
> - Now on the dequeue() path, tbs only drops an expired packet if it has the
> skb->tc_drop_if_late flag set. In practical terms, this will define if
> the semantics of txtime on a system is "not earlier than" or "not later
> than" a given timestamp;
> - Now on the enqueue() path, the qdisc will drop a packet if its clockid
> doesn't match the qdisc's one;
> - Sorting the packets based on their txtime is now an option for the qdisc.
> Effectively, this means it can be configured in 4 modes: HW offload or
> SW best-effort, sorting enabled or disabled;
While all of this makes the series and the configuration more complex,
still I like the fact that the interface offers these different modes.
Looking forward to testing this...
Thanks,
Richard
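The enqueue and dequeue rules quoted above can be modelled in a few lines of
plain C. This is only a sketch of the described semantics, not the tbs qdisc
code: the struct, the enum and both function names are hypothetical, and the
"hold until txtime" branch is an assumption about the SW best-effort mode
that the cover letter does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

enum tbs_verdict { TBS_HOLD, TBS_SEND, TBS_DROP };

/* Hypothetical per-packet state, standing in for the skb fields. */
struct tbs_pkt {
	int64_t txtime_ns;
	int clockid;
	bool drop_if_late;
};

/* enqueue(): a packet whose clockid differs from the qdisc's is dropped. */
bool tbs_enqueue_ok(const struct tbs_pkt *p, int qdisc_clockid)
{
	return p->clockid == qdisc_clockid;
}

/* dequeue(): an expired packet is dropped only if drop_if_late is set
 * ("not later than"); otherwise it is still sent ("not earlier than").
 */
enum tbs_verdict tbs_dequeue_verdict(const struct tbs_pkt *p, int64_t now_ns)
{
	if (p->txtime_ns > now_ns)
		return TBS_HOLD; /* assumption: not due yet, keep it queued */
	if (p->txtime_ns < now_ns && p->drop_if_late)
		return TBS_DROP;
	return TBS_SEND;
}
```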
* [Intel-wired-lan] [RFC v3 net-next 01/18] sock: Fix SO_ZEROCOPY switch case
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 01/18] sock: Fix SO_ZEROCOPY switch case Jesus Sanchez-Palencia
@ 2018-03-07 16:58 ` Willem de Bruijn
0 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-03-07 16:58 UTC (permalink / raw)
To: intel-wired-lan
On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
<jesus.sanchez-palencia@intel.com> wrote:
> Fix the SO_ZEROCOPY switch case on sock_setsockopt() avoiding the
> ret values to be overwritten by the one set on the default case.
>
> Fixes: 28190752c7092 ("sock: permit SO_ZEROCOPY on PF_RDS socket")
> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Please send this fix to net-next independent from the rest of the patchset.
* [Intel-wired-lan] [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path Jesus Sanchez-Palencia
@ 2018-03-07 16:59 ` Willem de Bruijn
2018-03-07 22:03 ` Jesus Sanchez-Palencia
0 siblings, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-03-07 16:59 UTC (permalink / raw)
To: intel-wired-lan
On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
<jesus.sanchez-palencia@intel.com> wrote:
> This is done in preparation for the upcoming time based transmission
> patchset. Now that skb->tstamp will be used to hold packet's txtime,
> we must ensure that it is being cleared when traversing namespaces.
> Also, doing that from skb_scrub_packet() would break our feature when
> tunnels are used.
Then the right location to move to is skb_scrub_packet below the test for xnet.
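As a rough illustration of that suggestion, the following userspace mock
(not kernel code) resets the timestamp only after the xnet test, so it
survives when the packet stays within one namespace, as in the tunnel case.
struct mock_skb and mock_scrub_packet() are hypothetical stand-ins, loosely
modelled on skb_scrub_packet().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields that matter here. */
struct mock_skb {
	int64_t tstamp;
	uint32_t mark;
	int pkt_type;
};

void mock_scrub_packet(struct mock_skb *skb, bool xnet)
{
	/* Always reset, whether or not a namespace boundary is crossed. */
	skb->pkt_type = 0;

	if (!xnet)
		return;

	/* Only reached when crossing namespaces: the txtime goes here. */
	skb->mark = 0;
	skb->tstamp = 0;
}
```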
> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
> ---
> include/linux/netdevice.h | 1 +
> net/core/skbuff.c | 1 -
> 2 files changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index dbe6344b727a..7104de2bc957 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3379,6 +3379,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
>
> skb_scrub_packet(skb, true);
> skb->priority = 0;
> + skb->tstamp = 0;
> return 0;
> }
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 715c13495ba6..678fc5416ae1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
> */
> void skb_scrub_packet(struct sk_buff *skb, bool xnet)
> {
> - skb->tstamp = 0;
> skb->pkt_type = PACKET_HOST;
> skb->skb_iif = 0;
> skb->ignore_df = 0;
> --
> 2.16.2
>
* [Intel-wired-lan] [RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission.
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission Jesus Sanchez-Palencia
@ 2018-03-07 17:00 ` Willem de Bruijn
0 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-03-07 17:00 UTC (permalink / raw)
To: intel-wired-lan
On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
<jesus.sanchez-palencia@intel.com> wrote:
> From: Richard Cochran <rcochran@linutronix.de>
>
> For raw packets, copy the desired future transmit time from the CMSG
> cookie into the skb.
>
> Signed-off-by: Richard Cochran <rcochran@linutronix.de>
> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
> ---
> net/ipv4/raw.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 54648d20bf0f..8e05970ba7c4 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
>
> skb->priority = sk->sk_priority;
> skb->mark = sk->sk_mark;
> + skb->tstamp = sockc->transmit_time;
This implements the feature only for the hdrincl case and silently
drops the txtime request on other raw sockets (incl. corked).
At the least, should probably fail if sockc.transmit_time is non-zero
and the hdrincl path is not taken. Or implement by passing through
inet_cork and set in __ip_make_skb. Then be careful to ignore the
field for other protocols, where it may be uninitialized.
* [Intel-wired-lan] [RFC v3 net-next 06/18] net: ipv4: udp: Hook into time based transmission.
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 06/18] net: ipv4: udp: " Jesus Sanchez-Palencia
@ 2018-03-07 17:00 ` Willem de Bruijn
0 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-03-07 17:00 UTC (permalink / raw)
To: intel-wired-lan
On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
<jesus.sanchez-palencia@intel.com> wrote:
> From: Richard Cochran <rcochran@linutronix.de>
>
> For udp packets, copy the desired future transmit time from the CMSG
> cookie into the skb.
>
> Signed-off-by: Richard Cochran <rcochran@linutronix.de>
> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
> ---
> net/ipv4/udp.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 3013404d0935..d683bbde526b 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -926,6 +926,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> }
>
> ipc.sockc.tsflags = sk->sk_tsflags;
> + ipc.sockc.transmit_time = 0;
> ipc.addr = inet->inet_saddr;
> ipc.oif = sk->sk_bound_dev_if;
>
> @@ -1040,8 +1041,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> sizeof(struct udphdr), &ipc, &rt,
> msg->msg_flags);
> err = PTR_ERR(skb);
> - if (!IS_ERR_OR_NULL(skb))
> + if (!IS_ERR_OR_NULL(skb)) {
> + skb->tstamp = ipc.sockc.transmit_time;
> err = udp_send_skb(skb, fl4);
> + }
Similar comment to raw: this implements the feature only for a subset of
udp requests, namely those that can take the fast path.
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 5:24 ` Richard Cochran
@ 2018-03-07 17:01 ` Willem de Bruijn
2018-03-07 17:35 ` Richard Cochran
2018-03-21 12:58 ` Thomas Gleixner
1 sibling, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-03-07 17:01 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Mar 7, 2018 at 12:24 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
>> This is adding 32+1 bits to sk_buff, and possibly holes in this very
>> very hot (and already too fat) structure.
>>
>> Do we really need 32 bits for a clockid_t ?
>
> Probably we can live with fewer bits.
>
> For clock IDs with a positive sign, the max possible clock value is 16.
>
> For clock IDs with a negative sign, IIRC, three bits are for the type
> code (we have also posix timers packed like this) and the rest are for the
> file descriptor. So maybe we could use 16 bits, allowing 12 bits or
> so for encoding the FD.
>
> The downside would be that this forces the application to make sure
> and open the dynamic posix clock early enough before the FD count gets
> too high.
The same choices are probably made for all packets on a given
socket. Unless skb->sk gets scrubbed in some transmit paths,
these could then be set as a sockopt instead of a cmsg.
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 17:01 ` Willem de Bruijn
@ 2018-03-07 17:35 ` Richard Cochran
2018-03-07 17:37 ` Richard Cochran
0 siblings, 1 reply; 52+ messages in thread
From: Richard Cochran @ 2018-03-07 17:35 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Mar 07, 2018 at 12:01:19PM -0500, Willem de Bruijn wrote:
> The same choices are probably made for all packets on a given
> socket. Unless skb->sk gets scrubbed in some transmit paths,
> then these be set as sockopt instead of cmsg.
The discussion on v2 ended with this per-message idea, in preference
to the per-socket idea, IIRC.
Thanks,
Richard
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 17:35 ` Richard Cochran
@ 2018-03-07 17:37 ` Richard Cochran
2018-03-07 17:47 ` Eric Dumazet
0 siblings, 1 reply; 52+ messages in thread
From: Richard Cochran @ 2018-03-07 17:37 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Mar 07, 2018 at 09:35:24AM -0800, Richard Cochran wrote:
> The discussion on v2 ended with this per-message idea, in preference
> to the per-socket idea, IIRC.
(But my own opinion is that per-socket is good enough...)
Thanks,
Richard
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 17:37 ` Richard Cochran
@ 2018-03-07 17:47 ` Eric Dumazet
2018-03-08 16:44 ` Richard Cochran
0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2018-03-07 17:47 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Mar 7, 2018 at 9:37 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Wed, Mar 07, 2018 at 09:35:24AM -0800, Richard Cochran wrote:
>> The discussion on v2 ended with this per-message idea, in preference
>> to the per-socket idea, IIRC.
>
> (But my own opinion is that per-socket is good enough...)
>
> Thanks,
> Richard
I would love it if skb->tstamp could be either 0 or expressed in
ktime_get() base all the time.
( Even if we would have to convert this to other bases when/if needed)
Having to deal with many clockids in the core networking stack seems
over-engineered.
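To make the "convert between bases" idea concrete, here is a userspace
sketch that rebases a timestamp from one clock onto another by sampling both
clocks. It assumes the two clocks tick at the same rate between the samples
and ignores the small delay between the two clock_gettime() calls; the
function names are illustrative. CLOCK_MONOTONIC is used as the target
because it is the base ktime_get() reports.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Pure arithmetic: move t_src from the source clock's base to the
 * destination clock's base, given one reading of each clock taken at
 * (nearly) the same instant.
 */
int64_t rebase_ns(int64_t t_src, int64_t src_now, int64_t dst_now)
{
	return t_src - src_now + dst_now;
}

/* Current reading of a POSIX clock in nanoseconds, or -1 on error. */
int64_t clock_now_ns(clockid_t id)
{
	struct timespec ts;

	if (clock_gettime(id, &ts) != 0)
		return -1;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Example use: express a CLOCK_REALTIME txtime on the monotonic base. */
int64_t realtime_to_monotonic_ns(int64_t t_real)
{
	return rebase_ns(t_real, clock_now_ns(CLOCK_REALTIME),
			 clock_now_ns(CLOCK_MONOTONIC));
}
```

A conversion like this is only as good as the relationship between the two
clocks: a step in CLOCK_REALTIME after the samples were taken would
invalidate an already converted value.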
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 2:53 ` Eric Dumazet
2018-03-07 5:24 ` Richard Cochran
@ 2018-03-07 21:52 ` Jesus Sanchez-Palencia
2018-03-07 22:45 ` Eric Dumazet
1 sibling, 1 reply; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 21:52 UTC (permalink / raw)
To: intel-wired-lan
Hi,
On 03/06/2018 06:53 PM, Eric Dumazet wrote:
> On Tue, 2018-03-06 at 17:12 -0800, Jesus Sanchez-Palencia wrote:
>> Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
>> a drop_if_late flag. With this commit the API becomes:
>>
>>
>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index d8340e6e8814..951969ceaf65 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -788,6 +788,9 @@ struct sk_buff {
>> __u8 tc_redirected:1;
>> __u8 tc_from_ingress:1;
>> #endif
>> + __u8 tc_drop_if_late:1;
>> +
>> + clockid_t txtime_clockid;
>>
>> #ifdef CONFIG_NET_SCHED
>> __u16 tc_index; /* traffic control index */
>
>
> This is adding 32+1 bits to sk_buff, and possibly holes in this very
> very hot (and already too fat) structure.
I should have mentioned this in the commit message, but tc_drop_if_late is
actually filling a 1-bit hole that was already there.
>
> Do we really need 32 bits for a clockid_t ?
There is a 2-byte hole just after tc_index, so a u16 clockid would fit
perfectly without increasing the skbuff's size / cachelines any further.
From Richard's reply, it seems safe to just change the definition here if we
make the caveat about the max possible fd count for dynamic clocks explicit
in the SCM_CLOCKID documentation.
How does that sound?
Thanks,
Jesus
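The hole-filling argument can be illustrated with a toy layout. The structs
below are hypothetical and far smaller than sk_buff; they only assume, as on
mainstream ABIs, that a 32-bit member is 4-byte aligned, so a lone u16 in
front of it leaves two bytes of padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy layout: a u16 followed by a 4-byte aligned member. */
struct layout_before {
	uint16_t tc_index;
	/* 2-byte hole here */
	uint32_t next_field;
};

/* A u16 clockid slots into the hole; the size does not change. */
struct layout_u16_clockid {
	uint16_t tc_index;
	uint16_t txtime_clockid;
	uint32_t next_field;
};

/* A 32-bit clockid cannot use the hole and grows the struct. */
struct layout_s32_clockid {
	uint16_t tc_index;
	int32_t txtime_clockid;
	uint32_t next_field;
};

/* Padding, in bytes, between tc_index and next_field in the toy layout. */
size_t hole_after_tc_index(void)
{
	return offsetof(struct layout_before, next_field) - sizeof(uint16_t);
}
```

On such an ABI the first two structs are 8 bytes and the third is 12.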
* [Intel-wired-lan] [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path
2018-03-07 16:59 ` Willem de Bruijn
@ 2018-03-07 22:03 ` Jesus Sanchez-Palencia
0 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 22:03 UTC (permalink / raw)
To: intel-wired-lan
On 03/07/2018 08:59 AM, Willem de Bruijn wrote:
> On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
> <jesus.sanchez-palencia@intel.com> wrote:
>> This is done in preparation for the upcoming time based transmission
>> patchset. Now that skb->tstamp will be used to hold packet's txtime,
>> we must ensure that it is being cleared when traversing namespaces.
>> Also, doing that from skb_scrub_packet() would break our feature when
>> tunnels are used.
>
> Then the right location to move to is skb_scrub_packet below the test for xnet.
Fixed, thanks.
>
>> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>> ---
>> include/linux/netdevice.h | 1 +
>> net/core/skbuff.c | 1 -
>> 2 files changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index dbe6344b727a..7104de2bc957 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3379,6 +3379,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
>>
>> skb_scrub_packet(skb, true);
>> skb->priority = 0;
>> + skb->tstamp = 0;
>> return 0;
>> }
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 715c13495ba6..678fc5416ae1 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
>> */
>> void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>> {
>> - skb->tstamp = 0;
>> skb->pkt_type = PACKET_HOST;
>> skb->skb_iif = 0;
>> skb->ignore_df = 0;
>> --
>> 2.16.2
>>
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 21:52 ` Jesus Sanchez-Palencia
@ 2018-03-07 22:45 ` Eric Dumazet
2018-03-07 23:03 ` David Miller
2018-03-08 11:37 ` Miroslav Lichvar
0 siblings, 2 replies; 52+ messages in thread
From: Eric Dumazet @ 2018-03-07 22:45 UTC (permalink / raw)
To: intel-wired-lan
On Wed, 2018-03-07 at 13:52 -0800, Jesus Sanchez-Palencia wrote:
> Hi,
...
> I should have mentioned on the commit msg, but the tc_drop_if_late is
> actually
> filling a 1 bit hole that was already there.
>
>
> >
> > Do we really need 32 bits for a clockid_t ?
>
> There is a 2 bytes hole just after tc_index, so a u16 clockid would
> fit
> perfectly without increasing the skbuffs size / cachelines any
> further.
>
> From Richard's reply, it seems safe to just change the definition
> here if we
> make it explicit on the SCM_CLOCKID documentation the caveat about
> the max
> possible fd count for dynamic clocks.
>
> How does that sound?
Not convincing really :/
Next big feature needing one bit in sk_buff will add it, and add a
63bit hole.
Then next feature(s) will happily consume 'because there are holes
anyway'.
Then at some point we will cross cache line boundary and performance
will take a 10 % hit.
It is a never ending trend.
If you really need 33 bits, then maybe we'll ask you to guard the new
bits with some #if IS_ENABLED(CONFIG_...) so that we can opt-out.
Why do we _really_ need dynamic clocks being supported in core
networking stack, other than 'that is needed to send 2 packets per
second with precise departure time and arbitrary user defined clocks,
so lets do that, and do not care of the other 10,000,000 packets we
receive/send per second'
I have one patch (TXCS, something that I called XPS in the past)
implementing the remote-freeing of skbs that help workloads where skb
are produced on cpu A and consumed on cpu B,
using an additional 16bit field that I have not upstreamed yet (even if
Mellanox folks want that), simply because of this additional field...
Maybe I should eat this hole before you take it ?
No, we need to be extra careful.
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 22:45 ` Eric Dumazet
@ 2018-03-07 23:03 ` David Miller
2018-03-08 11:37 ` Miroslav Lichvar
1 sibling, 0 replies; 52+ messages in thread
From: David Miller @ 2018-03-07 23:03 UTC (permalink / raw)
To: intel-wired-lan
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 07 Mar 2018 14:45:45 -0800
> No, we need to be extra careful.
+1
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 22:45 ` Eric Dumazet
2018-03-07 23:03 ` David Miller
@ 2018-03-08 11:37 ` Miroslav Lichvar
2018-03-08 16:25 ` David Miller
1 sibling, 1 reply; 52+ messages in thread
From: Miroslav Lichvar @ 2018-03-08 11:37 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Mar 07, 2018 at 02:45:45PM -0800, Eric Dumazet wrote:
> On Wed, 2018-03-07 at 13:52 -0800, Jesus Sanchez-Palencia wrote:
> > > Do we really need 32 bits for a clockid_t ?
> >
> > There is a 2 bytes hole just after tc_index, so a u16 clockid would
> > fit
> > perfectly without increasing the skbuffs size / cachelines any
> > further.
> Not convincing really :/
>
> Next big feature needing one bit in sk_buff will add it, and add a
> 63bit hole.
Would it be possible to put the clockid in skb_shared_info? If that's
technically difficult or does not make sense, I'm ok with the clockid
being a socket option.
If a packet is sent immediately after changing the clockid via
setsockopt(), will it be still guaranteed that the packet is
restricted by the new id?
> Why do we _really_ need dynamic clocks being supported in core
> networking stack, other than 'that is needed to send 2 packets per
> second with precise departure time and arbitrary user defined clocks,
> so lets do that, and do not care of the other 10,000,000 packets we
> receive/send per second'
Well, I'd not expect it to be a common use case, but a public NTP
server could be sending millions of packets per second in traffic
peaks (typically at *:00:00) over multiple interfaces.
--
Miroslav Lichvar
* [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
` (18 preceding siblings ...)
2018-03-07 5:28 ` [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Richard Cochran
@ 2018-03-08 14:09 ` Henrik Austad
2018-03-08 18:06 ` Jesus Sanchez-Palencia
19 siblings, 1 reply; 52+ messages in thread
From: Henrik Austad @ 2018-03-08 14:09 UTC (permalink / raw)
To: intel-wired-lan
On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> This series is the v3 of the Time based packet transmission RFC, which was
> originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
> and further developed by us with the addition of the tbs qdisc
> (v2: https://lwn.net/Articles/744797/ ).
Nice!
> It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
> implements support for hw offloading on the igb driver for the Intel
> i210 NIC. The tbs qdisc also supports SW best effort that can be used
> as a fallback.
>
> The main changes since v2 can be found below.
>
> Fixes since v2:
> - skb->tstamp is only cleared on the forwarding path;
> - ktime_t is no longer the type used for timestamps (s64 is);
> - get_unaligned() is now used for copying data from the cmsg header;
> - added getsockopt() support for SO_TXTIME;
> - restricted SO_TXTIME input range to [0,1];
> - removed ns_capable() check from __sock_cmsg_send();
> - the qdisc control struct now uses a 32 bitmap for config flags;
> - fixed qdisc backlog decrement bug;
> - 'overlimits' is now incremented on dequeue() drops in addition to the
> 'dropped' counter;
>
> Interface changes since v2:
> * CMSG interface:
> - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
> - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
> * tc-tbs:
> - clockid now receives a string;
> e.g.: CLOCK_REALTIME or /dev/ptp0
> - offload is now a standalone argument (i.e. no more offload 1);
> - sorting is now argument that enables txtime based sorting provided
> by the qdisc;
>
> Design changes since v2:
> - Now on the dequeue() path, tbs only drops an expired packet if it has the
> skb->tc_drop_if_late flag set. In practical terms, this will define if
> the semantics of txtime on a system is "not earlier than" or "not later
> than" a given timestamp;
> - Now on the enqueue() path, the qdisc will drop a packet if its clockid
> doesn't match the qdisc's one;
> - Sorting the packets based on their txtime is now an option for the disc.
> Effectively, this means it can be configured in 4 modes: HW offload or
> SW best-effort, sorting enabled or disabled;
A lot of new knobs. I see the need, and although I would've liked to have
fewer, you've documented them pretty well. Perhaps we should add something
to Documentation/ at some stage?
Anyway, the patches applied cleanly, so I gave them a (very) quick spin,
using udp_tai on the sender and tcpdump on the other end to grab the frames.
I set it up with hw offload and sorting in the qdisc.
Sender (one frame every 10ms; 4.16-rc4 on a 1.8GHz Core 2 Duo w/ i210, with a
max_rss bypass because dual-core and the i210 are not friends):
udp_tai -c1 -i eth2 -p 20 -P 10000000
Receiver (imx7, kernel 4.9.11):
chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 256" > tai_imx7.log
Note: this involves 2 switches and a somewhat hackish kernel running on the
receiver, so these numbers can only improve.
count 2340.000000
mean 0.043770
std 0.047784
min 0.009025
25% 0.010003
50% 0.010010
75% 0.109998
max 0.120060
I have to dig more into why this is happening. A lot of frames are delayed much
more than I'd expect, but at this stage I'm pretty sure this is PEBKAC. One
obvious fix is to move some hw around and do a direct link, but I didn't have
time for that right now.
I'm very interested in doing what Richard's original test was when he used
ptp-synched clocks and also used hw receive-time and compared with expected
tx-time. So, while I'm getting that up and running, I thought I should
share the early results.
-Henrik
> The tbs qdisc is designed so it buffers packets until a configurable time before
> their deadline (tx times). If sorting is enabled, regardless of HW offload or SW
> fallback modes, the qdisc uses a rbtree internally so the buffered packets are
> always 'ordered' by the earliest deadline.
>
> If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
> through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW best-effort,
> it will use a 'scheduled' FIFO.
>
> The other configurable parameter from the tbs qdisc is the clockid to be used.
> In order to provide that, this series adds a new API to pkt_sched.h (i.e.
> qdisc_watchdog_init_clockid()).
>
> The tbs qdisc will drop any packets with a transmission time in the past or
> when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
> advance plus configuring the delta parameter for the system correctly makes
> all the difference in reducing the number of drops. Moreover, note that the
> delta parameter ends up defining the Tx time when SW best-effort is used
> given that the timestamps won't be used by the NIC on this case.
>
> Examples:
>
> # SW best-effort with sorting #
>
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
>
> $ tc qdisc add dev enp2s0 parent 100:1 tbs delta 100000 \
> clockid CLOCK_REALTIME sorting
>
> In this example first the mqprio qdisc is setup, then the tbs qdisc is
> configured onto the first hw Tx queue using SW best-effort with sorting
> enabled. Also, it is configured so the timestamps on each packet are in
> reference to the clockid CLOCK_REALTIME and so packets are dequeued from
> the qdisc 100000 nanoseconds before their transmission time.
>
>
> # HW offload without sorting #
>
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
>
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
>
> In this example, the Qdisc will use HW offload for the control of the
> transmission time through the network adapter. It's assumed implicitly
> the timestamp in skbuffs are in reference to the interface's PHC and
> setting any other valid clockid would be treated as an error. Because
> there is no scheduling being performed in the qdisc, setting a delta != 0
> would also be considered an error.
>
>
> # HW offload with sorting #
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
>
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
> clockid CLOCK_REALTIME sorting
>
> Here, the Qdisc will use HW offload for the txtime control again,
> but now sorting will be enabled, and thus there will be scheduling being
> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
> and packets leave the Qdisc "delta" (100000) nanoseconds before
> their transmission time. Because this will be using HW offload and
> since dynamic clocks are not supported by the hrtimer, the system clock
> and the PHC clock must be synchronized for this mode to behave as expected.
>
>
> For testing, we've followed a similar approach from the v1 and v2 testing and
> no significant changes on the results were observed. An updated version of
> udp_tai.c is attached to this cover letter.
>
> For last, most of the To Dos we still have before a final patchset are related
> to further testing the igb support:
> - testing with L2 only talkers + AF_PACKET sockets;
> - testing tbs in conjunction with cbs;
>
> Thanks for all the feedback so far,
> Jesus
-Henrik
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-08 11:37 ` Miroslav Lichvar
@ 2018-03-08 16:25 ` David Miller
0 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2018-03-08 16:25 UTC (permalink / raw)
To: intel-wired-lan
From: Miroslav Lichvar <mlichvar@redhat.com>
Date: Thu, 8 Mar 2018 12:37:22 +0100
> Well, I'd not expect it to be a common use case, but a public NTP
> server could be sending millions of packets per second in traffic
> peaks (typically at *:00:00) over multiple interfaces.
That's the problem.
Bloating up sk_buff for an uncommon use case, penalizing all others,
is a non-starter.
Sorry.
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 17:47 ` Eric Dumazet
@ 2018-03-08 16:44 ` Richard Cochran
2018-03-08 17:56 ` Jesus Sanchez-Palencia
0 siblings, 1 reply; 52+ messages in thread
From: Richard Cochran @ 2018-03-08 16:44 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Mar 07, 2018 at 09:47:40AM -0800, Eric Dumazet wrote:
> I would love if skb->tstamp could be either 0 or expressed in
> ktime_get() base all the time.
>
> ( Even if we would have to convert this to other bases when/if needed)
We really do need variable clock IDs. Otherwise the HW offloading
case won't work. The desired transmit time must be expressed in terms
of the clock inside the MAC. This clock is not necessarily related to
the system time at all.
But in addition to the performance concerns, I think putting this into
a socket option is the more natural solution.
Thanks,
Richard
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-08 16:44 ` Richard Cochran
@ 2018-03-08 17:56 ` Jesus Sanchez-Palencia
0 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-08 17:56 UTC (permalink / raw)
To: intel-wired-lan
Hi,
On 03/08/2018 08:44 AM, Richard Cochran wrote:
> On Wed, Mar 07, 2018 at 09:47:40AM -0800, Eric Dumazet wrote:
>> I would love if skb->tstamp could be either 0 or expressed in
>> ktime_get() base all the time.
>>
>> ( Even if we would have to convert this to other bases when/if needed)
>
> We really do need variable clock IDs. Otherwise the HW offloading
> case won't work. The desired transmit time must be expressed in terms
> of the clock inside the MAC. This clock is not necessarily related to
> the system time at all.
>
> But in addition to the performance concerns, I think putting this into
> a socket option is the more natural solution.
Ok, so we have it settled for clockid now. Providing it per-socket was what we'd
proposed previously, so this was just an attempt to accommodate all the feedback
we got on the v2 RFC.
What about the tc_drop_if_late bit, though? Would it be acceptable to keep it
per-packet, thus eating the 1-bit hole from skbuff if we would #if guard it
(e.g. with CONFIG_NET_SCH_TBS)?
Thanks,
Jesus
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission
2018-03-08 14:09 ` Henrik Austad
@ 2018-03-08 18:06 ` Jesus Sanchez-Palencia
2018-03-08 22:54 ` Henrik Austad
0 siblings, 1 reply; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-08 18:06 UTC (permalink / raw)
To: intel-wired-lan
Hi,
On 03/08/2018 06:09 AM, Henrik Austad wrote:
(...)
>
> A lot of new knobs, I see the need, I would've like to have fewer, but
> you've documented them pretty well. Perhaps we should add something to
> Documentation/ at one stage?
Sure. The idea is to work on that once the interfaces have been accepted.
>
> Anyways, the patches applied cleanly so I gave them a (very) quick spin.
> Using udp_tai and tcpdump in the other end to grab the frames
>
> Setting up with hw offload and sorting in qdisc.
>
> Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss
> bypass as dual-core and i210 is not friends):
>
> udp_tai -c1 -i eth2 -p 20 -P 10000000
>
> Receiver (imx7, kernel 4.9.11):
> chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 256" > tai_imx7.log
>
> Note: this involves 2 swtiches and a somewhat hackish kernel running on the
> receiver, so these numbers can only improve.
>
> count 2340.000000
> mean 0.043770
> std 0.047784
> min 0.009025
> 25% 0.010003
> 50% 0.010010
> 75% 0.109998
> max 0.120060
>
Thanks for giving it a shot.
But I'm not sure I follow the numbers above, sorry :/
Are you computing the packet's Rx timestamp offset from the (expected) Tx time?
> I have to dig more into why this is happening, a lot frames delayed much
> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One
> obvious fix is move some hw around and do a direct link, but I didn't have
> time for that right now.
>
> I'm very interested in doing what Richard's original test was when he used
> ptp-synched clocks and also used hw receive-time and compared with expected
> tx-time. So, while I'm getting that up and running, I thought I should
> share the early results.
Sure, thanks. Which delta and clockid are you using, please?
Also, was this clock synchronized to the PHC? You need that for hw offload with
sorting enabled.
Thanks,
Jesus
(...)
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission
2018-03-08 18:06 ` Jesus Sanchez-Palencia
@ 2018-03-08 22:54 ` Henrik Austad
2018-03-08 23:58 ` Jesus Sanchez-Palencia
0 siblings, 1 reply; 52+ messages in thread
From: Henrik Austad @ 2018-03-08 22:54 UTC (permalink / raw)
To: intel-wired-lan
On Thu, Mar 08, 2018 at 10:06:46AM -0800, Jesus Sanchez-Palencia wrote:
> Hi,
>
>
> On 03/08/2018 06:09 AM, Henrik Austad wrote:
>
> (...)
>
> >
> > A lot of new knobs, I see the need, I would've like to have fewer, but
> > you've documented them pretty well. Perhaps we should add something to
> > Documentation/ at one stage?
>
> Sure. The idea is working on that once the interfaces have been accepted.
Yeah, probably a good idea.
> > Anyways, the patches applied cleanly so I gave them a (very) quick spin.
> > Using udp_tai and tcpdump in the other end to grab the frames
> >
> > Setting up with hw offload and sorting in qdisc.
> >
> > Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss
> > bypass as dual-core and i210 is not friends):
> >
> > udp_tai -c1 -i eth2 -p 20 -P 10000000
> >
> > Receiver (imx7, kernel 4.9.11):
> > chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 256" > tai_imx7.log
> >
> > Note: this involves 2 swtiches and a somewhat hackish kernel running on the
> > receiver, so these numbers can only improve.
> >
> > count 2340.000000
> > mean 0.043770
> > std 0.047784
> > min 0.009025
> > 25% 0.010003
> > 50% 0.010010
> > 75% 0.109998
> > max 0.120060
> >
>
> Thanks for giving it a shot.
>
> But I'm not sure I follow the numbers above, sorry :/
> Are you computing the packet's Rx timestamp offset from the (expected) Tx time?
Just looking at the timestamp when the frames were received. They should be
sent at regular intervals if I read udp_tai.c correctly, so the assumption
was that the timestamp from tcpdump should give an inkling of how well it
worked.
I set it up to send a frame every 10ms and computed the diff between each
UDP packet received. Nothing fancy, just tcpdump and grep for the
timestamp and look at the distribution.
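The measurement described above can be sketched roughly as follows. This is a minimal user-space sketch, assuming the receive timestamps have already been parsed out of the tcpdump log into an array of seconds. The names gap_stats and inter_arrival_stats are hypothetical and are not part of udp_tai or tcpdump.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of the gaps between consecutive receive timestamps. */
struct gap_stats {
	size_t count;	/* number of gaps, i.e. n - 1 */
	double mean;	/* average gap in seconds */
	double min;	/* smallest gap in seconds */
	double max;	/* largest gap in seconds */
};

/*
 * Hypothetical helper: ts[] holds n receive timestamps in seconds,
 * in capture order. Returns all zeroes if fewer than two are given.
 */
struct gap_stats inter_arrival_stats(const double *ts, size_t n)
{
	struct gap_stats s = { 0, 0.0, 0.0, 0.0 };
	double sum = 0.0;
	size_t i;

	if (n < 2)
		return s;
	assert(ts != NULL);

	s.min = s.max = ts[1] - ts[0];
	for (i = 1; i < n; i++) {
		double gap = ts[i] - ts[i - 1];

		if (gap < s.min)
			s.min = gap;
		if (gap > s.max)
			s.max = gap;
		sum += gap;
	}
	s.count = n - 1;
	s.mean = sum / (double)s.count;
	return s;
}
```

With a 10ms send period, a well-behaved setup should give a mean, min and max that all sit close to 0.010.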
> > I have to dig more into why this is happening, a lot frames delayed much
> > more than I'd expect, but at this stage I'm pretty sure this is pebkac. One
> > obvious fix is move some hw around and do a direct link, but I didn't have
> > time for that right now.
> >
> > I'm very interested in doing what Richard's original test was when he used
> > ptp-synched clocks and also used hw receive-time and compared with expected
> > tx-time. So, while I'm getting that up and running, I thought I should
> > share the early results.
>
> Sure, thanks. Which delta and clockid are you using, please?
I used the example provided in -00,
tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
tc qdisc add dev eth2 parent 100:1 tbs offload delta 100000 clockid \
CLOCK_REALTIME sorting
> Also, was this clock synchronized to the PHC? You need that for hw offload with
> sorting enabled.
Hmm, good point, no, NIC clock was not synchronized, I'll do that in the
next round for both sender and receiver!
-henrik
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission
2018-03-08 22:54 ` Henrik Austad
@ 2018-03-08 23:58 ` Jesus Sanchez-Palencia
0 siblings, 0 replies; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-08 23:58 UTC (permalink / raw)
To: intel-wired-lan
Hi,
On 03/08/2018 02:54 PM, Henrik Austad wrote:
> Just looking at the timestamp when the frames were received. They should be
> sent at regular intervals if I read udp_tai.c correctly, so the assumption
> was that the timestamp from tcpdump should give an inkling to how well it
> worked.
>
> I set it up to send a frame every 10ms and computed the diff between each
> UDP packet received. Nothing fancy, just tcpdump and grep for the
> timestamp and look at the distribution.
Ok, I see it now. Just as a reference, this is how I've been running tcpdump on
my tests:
$ tcpdump -i enp3s0 -w foo.pcap -j adapter_unsynced \
-tt --time-stamp-precision=nano udp port 7788 -c 10000
>
>>> I have to dig more into why this is happening, a lot frames delayed much
>>> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One
>>> obvious fix is move some hw around and do a direct link, but I didn't have
>>> time for that right now.
>>>
>>> I'm very interested in doing what Richard's original test was when he used
>>> ptp-synched clocks and also used hw receive-time and compared with expected
>>> tx-time. So, while I'm getting that up and running, I thought I should
>>> share the early results.
>>
>> Sure, thanks. Which delta and clockid are you using, please?
>
> I used the example provided in -00,
>
> tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
>
> tc qdisc add dev eth2 parent 100:1 tbs offload delta 100000 clockid \
> CLOCK_REALTIME sorting
The delta value is highly dependent on the system. I recommend playing around
with it a bit before running long tests. On my KabyLake desktop I noticed that
150us is quite a reliable value, for example (same kernel as yours, and no
preempt-rt applied). But that does not seem to be the issue here.
>
>> Also, was this clock synchronized to the PHC? You need that for hw offload with
>> sorting enabled.
>
> Hmm, good point, no, NIC clock was not synchronized, I'll do that in the
> next round for both sender and receiver!
Oh, then you need to get that set up first. Here I synchronize both PHCs over the
network first with ptp4l:
Rx) $ ptp4l --summary_interval=3 -i enp3s0 -m -2
Tx) $ ptp4l --summary_interval=3 -i enp3s0 -s -m -2 &
My Rx is the PTP master and the Tx is the PTP slave.
Then I synchronize the PHC to the system clock on the Tx side only:
Tx) $ phc2sys -a -r -r -u 8 &
And udp_tai is using CLOCK_REALTIME. The UTC vs TAI 37s offset makes no
difference for this test specifically because I compensate for it when
calculating the offsets on the Rx side.
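As a rough illustration of that compensation, here is a minimal sketch. It assumes the Rx hardware timestamp is on the TAI timescale (PHC synchronized by ptp4l) and the expected Tx time was taken from CLOCK_REALTIME (UTC). Which side carries which timescale is an assumption about this setup, and rx_offset_ns is a hypothetical name.

```c
#include <stdint.h>

#define NSEC_PER_SEC		1000000000LL
/* TAI - UTC in seconds, using the 37s figure quoted above. */
#define TAI_UTC_OFFSET_SEC	37LL

/*
 * Offset between the Rx hardware timestamp (assumed TAI) and the
 * expected Tx time (assumed UTC), both in nanoseconds. The Tx time is
 * moved onto the TAI timescale before subtracting.
 */
int64_t rx_offset_ns(int64_t rx_tai_ns, int64_t tx_utc_ns)
{
	return rx_tai_ns - (tx_utc_ns + TAI_UTC_OFFSET_SEC * NSEC_PER_SEC);
}
```

A positive result means the frame was timestamped on Rx after its intended departure time.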
For the next patchset version I will be providing a more complete set of testing
instructions. I hope that helps for now.
Thanks,
Jesus
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-07 5:24 ` Richard Cochran
2018-03-07 17:01 ` Willem de Bruijn
@ 2018-03-21 12:58 ` Thomas Gleixner
2018-03-21 14:59 ` Richard Cochran
1 sibling, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2018-03-21 12:58 UTC (permalink / raw)
To: intel-wired-lan
On Tue, 6 Mar 2018, Richard Cochran wrote:
> On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
> > This is adding 32+1 bits to sk_buff, and possibly holes in this very
> > very hot (and already too fat) structure.
> >
> > Do we really need 32 bits for a clockid_t ?
>
> Probably we can live with fewer bits.
>
> For clock IDs with a positive sign, the max possible clock value is 16.
>
> For clock IDs with a negative sign, IIRC, three bits are for the type
> code (we have also posix timers packed like this) and the rest are for the
> file descriptor. So maybe we could use 16 bits, allowing 12 bits or
> so for encoding the FD.
>
> The downside would be that this forces the application to make sure
> and open the dynamic posix clock early enough before the FD count gets
> too high.
Errm. No. There is no way to support fd based clocks or one of the CPU
time/process time based clocks for this.
CLOCK_REALTIME and CLOCK_MONOTONIC are probably the only interesting
ones. BOOTTIME is hopefully soon irrelevant as we make MONOTONIC and
BOOTTIME the same, unless this unexpectedly causes major issues. I don't
think that CLOCK_TAI makes sense in that context, but I might be wrong.
The rest of the CLOCK_* space cannot be used at all.
So you need at max 2 bits for this, but I think 1 is good enough.
Thanks,
tglx
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc Jesus Sanchez-Palencia
@ 2018-03-21 13:46 ` Thomas Gleixner
2018-04-23 18:21 ` Jesus Sanchez-Palencia
0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2018-03-21 13:46 UTC (permalink / raw)
To: intel-wired-lan
On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> +struct tbs_sched_data {
> + bool sorting;
> + int clockid;
> + int queue;
> + s32 delta; /* in ns */
> + ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> + struct rb_root head;
Hmm. You are reimplementing timerqueue open coded. Have you checked whether
you could reuse the timerqueue implementation?
That requires adding a timerqueue node to struct sk_buff:
@@ -671,7 +671,8 @@ struct sk_buff {
unsigned long dev_scratch;
};
};
- struct rb_node rbnode; /* used in netem & tcp stack */
+ struct rb_node rbnode; /* used in netem & tcp stack */
+ struct timerqueue_node tqnode;
};
struct sock *sk;
Then you can use timerqueue_head in your scheduler data and all the open
coded rbtree handling goes away.
> +static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + ktime_t txtime = nskb->tstamp;
> + struct sock *sk = nskb->sk;
> + ktime_t now;
> +
> + if (sk && !sock_flag(sk, SOCK_TXTIME))
> + return false;
> +
> + /* We don't perform crosstimestamping.
> + * Drop if packet's clockid differs from qdisc's.
> + */
> + if (nskb->txtime_clockid != q->clockid)
> + return false;
> +
> + now = get_time_by_clockid(q->clockid);
If you store the time getter function pointer in tbs_sched_data then you
avoid the lookup and can just do
now = q->get_time();
That applies to lots of other places.
> + if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
> + return false;
> +
> + return true;
> +}
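The time getter suggestion in this function can be illustrated with a small user-space sketch. It assumes clock_gettime() stands in for the kernel's ktime getters, and the names sched_data, sched_data_init and get_time_fn are hypothetical, not the patch's actual code. The point is that the clockid is resolved once at setup time, so the hot path is a single indirect call.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

typedef int64_t (*get_time_fn)(void);

/* Read the given clock in nanoseconds, or -1 on failure. */
static int64_t clock_ns(clockid_t id)
{
	struct timespec ts;

	if (clock_gettime(id, &ts) != 0)
		return -1;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int64_t get_realtime_ns(void)  { return clock_ns(CLOCK_REALTIME); }
int64_t get_monotonic_ns(void) { return clock_ns(CLOCK_MONOTONIC); }

/* Stand-in for the qdisc private data. */
struct sched_data {
	clockid_t clockid;
	get_time_fn get_time;	/* resolved once at init time */
};

/* Pick the getter for the configured clock; -1 for unsupported clocks. */
int sched_data_init(struct sched_data *q, clockid_t clockid)
{
	switch (clockid) {
	case CLOCK_REALTIME:
		q->get_time = get_realtime_ns;
		break;
	case CLOCK_MONOTONIC:
		q->get_time = get_monotonic_ns;
		break;
	default:
		return -1;
	}
	q->clockid = clockid;
	return 0;
}
```

After a successful sched_data_init(), callers use q->get_time() everywhere instead of switching on the clockid for each packet.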
> +
> +static struct sk_buff *tbs_peek(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> +
> + return q->peek(sch);
> +}
> +
> +static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + struct rb_node *p;
> +
> + p = rb_first(&q->head);
timerqueue gives you direct access to the first expiring entry w/o walking
the rbtree. So that would become:
p = timerqueue_getnext(&q->tqhead);
return p ? rb_to_skb(p) : NULL;
> + if (!p)
> + return NULL;
> +
> + return rb_to_skb(p);
> +}
> +static int tbs_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
> + struct sk_buff **to_free)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + struct rb_node **p = &q->head.rb_node, *parent = NULL;
> + ktime_t txtime = nskb->tstamp;
> +
> + if (!is_packet_valid(sch, nskb))
> + return qdisc_drop(nskb, sch, to_free);
> +
> + while (*p) {
> + struct sk_buff *skb;
> +
> + parent = *p;
> + skb = rb_to_skb(parent);
> + if (ktime_after(txtime, skb->tstamp))
> + p = &parent->rb_right;
> + else
> + p = &parent->rb_left;
> + }
> + rb_link_node(&nskb->rbnode, parent, p);
> + rb_insert_color(&nskb->rbnode, &q->head);
That'd become:
nskb->tqnode.expires = txtime;
timerqueue_add(&q->tqhead, &nskb->tqnode);
> + qdisc_qstats_backlog_inc(sch, nskb);
> + sch->q.qlen++;
> +
> + /* Now we may need to re-arm the qdisc watchdog for the next packet. */
> + reset_watchdog(sch);
> +
> + return NET_XMIT_SUCCESS;
> +}
> +
> +static void timesortedlist_erase(struct Qdisc *sch, struct sk_buff *skb,
> + bool drop)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> +
> + rb_erase(&skb->rbnode, &q->head);
> +
> + qdisc_qstats_backlog_dec(sch, skb);
> +
> + if (drop) {
> + struct sk_buff *to_free = NULL;
> +
> + qdisc_drop(skb, sch, &to_free);
> + kfree_skb_list(to_free);
> + qdisc_qstats_overlimit(sch);
> + } else {
> + qdisc_bstats_update(sch, skb);
> +
> + q->last = skb->tstamp;
> + }
> +
> + sch->q.qlen--;
> +
> + /* The rbnode field in the skb re-uses these fields, now that
> + * we are done with the rbnode, reset them.
> + */
> + skb->next = NULL;
> + skb->prev = NULL;
> + skb->dev = qdisc_dev(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> +
> + return q->dequeue(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + struct sk_buff *skb = tbs_peek(sch);
> + ktime_t now, next;
> +
> + if (!skb)
> + return NULL;
> +
> + now = get_time_by_clockid(q->clockid);
> +
> + /* Drop if packet has expired while in queue and the drop_if_late
> + * flag is set.
> + */
> + if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
> + struct sk_buff *to_free = NULL;
> +
> + qdisc_queue_drop_head(sch, &to_free);
> + kfree_skb_list(to_free);
> + qdisc_qstats_overlimit(sch);
> +
> + skb = NULL;
> + goto out;
Instead of going out immediately you should check whether the next skb is
already due for sending.
> + }
> +
> + next = ktime_sub_ns(skb->tstamp, q->delta);
> +
> + /* Dequeue only if now is within the [txtime - delta, txtime] range. */
> + if (ktime_after(now, next))
> + skb = qdisc_dequeue_head(sch);
> + else
> + skb = NULL;
> +
> +out:
> + /* Now we may need to re-arm the qdisc watchdog for the next packet. */
> + reset_watchdog(sch);
> +
> + return skb;
> +}
> +
> +static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + struct sk_buff *skb;
> + ktime_t now, next;
> +
> + skb = tbs_peek(sch);
> + if (!skb)
> + return NULL;
> +
> + now = get_time_by_clockid(q->clockid);
> +
> + /* Drop if packet has expired while in queue and the drop_if_late
> + * flag is set.
> + */
> + if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
> + timesortedlist_erase(sch, skb, true);
> + skb = NULL;
> + goto out;
Same as above.
> + }
> +
> + next = ktime_sub_ns(skb->tstamp, q->delta);
> +
> + /* Dequeue only if now is within the [txtime - delta, txtime] range. */
> + if (ktime_after(now, next))
> + timesortedlist_erase(sch, skb, false);
> + else
> + skb = NULL;
> +
> +out:
> + /* Now we may need to re-arm the qdisc watchdog for the next packet. */
> + reset_watchdog(sch);
> +
> + return skb;
> +}
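Both dequeue paths above get the same comment: after dropping a late packet, look at the next one instead of returning. A minimal user-space sketch of that loop follows. It assumes a txtime-sorted array stands in for the rbtree or FIFO, and the names pkt, pkt_queue and dequeue_due are hypothetical. The comparisons mirror the ktime_before() / ktime_after() checks in the quoted patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pkt {
	int64_t txtime;		/* requested transmission time, ns */
	bool drop_if_late;	/* drop instead of sending late */
};

struct pkt_queue {
	struct pkt *pkts;	/* sorted by txtime, earliest first */
	size_t head;		/* index of the next packet to consider */
	size_t len;
	size_t dropped;		/* packets dropped for being late */
};

/*
 * Return the next packet that may go to the device at time 'now',
 * or NULL if the head of the queue is not due yet or the queue is
 * empty. Expired packets are dropped and the scan continues.
 */
struct pkt *dequeue_due(struct pkt_queue *q, int64_t now, int64_t delta)
{
	while (q->head < q->len) {
		struct pkt *p = &q->pkts[q->head];

		if (p->drop_if_late && p->txtime < now) {
			/* Expired: drop it and check the next packet. */
			q->head++;
			q->dropped++;
			continue;
		}

		/* Dequeue only once now is past txtime - delta. */
		if (now > p->txtime - delta) {
			q->head++;
			return p;
		}
		return NULL;
	}
	return NULL;
}
```

In the qdisc itself the watchdog would still be re-armed once the loop ends; that part is left out of the sketch.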
> +
> +static inline void setup_queueing_mode(struct tbs_sched_data *q)
> +{
> + if (q->sorting) {
> + q->enqueue = tbs_enqueue_timesortedlist;
> + q->dequeue = tbs_dequeue_timesortedlist;
> + q->peek = tbs_peek_timesortedlist;
> + } else {
> + q->enqueue = tbs_enqueue_scheduledfifo;
> + q->dequeue = tbs_dequeue_scheduledfifo;
> + q->peek = qdisc_peek_head;
I don't see the point of these two modes and all the duplicated code it
involves.
FIFO mode limits usage to a single thread which has to guarantee that the
packets are queued in time order.
If you look at the use cases of TDM in various fields then FIFO mode is
pretty much useless. In industrial/automotive fieldbus applications the
various time slices are filled by different threads or even processes.
Sure, the rbtree queue/dequeue has overhead compared to a simple linked
list, but you pay for that with more indirections and lots of mostly
duplicated code. And in the worst case one of these code paths is going to
be rarely used and prone to bitrot.
Thanks,
tglx
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS Jesus Sanchez-Palencia
@ 2018-03-21 14:22 ` Thomas Gleixner
2018-03-21 15:03 ` Richard Cochran
2018-03-22 23:15 ` Jesus Sanchez-Palencia
0 siblings, 2 replies; 52+ messages in thread
From: Thomas Gleixner @ 2018-03-21 14:22 UTC (permalink / raw)
To: intel-wired-lan
On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
>
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
>
> In this example, the Qdisc will use HW offload for the control of the
> transmission time through the network adapter. It's assumed the timestamp
> in skbuffs are in reference to the interface's PHC and setting any other
> valid clockid would be treated as an error. Because there is no
> scheduling being performed in the qdisc, setting a delta != 0 would also
> be considered an error.
Which clockid will be handed in from the application? The network adapter
time has no fixed clockid. The only way you can get to it is via a fd based
posix clock and that does not work at all because the qdisc setup might
have a different FD than the application which queues packets.
I think this should look like this:
clock_adapter: 1 = clock of the network adapter
0 = system clock selected by clock_system
clock_system: 0 = CLOCK_REALTIME
1 = CLOCK_MONOTONIC
or something like that.
> Example 2:
>
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
>
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
> clockid CLOCK_REALTIME sorting
>
> Here, the Qdisc will use HW offload for the txtime control again,
> but now sorting will be enabled, and thus there will be scheduling being
> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
> reference and packets leave the Qdisc "delta" (100000) nanoseconds before
> their transmission time. Because this will be using HW offload and
> since dynamic clocks are not supported by the hrtimer, the system clock
> and the PHC clock must be synchronized for this mode to behave as expected.
So what you do here is queue the packets in the qdisc and then schedule
them at some point ahead of actual transmission time for delivery to the
hardware. That delivery uses the same txtime as used for qdisc scheduling
to tell the hardware when the packet should go on the wire. That's needed
when the network adapter does not support queueing of multiple packets.
Bah, and probably there you need CLOCK_TAI because that's what PTP is based
on, so clock_system needs to accommodate that as well. Dammit, there goes
the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
bits plus the adapter bit.
Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
don't see us adding new fixed clocks, so we really can reserve #15 for
selecting the adapter clock if sparing that extra bit is truly required.
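The 4-bit idea could be sketched like this. It is only an illustration of the encoding described above, not a proposed interface. The names txtime_clock_encode, txtime_clock_is_adapter and the TXTIME_* macros are hypothetical. Fixed clock ids 0..14 are stored as-is and 15 is reserved to mean "clock of the network adapter".

```c
#include <stdbool.h>

#define TXTIME_CLOCK_BITS	4
#define TXTIME_CLOCK_MASK	((1u << TXTIME_CLOCK_BITS) - 1)
/* Reserved selector: use the network adapter's clock. */
#define TXTIME_CLOCK_ADAPTER	15u

/*
 * Pack the clock choice into a 4-bit selector. Returns the selector
 * (0..15), or -1 if clockid is not a fixed clock id in the 0..14 range.
 * When use_adapter is set, clockid is ignored.
 */
int txtime_clock_encode(bool use_adapter, int clockid)
{
	if (use_adapter)
		return (int)TXTIME_CLOCK_ADAPTER;
	if (clockid < 0 || clockid >= (int)TXTIME_CLOCK_ADAPTER)
		return -1;
	return clockid;
}

/* True if the selector refers to the adapter clock. */
bool txtime_clock_is_adapter(unsigned int sel)
{
	return (sel & TXTIME_CLOCK_MASK) == TXTIME_CLOCK_ADAPTER;
}
```

With this layout CLOCK_TAI (11) fits without an extra adapter bit, at the cost of never being able to hand out 15 as a fixed clock id.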
Thanks,
tglx
^ permalink raw reply [flat|nested] 52+ messages in thread
* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
2018-03-21 12:58 ` Thomas Gleixner
@ 2018-03-21 14:59 ` Richard Cochran
0 siblings, 0 replies; 52+ messages in thread
From: Richard Cochran @ 2018-03-21 14:59 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Mar 21, 2018 at 01:58:51PM +0100, Thomas Gleixner wrote:
> Errm. No. There is no way to support fd based clocks or one of the CPU
> time/process time based clocks for this.
Why not?
If we have HW offloading, then the transmit time had better be
expressed in terms of the MAC's internal clock. Otherwise we would
need to translate between a kernel clock and the MAC clock, but that
is expensive (eg over PCIe) and silly (because in a typical use case
the MAC will already be synchronized to the network time).
Thanks,
Richard
* [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
2018-03-21 14:22 ` Thomas Gleixner
@ 2018-03-21 15:03 ` Richard Cochran
2018-03-22 23:15 ` Jesus Sanchez-Palencia
1 sibling, 0 replies; 52+ messages in thread
From: Richard Cochran @ 2018-03-21 15:03 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Mar 21, 2018 at 03:22:11PM +0100, Thomas Gleixner wrote:
> Which clockid will be handed in from the application? The network adapter
> time has no fixed clockid. The only way you can get to it is via a fd based
> posix clock and that does not work at all because the qdisc setup might
> have a different FD than the application which queues packets.
Duh. That explains it. Please ignore my "why not?" Q in the other thread...
Thanks,
Richard
* [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
2018-03-21 14:22 ` Thomas Gleixner
2018-03-21 15:03 ` Richard Cochran
@ 2018-03-22 23:15 ` Jesus Sanchez-Palencia
2018-03-23 8:51 ` Thomas Gleixner
1 sibling, 1 reply; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-22 23:15 UTC (permalink / raw)
To: intel-wired-lan
Hi,
On 03/21/2018 07:22 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
>>
>> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
>>
>> In this example, the Qdisc will use HW offload for the control of the
>> transmission time through the network adapter. It's assumed the timestamp
>> in skbuffs are in reference to the interface's PHC and setting any other
>> valid clockid would be treated as an error. Because there is no
>> scheduling being performed in the qdisc, setting a delta != 0 would also
>> be considered an error.
>
> Which clockid will be handed in from the application? The network adapter
> time has no fixed clockid. The only way you can get to it is via a fd based
> posix clock and that does not work at all because the qdisc setup might
> have a different FD than the application which queues packets.
Yes. As a result, we came up with a rather simplistic solution that would still
allow for dynamic clocks to be used in the future without any API changes. As of
the v3 RFC, the qdisc returns -EINVAL if a netlink application (i.e. tc) tries
to initialize it in 'raw' hw offload passing any clockid != CLOCKID_INVALID. The
skbuffs' clockid was initialized with the same value, so if the application sets
its value to any other valid clockids through the cmsg interface, the qdisc
would just drop the packets on enqueue() due to the mismatch.
In other words, dynamic clocks are currently not used at all.
(I noticed later that this was broken anyway because the definition of invalid
clockids from posix-timers.h is actually only valid for negative numbers.)
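For context, a small sketch of how an fd-based (dynamic) clockid is derived
from a file descriptor. It follows the FD_TO_CLOCKID() convention used for
dynamic POSIX clocks, rewritten with unsigned arithmetic; the function name
fd_to_clockid() is only illustrative, and the final conversion back to a
signed clockid_t assumes the usual two's-complement behaviour.

```c
#include <sys/types.h>	/* clockid_t */

#define CLOCKFD 3	/* low bits that mark an fd-based clock */

/*
 * Illustrative equivalent of FD_TO_CLOCKID(): the inverted fd is shifted
 * into the upper bits, so the resulting id is negative and depends on the
 * fd number held by the calling process.
 */
clockid_t fd_to_clockid(int fd)
{
	return (clockid_t)((~(unsigned int)fd << 3) | CLOCKFD);
}
```

Two processes that open the same PTP device can end up with different fd
numbers and therefore different clockids, which is why such an id cannot
simply be compared between the qdisc setup and the sending application.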
Given all the feedback against adding the clockid into struct sk_buff, for the
next version, we'll have to re-think this anyway now that clockid will be set
per socket (i.e. as an argument to the SO_TXTIME) and not per packet anymore.
>
> I think this should look like this:
>
> clock_adapter: 1 = clock of the network adapter
> 0 = system clock selected by clock_system
>
> clock_system: 0 = CLOCK_REALTIME
> 1 = CLOCK_MONOTONIC
>
> or something like that.
>
>> Example 2:
>>
>> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
>>
>> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
>> clockid CLOCK_REALTIME sorting
>>
>> Here, the Qdisc will use HW offload for the txtime control again,
>> but now sorting will be enabled, and thus there will be scheduling being
>> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
>> reference and packets leave the Qdisc "delta" (100000) nanoseconds before
>> their transmission time. Because this will be using HW offload and
>> since dynamic clocks are not supported by the hrtimer, the system clock
>> and the PHC clock must be synchronized for this mode to behave as expected.
>
> So what you do here is queueing the packets in the qdisc and then schedule
> them at some point ahead of actual transmission time for delivery to the
> hardware. That delivery uses the same txtime as used for qdisc scheduling
> to tell the hardware when the packet should go on the wire. That's needed
> when the network adapter does not support queueing of multiple packets.
>
> Bah, and probably there you need CLOCK_TAI because that's what PTP is based
> on, so clock_system needs to accommodate that as well. Dammit, there goes
> the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
> bits plus the adapter bit.
>
> Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
> don't see us adding new fixed clocks, so we really can reserve #15 for
> selecting the adapter clock if sparing that extra bit is truly required.
So what about just using the previous single 'clockid' argument, but then just
adding to uapi time.h something like:
#define DYNAMIC_CLOCKID 15
And using it for that, instead. This way applications that will use the raw hw
offload mode must use this value for their per-socket clockid, and the qdisc's
clockid would be implicitly initialized to the same value.
What do you think?
Thanks,
Jesus
>
> Thanks,
>
> tglx
>
>
* [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
2018-03-22 23:15 ` Jesus Sanchez-Palencia
@ 2018-03-23 8:51 ` Thomas Gleixner
0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2018-03-23 8:51 UTC (permalink / raw)
To: intel-wired-lan
On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 07:22 AM, Thomas Gleixner wrote:
> > Bah, and probably there you need CLOCK_TAI because that's what PTP is based
> > on, so clock_system needs to accommodate that as well. Dammit, there goes
> > the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
> > bits plus the adapter bit.
> >
> > Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
> > don't see us adding new fixed clocks, so we really can reserve #15 for
> > selecting the adapter clock if sparing that extra bit is truly required.
>
>
> So what about just using the previous single 'clockid' argument, but then just
> adding to uapi time.h something like:
>
> #define DYNAMIC_CLOCKID 15
>
> And using it for that, instead. This way applications that will use the raw hw
> offload mode must use this value for their per-socket clockid, and the qdisc's
> clockid would be implicitly initialized to the same value.
That's what I suggested above.
Thanks,
tglx
* [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
2018-03-21 13:46 ` Thomas Gleixner
@ 2018-04-23 18:21 ` Jesus Sanchez-Palencia
2018-04-24 8:50 ` Thomas Gleixner
0 siblings, 1 reply; 52+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-04-23 18:21 UTC (permalink / raw)
To: intel-wired-lan
Hi Thomas,
On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> +struct tbs_sched_data {
>> + bool sorting;
>> + int clockid;
>> + int queue;
>> + s32 delta; /* in ns */
>> + ktime_t last; /* The txtime of the last skb sent to the netdevice. */
>> + struct rb_root head;
>
> Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> you could reuse the timerqueue implementation?
>
> That requires to add a timerqueue node to struct skbuff
>
> @@ -671,7 +671,8 @@ struct sk_buff {
> unsigned long dev_scratch;
> };
> };
> - struct rb_node rbnode; /* used in netem & tcp stack */
> + struct rb_node rbnode; /* used in netem & tcp stack */
> + struct timerqueue_node tqnode;
> };
> struct sock *sk;
>
> Then you can use timerqueue_head in your scheduler data and all the open
> coded rbtree handling goes away.
I just noticed that doing the above increases the size of struct sk_buff by 8
bytes - struct timerqueue_node is 32 bytes long while struct rb_node is only
24 bytes long.
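The 8-byte difference can be reproduced with simplified userspace models of
the two structures. These are not the kernel definitions: the *_model names
are made up, and the layout (three pointer-sized words for the rbtree node,
plus a 64-bit expiry for the timerqueue node) is an assumption that holds
on a 64-bit build.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of struct rb_node: parent/color word plus two links. */
struct rb_node_model {
	unsigned long parent_color;
	struct rb_node_model *right;
	struct rb_node_model *left;
};

/* Simplified model of struct timerqueue_node: rb node plus expiry time. */
struct timerqueue_node_model {
	struct rb_node_model node;
	int64_t expires;	/* stands in for ktime_t */
};

/* Extra bytes needed when the larger node replaces the plain rb node. */
size_t tqnode_growth(void)
{
	return sizeof(struct timerqueue_node_model) -
	       sizeof(struct rb_node_model);
}
```

On a 64-bit build the models come out at 24 and 32 bytes, so a union that
holds the timerqueue node instead of only the rb node grows by 8 bytes.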
Given the feedback we got here before against touching struct sk_buff at all for
non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
keeping the open-coded version for now, ok?
Thanks,
Jesus
(...)
* [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
2018-04-23 18:21 ` Jesus Sanchez-Palencia
@ 2018-04-24 8:50 ` Thomas Gleixner
2018-04-24 13:50 ` David Miller
0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2018-04-24 8:50 UTC (permalink / raw)
To: intel-wired-lan
On Mon, 23 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> > On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> >> +struct tbs_sched_data {
> >> + bool sorting;
> >> + int clockid;
> >> + int queue;
> >> + s32 delta; /* in ns */
> >> + ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> >> + struct rb_root head;
> >
> > Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> > you could reuse the timerqueue implementation?
> >
> > That requires to add a timerqueue node to struct skbuff
> >
> > @@ -671,7 +671,8 @@ struct sk_buff {
> > unsigned long dev_scratch;
> > };
> > };
> > - struct rb_node rbnode; /* used in netem & tcp stack */
> > + struct rb_node rbnode; /* used in netem & tcp stack */
> > + struct timerqueue_node tqnode;
> > };
> > struct sock *sk;
> >
> > Then you can use timerqueue_head in your scheduler data and all the open
> > coded rbtree handling goes away.
>
>
> I just noticed that doing the above increases the size of struct sk_buff by 8
> bytes - struct timerqueue_node is 32 bytes long while struct rb_node is only
> 24 bytes long.
>
> Given the feedback we got here before against touching struct sk_buff at all for
> non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
> keeping the open-coded version for now, ok?
The size of sk_buff is 216 and the size of sk_buff_fclones is 440
bytes. The sk_buff and sk_buff_fclones kmem_caches use objects sized 256
and 512 bytes because the kmem_caches are created with SLAB_HWCACHE_ALIGN.
So adding 8 bytes to spare duplicated code will not change the kmem_cache
object size and I really doubt that anyone will notice.
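As a worked check of the sk_buff figure only, here is a sketch that rounds
an object size up to a cache-line multiple. The 64-byte line size and the
helper names are assumptions for illustration; the real slab allocator
applies further sizing rules that this does not model, and it says nothing
about which cache line an individual field lands on.

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64u	/* assumption: 64-byte cache lines */

/* Round an object size up to the next multiple of the cache line size. */
size_t cache_aligned_size(size_t size)
{
	return (size + CACHE_LINE_SIZE - 1) & ~(size_t)(CACHE_LINE_SIZE - 1);
}

/* Number of cache lines needed to hold an object of the given size. */
size_t cache_lines_spanned(size_t size)
{
	return cache_aligned_size(size) / CACHE_LINE_SIZE;
}
```

Under these assumptions both 216 and 216 + 8 = 224 bytes round up to 256,
i.e. four cache lines, which matches the 256-byte object size quoted above.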
Thanks,
tglx
* [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
2018-04-24 8:50 ` Thomas Gleixner
@ 2018-04-24 13:50 ` David Miller
0 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2018-04-24 13:50 UTC (permalink / raw)
To: intel-wired-lan
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 24 Apr 2018 10:50:04 +0200 (CEST)
> So adding 8 bytes to spare duplicated code will not change the kmem_cache
> object size and I really doubt that anyone will notice.
It's about where the cache lines end up when each and every byte is added
to the structure, not just the slab object size.
Thread overview: 52+ messages
2018-03-07 1:12 [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 01/18] sock: Fix SO_ZEROCOPY switch case Jesus Sanchez-Palencia
2018-03-07 16:58 ` Willem de Bruijn
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path Jesus Sanchez-Palencia
2018-03-07 16:59 ` Willem de Bruijn
2018-03-07 22:03 ` Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 03/18] posix-timers: Add CLOCKID_INVALID mask Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 04/18] net: Add a new socket option for a future transmit time Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission Jesus Sanchez-Palencia
2018-03-07 17:00 ` Willem de Bruijn
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 06/18] net: ipv4: udp: " Jesus Sanchez-Palencia
2018-03-07 17:00 ` Willem de Bruijn
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 07/18] net: packet: " Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params Jesus Sanchez-Palencia
2018-03-07 2:53 ` Eric Dumazet
2018-03-07 5:24 ` Richard Cochran
2018-03-07 17:01 ` Willem de Bruijn
2018-03-07 17:35 ` Richard Cochran
2018-03-07 17:37 ` Richard Cochran
2018-03-07 17:47 ` Eric Dumazet
2018-03-08 16:44 ` Richard Cochran
2018-03-08 17:56 ` Jesus Sanchez-Palencia
2018-03-21 12:58 ` Thomas Gleixner
2018-03-21 14:59 ` Richard Cochran
2018-03-07 21:52 ` Jesus Sanchez-Palencia
2018-03-07 22:45 ` Eric Dumazet
2018-03-07 23:03 ` David Miller
2018-03-08 11:37 ` Miroslav Lichvar
2018-03-08 16:25 ` David Miller
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 09/18] net: ipv4: raw: Handle remaining txtime parameters Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 10/18] net: ipv4: udp: " Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 11/18] net: packet: " Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 12/18] net/sched: Allow creating a Qdisc watchdog with other clocks Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc Jesus Sanchez-Palencia
2018-03-21 13:46 ` Thomas Gleixner
2018-04-23 18:21 ` Jesus Sanchez-Palencia
2018-04-24 8:50 ` Thomas Gleixner
2018-04-24 13:50 ` David Miller
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS Jesus Sanchez-Palencia
2018-03-21 14:22 ` Thomas Gleixner
2018-03-21 15:03 ` Richard Cochran
2018-03-22 23:15 ` Jesus Sanchez-Palencia
2018-03-23 8:51 ` Thomas Gleixner
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 15/18] igb: Refactor igb_configure_cbs() Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 16/18] igb: Only change Tx arbitration when CBS is on Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 17/18] igb: Refactor igb_offload_cbs() Jesus Sanchez-Palencia
2018-03-07 1:12 ` [Intel-wired-lan] [RFC v3 net-next 18/18] igb: Add support for TBS offload Jesus Sanchez-Palencia
2018-03-07 5:28 ` [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission Richard Cochran
2018-03-08 14:09 ` Henrik Austad
2018-03-08 18:06 ` Jesus Sanchez-Palencia
2018-03-08 22:54 ` Henrik Austad
2018-03-08 23:58 ` Jesus Sanchez-Palencia