* RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable
From: Hayes Wang @ 2016-11-24 13:26 UTC (permalink / raw)
To: Mark Lord, netdev@vger.kernel.org
Cc: nic_swsd, linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org
In-Reply-To: <baf5246d-9d8a-4029-6823-350ed561fd33@pobox.com>
Mark Lord [mailto:mlord@pobox.com]
> Sent: Thursday, November 24, 2016 8:31 PM
[...]
> Nope. Guard zones did not fix it, so it's probably not a prefetch issue.
> Oddly, adding a couple of memory barriers to specific places in the driver
> does help, A LOT. Still not 100%, but it did pass 1800 reboot tests over night
> with only three bad rx_desc's reported.
>
> That's a new record here for the driver using kmalloc'd buffers,
> and put reliability on par with using non-cacheable buffers.
>
> Any way we look at it though, the chip/driver are simply unreliable,
> and relying upon hardware checksums (which fail due to the driver
> looking at garbage rather than the checksum bits) leads to data corruption.
I don't think the garbage comes from our driver or device.
If it is a memory issue, I think the host driver ought
to deal with it, because it handles the DMA.
Besides, it doesn't seem to occur on all platforms. I have
run iperf for more than 26 hours, and it still works fine.
I think I would get the same result on an x86 or x86_64 platform.
Best Regards,
Hayes
* Re: [PATCH iproute2 0/2] tc/cls_flower: Support for ip tunnel metadata set/release/classify
From: Jiri Benc @ 2016-11-24 13:38 UTC (permalink / raw)
To: Amir Vadai
Cc: Stephen Hemminger, David S. Miller, netdev, Or Gerlitz,
Hadar Har-Zion, Roi Dayan
In-Reply-To: <20161121102056.13468-1-amir@vadai.me>
On Mon, 21 Nov 2016 12:20:54 +0200, Amir Vadai wrote:
> $ tc filter add dev vxlan0 protocol ip parent ffff: \
> flower \
> enc_src_ip 11.11.0.2 \
> enc_dst_ip 11.11.0.1 \
> enc_key_id 11 \
> dst_ip 11.11.11.1 \
> action tunnel_key release \
> action mirred egress redirect dev vnet0
I really hate the "action tunnel_key release". This just exposes the
kernel internal implementation detail (dst_metadata) to the user. Why
should the user care about explicit releasing of the tunnel key? This
should happen automatically. Users do not care about our internal
implementation.
> $ tc filter add dev net0 protocol ip parent ffff: \
> flower \
> ip_proto 1 \
> dst_ip 11.11.11.2 \
> action tunnel_key set \
> src_ip 11.11.0.1 \
> dst_ip 11.11.0.2 \
> id 11 \
> action mirred egress redirect dev vxlan0
Do you see the asymmetry? This is not called "alloc tunnel_key", and
rightly so. It's very reasonable to call this "set", as it is what the
action looks like to the user.
The only argument for the existence of an explicit "release" (we should
rather call it "unset" in such case, though) is forwarding between two
tunnels, where metadata from the first tunnel will be used for
encapsulation done by the second tunnel. Or a similar case when there's
classification based on the tunnel metadata done on the mirred
interface. Somewhat corner cases, though. If we want to support them,
then let's call the action "unset" and not "release". And in any case,
it should not be mandatory to specify it, which should be made clear
in the documentation (including examples where it is needed - basically
only when forwarding between tunnels).
Jiri
* Re: [patch net-next] sfc: remove unneeded variable
From: Dan Carpenter @ 2016-11-24 13:49 UTC (permalink / raw)
To: Edward Cree
Cc: Solarflare linux maintainers, Bert Kenward, netdev,
kernel-janitors
In-Reply-To: <698affdd-d4e0-53c0-fff0-9b66252504a0@solarflare.com>
On Thu, Nov 24, 2016 at 01:22:24PM +0000, Edward Cree wrote:
> On 24/11/16 11:16, Dan Carpenter wrote:
> > We don't use ->heap_buf after commit 46d1efd852cc ("sfc: remove Software
> > TSO") so let's remove the last traces.
> >
> > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> >
> > diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
> > index f97f828..fd17bda 100644
> > --- a/drivers/net/ethernet/sfc/net_driver.h
> > +++ b/drivers/net/ethernet/sfc/net_driver.h
> > @@ -139,8 +139,6 @@ struct efx_special_buffer {
> > * struct efx_tx_buffer - buffer state for a TX descriptor
> > * @skb: When @flags & %EFX_TX_BUF_SKB, the associated socket buffer to be
> > * freed when descriptor completes
> > - * @heap_buf: When @flags & %EFX_TX_BUF_HEAP, the associated heap buffer to be
> > - * freed when descriptor completes.
>
> Does that mean we can also remove EFX_TX_BUF_HEAP?
Good point. I will resend.
regards,
dan carpenter
* [PATCH] ath5k: drop duplicate header vmalloc.h
From: Geliang Tang @ 2016-11-24 13:58 UTC (permalink / raw)
To: Jiri Slaby, Nick Kossifidis, Luis R. Rodriguez, Kalle Valo
Cc: Geliang Tang, linux-wireless, netdev, linux-kernel
Drop duplicate header vmalloc.h from ath5k/debug.c.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
drivers/net/wireless/ath/ath5k/debug.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/wireless/ath/ath5k/debug.c b/drivers/net/wireless/ath/ath5k/debug.c
index 4f8d9ed..d068df5 100644
--- a/drivers/net/wireless/ath/ath5k/debug.c
+++ b/drivers/net/wireless/ath/ath5k/debug.c
@@ -66,7 +66,6 @@
#include <linux/seq_file.h>
#include <linux/list.h>
-#include <linux/vmalloc.h>
#include "debug.h"
#include "ath5k.h"
#include "reg.h"
--
2.9.3
* [PATCH] ibmvnic: drop duplicate header seq_file.h
From: Geliang Tang @ 2016-11-24 13:58 UTC (permalink / raw)
To: Thomas Falcon, John Allen, Benjamin Herrenschmidt, Paul Mackerras,
Michael Ellerman
Cc: Geliang Tang, netdev, linuxppc-dev, linux-kernel
In-Reply-To: <15299de49216a2360976ca37ff774cae9d27d88b.1479991297.git.geliangtang@gmail.com>
Drop duplicate header seq_file.h from ibmvnic.c.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 1e486d1..c125966 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -74,7 +74,6 @@
#include <asm/iommu.h>
#include <linux/uaccess.h>
#include <asm/firmware.h>
-#include <linux/seq_file.h>
#include <linux/workqueue.h>
#include "ibmvnic.h"
--
2.9.3
* [PATCH] net: ieee802154: drop duplicate header delay.h
From: Geliang Tang @ 2016-11-24 13:58 UTC (permalink / raw)
To: Michael Hennerich, Alexander Aring
Cc: Geliang Tang, linux-wpan, netdev, linux-kernel
In-Reply-To: <15299de49216a2360976ca37ff774cae9d27d88b.1479991297.git.geliangtang@gmail.com>
Drop duplicate header delay.h from adf7242.c.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
drivers/net/ieee802154/adf7242.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/ieee802154/adf7242.c b/drivers/net/ieee802154/adf7242.c
index 4ff4c7d..3e4c8b2 100644
--- a/drivers/net/ieee802154/adf7242.c
+++ b/drivers/net/ieee802154/adf7242.c
@@ -20,7 +20,6 @@
#include <linux/skbuff.h>
#include <linux/of.h>
#include <linux/irq.h>
-#include <linux/delay.h>
#include <linux/debugfs.h>
#include <linux/bitops.h>
#include <linux/ieee802154.h>
--
2.9.3
* [PATCH] net/mlx5: drop duplicate header delay.h
From: Geliang Tang @ 2016-11-24 13:58 UTC (permalink / raw)
To: Saeed Mahameed, Matan Barak, Leon Romanovsky
Cc: Geliang Tang, netdev, linux-rdma, linux-kernel
In-Reply-To: <15299de49216a2360976ca37ff774cae9d27d88b.1479991297.git.geliangtang@gmail.com>
Drop duplicate header delay.h from mlx5/core/main.c.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index f28df33..d7a55eb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -46,7 +46,6 @@
#include <linux/mlx5/srq.h>
#include <linux/debugfs.h>
#include <linux/kmod.h>
-#include <linux/delay.h>
#include <linux/mlx5/mlx5_ifc.h>
#ifdef CONFIG_RFS_ACCEL
#include <linux/cpu_rmap.h>
--
2.9.3
* [PATCH iproute2 3/3] ifstat: Add "sw only" extended statistics to ifstat
From: Nogah Frankel @ 2016-11-24 14:12 UTC (permalink / raw)
To: netdev; +Cc: eladr, yotamg, jiri, idosch, ogerlitz, Nogah Frankel
In-Reply-To: <1479996760-61271-1-git-send-email-nogahf@mellanox.com>
Add support for extended statistics of the SW-only type, which count only
the packets that went via the CPU (useful for systems with forwarding
offload). They are read from filter type IFLA_STATS_LINK_OFFLOAD_XSTATS
and sub type IFLA_OFFLOAD_XSTATS_CPU_HIT.
The type is selected by the name 'software'
(or any prefix of it, such as 'soft' or simply 's').
For example:
ifstat -x s
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
misc/ifstat.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/misc/ifstat.c b/misc/ifstat.c
index 90aeeaa..7825a3a 100644
--- a/misc/ifstat.c
+++ b/misc/ifstat.c
@@ -675,7 +675,8 @@ static int verify_forging(int fd)
static void xstat_usage(void)
{
fprintf(stderr,
-"Usage: ifstat supported xstats:\n");
+"Usage: ifstat supported xstats:\n"
+" software SW stats. Counts only packets that went via the CPU\n");
}
@@ -691,6 +692,7 @@ struct extended_stats_options_t {
*/
static const struct extended_stats_options_t extended_stats_options[] = {
{"", IFLA_STATS_LINK_64, NO_SUB_TYPE},
+ {"software", IFLA_STATS_LINK_OFFLOAD_XSTATS, IFLA_OFFLOAD_XSTATS_CPU_HIT},
};
static bool get_filter_type(char *name)
--
2.4.3
* [PATCH iproute2 2/3] ifstat: Add extended statistics to ifstat
From: Nogah Frankel @ 2016-11-24 14:12 UTC (permalink / raw)
To: netdev; +Cc: eladr, yotamg, jiri, idosch, ogerlitz, Nogah Frankel
In-Reply-To: <1479996760-61271-1-git-send-email-nogahf@mellanox.com>
Add an extended stats option to ifstat. It supports stats that are at the
same nesting level as the "normal" stats or one level lower, as long as
they use the same struct type as the "normal" stats.
Each extension is unaware of data from other extensions and is
presented by itself.
An extension can be selected by its name or any prefix of it. If more
than one name matches, the first one is picked.
To get the extended stats the flag -x <stats type> is used.
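A minimal standalone sketch of the name lookup described above, not the
patch's code. The names match_xstat and xstat_table are hypothetical, the
"softirq" entry is made up to create an ambiguous prefix, and the ids are
placeholders rather than real IFLA_* values. It only illustrates why the
first entry that the requested string is a prefix of wins, and why the
empty default entry has to come first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table for illustration; ids are placeholders, not IFLA_* values. */
struct xstat_opt {
	const char *name;
	int id;
};

static const struct xstat_opt xstat_table[] = {
	{ "",         0 },	/* default stats: selected by an empty request */
	{ "software", 1 },
	{ "softirq",  2 },	/* made-up entry sharing the prefix "soft" */
};

/* Return the index of the first entry that `req` is a prefix of, or -1. */
static int match_xstat(const char *req)
{
	size_t len;
	size_t i;

	assert(req != NULL);
	len = strlen(req);
	for (i = 0; i < sizeof(xstat_table) / sizeof(xstat_table[0]); i++) {
		/* Comparing only strlen(req) bytes makes this a prefix test. */
		if (strncmp(req, xstat_table[i].name, len) == 0)
			return (int)i;
	}
	return -1;
}
```

With this table, an empty request selects the default entry, "s" and
"soft" both select "software" because it is listed first, and only a
longer string such as "softi" reaches the second entry.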
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
misc/ifstat.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 81 insertions(+), 7 deletions(-)
diff --git a/misc/ifstat.c b/misc/ifstat.c
index 25a8fc1..90aeeaa 100644
--- a/misc/ifstat.c
+++ b/misc/ifstat.c
@@ -49,11 +49,14 @@ int pretty;
double W;
char **patterns;
int npatterns;
+int filter_type;
+int sub_type;
char info_source[128];
int source_mismatch;
#define MAXS (sizeof(struct rtnl_link_stats64)/sizeof(__u64))
+#define NO_SUB_TYPE 0xffff
struct ifstat_ent {
struct ifstat_ent *next;
@@ -124,7 +127,7 @@ static int get_nlmsg(const struct sockaddr_nl *who,
return -1;
parse_rtattr(tb, IFLA_STATS_MAX, IFLA_STATS_RTA(ifsm), len);
- if (tb[IFLA_STATS_LINK_64] == NULL)
+ if (tb[filter_type] == NULL)
return 0;
n = malloc(sizeof(*n));
@@ -133,7 +136,17 @@ static int get_nlmsg(const struct sockaddr_nl *who,
n->ifindex = ifsm->ifindex;
n->name = strdup(ll_index_to_name(ifsm->ifindex));
- memcpy(&n->ival, RTA_DATA(tb[IFLA_STATS_LINK_64]), sizeof(n->ival));
+
+ if (sub_type == NO_SUB_TYPE) {
+ memcpy(&n->ival, RTA_DATA(tb[filter_type]), sizeof(n->ival));
+ } else {
+ struct rtattr *attr;
+
+ attr = parse_rtattr_one_nested(sub_type, tb[filter_type]);
+ if (attr == NULL)
+ return 0;
+ memcpy(&n->ival, RTA_DATA(attr), sizeof(n->ival));
+ }
memset(&n->rate, 0, sizeof(n->rate));
for (i = 0; i < MAXS; i++)
n->val[i] = n->ival[i];
@@ -152,7 +165,7 @@ static void load_info(void)
exit(1);
ll_init_map(&rth);
- filt_mask = IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64);
+ filt_mask = IFLA_STATS_FILTER_BIT(filter_type);
if (rtnl_wilddump_stats_req_filter(&rth, AF_UNSPEC, RTM_GETSTATS,
filt_mask) < 0) {
perror("Cannot send dump request");
@@ -659,6 +672,50 @@ static int verify_forging(int fd)
return -1;
}
+static void xstat_usage(void)
+{
+ fprintf(stderr,
+"Usage: ifstat supported xstats:\n");
+
+}
+
+struct extended_stats_options_t {
+ char *name;
+ int id;
+ int sub_type;
+};
+
+/* Note: if one xstat name is a prefix of another, it should come before it in
+ * this list. Therefore the default "" option must always be first.
+ * Name length must be under 64 chars.
+ */
+static const struct extended_stats_options_t extended_stats_options[] = {
+ {"", IFLA_STATS_LINK_64, NO_SUB_TYPE},
+};
+
+static bool get_filter_type(char *name)
+{
+ int name_len;
+ int i;
+
+ name_len = strlen(name);
+ for (i = 0; i < ARRAY_SIZE(extended_stats_options); i++) {
+ const struct extended_stats_options_t *xstat;
+
+ xstat = &extended_stats_options[i];
+ if (strncmp(name, xstat->name, name_len) == 0) {
+ filter_type = xstat->id;
+ sub_type = xstat->sub_type;
+ strcpy(name, xstat->name);
+ return true;
+ }
+ }
+
+ printf("invalid ifstat extension %s\n", name);
+ xstat_usage();
+ return false;
+}
+
static void usage(void) __attribute__((noreturn));
static void usage(void)
@@ -676,7 +733,8 @@ static void usage(void)
" -s, --noupdate don\'t update history\n"
" -t, --interval=SECS report average over the last SECS\n"
" -V, --version output version information\n"
-" -z, --zeros show entries with zero activity\n");
+" -z, --zeros show entries with zero activity\n"
+" -x, --extended=TYPE show extended stats of TYPE\n");
exit(-1);
}
@@ -694,18 +752,22 @@ static const struct option longopts[] = {
{ "interval", 1, 0, 't' },
{ "version", 0, 0, 'V' },
{ "zeros", 0, 0, 'z' },
+ { "extended", 1, 0, 'x'},
{ 0 }
};
+
int main(int argc, char *argv[])
{
char hist_name[128];
struct sockaddr_un sun;
FILE *hist_fp = NULL;
+ char stats_type[64];
int ch;
int fd;
- while ((ch = getopt_long(argc, argv, "hjpvVzrnasd:t:e",
+ memset(stats_type, 0, sizeof(stats_type));
+ while ((ch = getopt_long(argc, argv, "hjpvVzrnasd:t:ex:",
longopts, NULL)) != EOF) {
switch (ch) {
case 'z':
@@ -746,6 +808,9 @@ int main(int argc, char *argv[])
exit(-1);
}
break;
+ case 'x':
+ strncpy(stats_type, optarg, 63);
+ break;
case 'v':
case 'V':
printf("ifstat utility, iproute2-ss%s\n", SNAPSHOT);
@@ -760,6 +825,9 @@ int main(int argc, char *argv[])
argc -= optind;
argv += optind;
+ if (!get_filter_type(stats_type))
+ exit(-1);
+
sun.sun_family = AF_UNIX;
sun.sun_path[0] = 0;
sprintf(sun.sun_path+1, "ifstat%d", getuid());
@@ -798,8 +866,14 @@ int main(int argc, char *argv[])
snprintf(hist_name, sizeof(hist_name),
"%s", getenv("IFSTAT_HISTORY"));
else
- snprintf(hist_name, sizeof(hist_name),
- "%s/.ifstat.u%d", P_tmpdir, getuid());
+
+ if (strlen(stats_type) == 0)
+ snprintf(hist_name, sizeof(hist_name),
+ "%s/.ifstat.u%d", P_tmpdir, getuid());
+ else
+ snprintf(hist_name, sizeof(hist_name),
+ "%s/.%s_ifstat.u%d", P_tmpdir, stats_type,
+ getuid());
if (reset_history)
unlink(hist_name);
--
2.4.3
* [PATCH iproute2 1/3] ifstat: Change interface to get stats
From: Nogah Frankel @ 2016-11-24 14:12 UTC (permalink / raw)
To: netdev; +Cc: eladr, yotamg, jiri, idosch, ogerlitz, Nogah Frankel
In-Reply-To: <1479996760-61271-1-git-send-email-nogahf@mellanox.com>
ifstat used to get its data from the kernel with RTM_GETLINK.
Change the interface used to get this data to RTM_GETSTATS, which supports
more stats types besides the default one. Also change the default stats to
be 64-bit based.
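A small sketch of why the counter width matters when ifstat computes
increments between two samples. The helper names are hypothetical and not
taken from ifstat.c; the 10 Gbit/s rate in the note below is only an
assumed example figure.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative helpers (hypothetical names, not from ifstat.c).
 * Unsigned subtraction yields the correct increment across a single wrap
 * of a counter that has the same width as the arithmetic.
 */
static uint32_t delta32(uint32_t prev, uint32_t cur)
{
	return cur - prev;	/* modulo 2^32 */
}

static uint64_t delta64(uint64_t prev, uint64_t cur)
{
	return cur - prev;	/* modulo 2^64 */
}

/* Seconds until a byte counter of `bits` width wraps at `bytes_per_sec`. */
static double seconds_to_wrap(unsigned int bits, double bytes_per_sec)
{
	assert(bits == 32 || bits == 64);
	assert(bytes_per_sec > 0.0);
	return (bits == 32 ? 4294967296.0 : 18446744073709551616.0) /
	       bytes_per_sec;
}
```

At an assumed 10 Gbit/s (1.25e9 bytes per second), seconds_to_wrap(32, 1.25e9)
is roughly 3.4 seconds, so a 32-bit byte counter can wrap more than once
between samples and the increment becomes ambiguous. A 64-bit counter does
not have this problem in practice.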
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
misc/ifstat.c | 41 ++++++++++++++++++++++-------------------
1 file changed, 22 insertions(+), 19 deletions(-)
diff --git a/misc/ifstat.c b/misc/ifstat.c
index d551973..25a8fc1 100644
--- a/misc/ifstat.c
+++ b/misc/ifstat.c
@@ -35,6 +35,7 @@
#include <SNAPSHOT.h>
+#include "utils.h"
int dump_zeros;
int reset_history;
int ignore_history;
@@ -52,15 +53,15 @@ int npatterns;
char info_source[128];
int source_mismatch;
-#define MAXS (sizeof(struct rtnl_link_stats)/sizeof(__u32))
+#define MAXS (sizeof(struct rtnl_link_stats64)/sizeof(__u64))
struct ifstat_ent {
struct ifstat_ent *next;
char *name;
int ifindex;
- unsigned long long val[MAXS];
+ __u64 val[MAXS];
double rate[MAXS];
- __u32 ival[MAXS];
+ __u64 ival[MAXS];
};
static const char *stats[MAXS] = {
@@ -109,32 +110,30 @@ static int match(const char *id)
static int get_nlmsg(const struct sockaddr_nl *who,
struct nlmsghdr *m, void *arg)
{
- struct ifinfomsg *ifi = NLMSG_DATA(m);
- struct rtattr *tb[IFLA_MAX+1];
+ struct if_stats_msg *ifsm = NLMSG_DATA(m);
+ struct rtattr *tb[IFLA_STATS_MAX+1];
int len = m->nlmsg_len;
struct ifstat_ent *n;
int i;
- if (m->nlmsg_type != RTM_NEWLINK)
+ if (m->nlmsg_type != RTM_NEWSTATS)
return 0;
- len -= NLMSG_LENGTH(sizeof(*ifi));
+ len -= NLMSG_LENGTH(sizeof(*ifsm));
if (len < 0)
return -1;
- if (!(ifi->ifi_flags&IFF_UP))
- return 0;
-
- parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
- if (tb[IFLA_IFNAME] == NULL || tb[IFLA_STATS] == NULL)
+ parse_rtattr(tb, IFLA_STATS_MAX, IFLA_STATS_RTA(ifsm), len);
+ if (tb[IFLA_STATS_LINK_64] == NULL)
return 0;
n = malloc(sizeof(*n));
if (!n)
abort();
- n->ifindex = ifi->ifi_index;
- n->name = strdup(RTA_DATA(tb[IFLA_IFNAME]));
- memcpy(&n->ival, RTA_DATA(tb[IFLA_STATS]), sizeof(n->ival));
+
+ n->ifindex = ifsm->ifindex;
+ n->name = strdup(ll_index_to_name(ifsm->ifindex));
+ memcpy(&n->ival, RTA_DATA(tb[IFLA_STATS_LINK_64]), sizeof(n->ival));
memset(&n->rate, 0, sizeof(n->rate));
for (i = 0; i < MAXS; i++)
n->val[i] = n->ival[i];
@@ -147,11 +146,15 @@ static void load_info(void)
{
struct ifstat_ent *db, *n;
struct rtnl_handle rth;
+ __u32 filt_mask;
if (rtnl_open(&rth, 0) < 0)
exit(1);
- if (rtnl_wilddump_request(&rth, AF_INET, RTM_GETLINK) < 0) {
+ ll_init_map(&rth);
+ filt_mask = IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64);
+ if (rtnl_wilddump_stats_req_filter(&rth, AF_UNSPEC, RTM_GETSTATS,
+ filt_mask) < 0) {
perror("Cannot send dump request");
exit(1);
}
@@ -216,7 +219,7 @@ static void load_raw_table(FILE *fp)
*next++ = 0;
if (sscanf(p, "%llu", n->val+i) != 1)
abort();
- n->ival[i] = (__u32)n->val[i];
+ n->ival[i] = (__u64)n->val[i];
p = next;
if (!(next = strchr(p, ' ')))
abort();
@@ -546,14 +549,14 @@ static void update_db(int interval)
int i;
for (i = 0; i < MAXS; i++) {
- if ((long)(h1->ival[i] - n->ival[i]) < 0) {
+ if ((long long)(h1->ival[i] - n->ival[i]) < 0) {
memset(n->ival, 0, sizeof(n->ival));
break;
}
}
for (i = 0; i < MAXS; i++) {
double sample;
- unsigned long incr = h1->ival[i] - n->ival[i];
+ unsigned long long incr = h1->ival[i] - n->ival[i];
n->val[i] += incr;
n->ival[i] = h1->ival[i];
--
2.4.3
* [PATCH iproute2 0/3] update ifstat for new stats
From: Nogah Frankel @ 2016-11-24 14:12 UTC (permalink / raw)
To: netdev; +Cc: eladr, yotamg, jiri, idosch, ogerlitz, Nogah Frankel
Previously, stats were fetched with RTM_GETLINK, which returns 32-bit
statistics and supports only one type of stats.
Lately, a new method to get stats was added: RTM_GETSTATS. It supports
choosing the stats type, and the basic stats were changed from 32-bit
to 64-bit.
This patchset changes ifstat to use the new method, adds the ability to
choose an extended type of statistics, and adds the extended type of SW
stats for packets that hit the CPU.
Nogah Frankel (3):
ifstat: Change interface to get stats
ifstat: Add extended statistics to ifstat
ifstat: Add "sw only" extended statistics to ifstat
misc/ifstat.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 102 insertions(+), 23 deletions(-)
--
2.4.3
* Re: [PATCH] net/mlx5: drop duplicate header delay.h
From: Matan Barak @ 2016-11-24 14:12 UTC (permalink / raw)
To: Geliang Tang, Saeed Mahameed, Leon Romanovsky
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <03d5a2a0f03458cdb4f2b139cfabc11b5c334f95.1479990943.git.geliangtang-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On 24/11/2016 15:58, Geliang Tang wrote:
> Drop duplicate header delay.h from mlx5/core/main.c.
>
> Signed-off-by: Geliang Tang <geliangtang-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index f28df33..d7a55eb 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -46,7 +46,6 @@
> #include <linux/mlx5/srq.h>
> #include <linux/debugfs.h>
> #include <linux/kmod.h>
> -#include <linux/delay.h>
> #include <linux/mlx5/mlx5_ifc.h>
> #ifdef CONFIG_RFS_ACCEL
> #include <linux/cpu_rmap.h>
>
Thanks.
Acked-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [PATCH 12/20 v2] net/iucv: Convert to hotplug state machine
From: Sebastian Andrzej Siewior @ 2016-11-24 14:14 UTC (permalink / raw)
To: Ursula Braun; +Cc: linux-kernel, rt, David S. Miller, linux-s390, netdev
In-Reply-To: <20161124091046.hixy3j4ibt7xzezr@linutronix.de>
Install the callbacks via the state machine and let the core invoke the
callbacks on the already online CPUs. The smp function calls in the
online/downprep callbacks are not required as the callback is guaranteed to
be invoked on the upcoming/outgoing cpu.
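The down-prepare callback in this patch refuses to take the last
IUCV-enabled CPU offline. The following is a userspace model of that one
check only, assuming at most 64 CPUs: a 64-bit word stands in for the
kernel's cpumask_t, and down_prep_check is a hypothetical name, not a
kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Userspace model only: `buffer_mask` stands in for iucv_buffer_cpumask,
 * one bit per CPU. down_prep_check() is a hypothetical name.
 */
static int down_prep_check(uint64_t buffer_mask, unsigned int cpu)
{
	uint64_t remaining;

	assert(cpu < 64);
	/* Work on a copy with the outgoing CPU cleared. */
	remaining = buffer_mask & ~(UINT64_C(1) << cpu);
	if (remaining == 0)
		return -EINVAL;	/* no IUCV-enabled CPU would remain */
	return 0;
}
```

Because the hotplug core runs the callback on the outgoing CPU itself, the
real callback can then call iucv_retrieve_cpu() directly instead of going
through smp_call_function_single().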
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-s390@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2: Use explicit labels for clean up in iucv_init() as suggested by
Ursula.
include/linux/cpuhotplug.h | 1
net/iucv/iucv.c | 122 ++++++++++++++++-----------------------------
2 files changed, 47 insertions(+), 76 deletions(-)
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -63,6 +63,7 @@ enum cpuhp_state {
CPUHP_X86_THERM_PREPARE,
CPUHP_X86_CPUID_PREPARE,
CPUHP_X86_MSR_PREPARE,
+ CPUHP_NET_IUCV_PREPARE,
CPUHP_TIMERS_DEAD,
CPUHP_NOTF_ERR_INJ_PREPARE,
CPUHP_MIPS_SOC_PREPARE,
--- a/net/iucv/iucv.c
+++ b/net/iucv/iucv.c
@@ -639,7 +639,7 @@ static void iucv_disable(void)
put_online_cpus();
}
-static void free_iucv_data(int cpu)
+static int iucv_cpu_dead(unsigned int cpu)
{
kfree(iucv_param_irq[cpu]);
iucv_param_irq[cpu] = NULL;
@@ -647,9 +647,10 @@ static void free_iucv_data(int cpu)
iucv_param[cpu] = NULL;
kfree(iucv_irq_data[cpu]);
iucv_irq_data[cpu] = NULL;
+ return 0;
}
-static int alloc_iucv_data(int cpu)
+static int iucv_cpu_prepare(unsigned int cpu)
{
/* Note: GFP_DMA used to get memory below 2G */
iucv_irq_data[cpu] = kmalloc_node(sizeof(struct iucv_irq_data),
@@ -671,58 +672,38 @@ static int alloc_iucv_data(int cpu)
return 0;
out_free:
- free_iucv_data(cpu);
+ iucv_cpu_dead(cpu);
return -ENOMEM;
}
-static int iucv_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
+static int iucv_cpu_online(unsigned int cpu)
+{
+ if (!iucv_path_table)
+ return 0;
+ iucv_declare_cpu(NULL);
+ return 0;
+}
+
+static int iucv_cpu_down_prep(unsigned int cpu)
{
cpumask_t cpumask;
- long cpu = (long) hcpu;
- switch (action) {
- case CPU_UP_PREPARE:
- case CPU_UP_PREPARE_FROZEN:
- if (alloc_iucv_data(cpu))
- return notifier_from_errno(-ENOMEM);
- break;
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- free_iucv_data(cpu);
- break;
- case CPU_ONLINE:
- case CPU_ONLINE_FROZEN:
- case CPU_DOWN_FAILED:
- case CPU_DOWN_FAILED_FROZEN:
- if (!iucv_path_table)
- break;
- smp_call_function_single(cpu, iucv_declare_cpu, NULL, 1);
- break;
- case CPU_DOWN_PREPARE:
- case CPU_DOWN_PREPARE_FROZEN:
- if (!iucv_path_table)
- break;
- cpumask_copy(&cpumask, &iucv_buffer_cpumask);
- cpumask_clear_cpu(cpu, &cpumask);
- if (cpumask_empty(&cpumask))
- /* Can't offline last IUCV enabled cpu. */
- return notifier_from_errno(-EINVAL);
- smp_call_function_single(cpu, iucv_retrieve_cpu, NULL, 1);
- if (cpumask_empty(&iucv_irq_cpumask))
- smp_call_function_single(
- cpumask_first(&iucv_buffer_cpumask),
- iucv_allow_cpu, NULL, 1);
- break;
- }
- return NOTIFY_OK;
-}
+ if (!iucv_path_table)
+ return 0;
-static struct notifier_block __refdata iucv_cpu_notifier = {
- .notifier_call = iucv_cpu_notify,
-};
+ cpumask_copy(&cpumask, &iucv_buffer_cpumask);
+ cpumask_clear_cpu(cpu, &cpumask);
+ if (cpumask_empty(&cpumask))
+ /* Can't offline last IUCV enabled cpu. */
+ return -EINVAL;
+
+ iucv_retrieve_cpu(NULL);
+ if (!cpumask_empty(&iucv_irq_cpumask))
+ return 0;
+ smp_call_function_single(cpumask_first(&iucv_buffer_cpumask),
+ iucv_allow_cpu, NULL, 1);
+ return 0;
+}
/**
* iucv_sever_pathid
@@ -2027,6 +2008,7 @@ struct iucv_interface iucv_if = {
};
EXPORT_SYMBOL(iucv_if);
+static enum cpuhp_state iucv_online;
/**
* iucv_init
*
@@ -2035,7 +2017,6 @@ EXPORT_SYMBOL(iucv_if);
static int __init iucv_init(void)
{
int rc;
- int cpu;
if (!MACHINE_IS_VM) {
rc = -EPROTONOSUPPORT;
@@ -2054,23 +2035,19 @@ static int __init iucv_init(void)
goto out_int;
}
- cpu_notifier_register_begin();
-
- for_each_online_cpu(cpu) {
- if (alloc_iucv_data(cpu)) {
- rc = -ENOMEM;
- goto out_free;
- }
- }
- rc = __register_hotcpu_notifier(&iucv_cpu_notifier);
+ rc = cpuhp_setup_state(CPUHP_NET_IUCV_PREPARE, "net/iucv:prepare",
+ iucv_cpu_prepare, iucv_cpu_dead);
if (rc)
- goto out_free;
-
- cpu_notifier_register_done();
+ goto out_dev;
+ rc = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "net/iucv:online",
+ iucv_cpu_online, iucv_cpu_down_prep);
+ if (rc < 0)
+ goto out_prep;
+ iucv_online = rc;
rc = register_reboot_notifier(&iucv_reboot_notifier);
if (rc)
- goto out_cpu;
+ goto out_remove_hp;
ASCEBC(iucv_error_no_listener, 16);
ASCEBC(iucv_error_no_memory, 16);
ASCEBC(iucv_error_pathid, 16);
@@ -2084,15 +2061,11 @@ static int __init iucv_init(void)
out_reboot:
unregister_reboot_notifier(&iucv_reboot_notifier);
-out_cpu:
- cpu_notifier_register_begin();
- __unregister_hotcpu_notifier(&iucv_cpu_notifier);
-out_free:
- for_each_possible_cpu(cpu)
- free_iucv_data(cpu);
-
- cpu_notifier_register_done();
-
+out_remove_hp:
+ cpuhp_remove_state(iucv_online);
+out_prep:
+ cpuhp_remove_state(CPUHP_NET_IUCV_PREPARE);
+out_dev:
root_device_unregister(iucv_root);
out_int:
unregister_external_irq(EXT_IRQ_IUCV, iucv_external_interrupt);
@@ -2110,7 +2083,6 @@ static int __init iucv_init(void)
static void __exit iucv_exit(void)
{
struct iucv_irq_list *p, *n;
- int cpu;
spin_lock_irq(&iucv_queue_lock);
list_for_each_entry_safe(p, n, &iucv_task_queue, list)
@@ -2119,11 +2091,9 @@ static void __exit iucv_exit(void)
kfree(p);
spin_unlock_irq(&iucv_queue_lock);
unregister_reboot_notifier(&iucv_reboot_notifier);
- cpu_notifier_register_begin();
- __unregister_hotcpu_notifier(&iucv_cpu_notifier);
- for_each_possible_cpu(cpu)
- free_iucv_data(cpu);
- cpu_notifier_register_done();
+
+ cpuhp_remove_state_nocalls(iucv_online);
+ cpuhp_remove_state(CPUHP_NET_IUCV_PREPARE);
root_device_unregister(iucv_root);
bus_unregister(&iucv_bus);
unregister_external_irq(EXT_IRQ_IUCV, iucv_external_interrupt);
* Re: [PATCH] net: ieee802154: drop duplicate header delay.h
From: Stefan Schmidt @ 2016-11-24 14:32 UTC (permalink / raw)
To: Geliang Tang, Michael Hennerich, Alexander Aring
Cc: linux-wpan, netdev, linux-kernel
In-Reply-To: <f64943b9c1da12b6199ba745ba04cb477a05a5a3.1479991128.git.geliangtang@gmail.com>
Hello.
On 24/11/16 14:58, Geliang Tang wrote:
> Drop duplicate header delay.h from adf7242.c.
>
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>
> ---
> drivers/net/ieee802154/adf7242.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/ieee802154/adf7242.c b/drivers/net/ieee802154/adf7242.c
> index 4ff4c7d..3e4c8b2 100644
> --- a/drivers/net/ieee802154/adf7242.c
> +++ b/drivers/net/ieee802154/adf7242.c
> @@ -20,7 +20,6 @@
> #include <linux/skbuff.h>
> #include <linux/of.h>
> #include <linux/irq.h>
> -#include <linux/delay.h>
> #include <linux/debugfs.h>
> #include <linux/bitops.h>
> #include <linux/ieee802154.h>
>
Good catch.
Acked-by: Stefan Schmidt <stefan@osg.samsung.com>
regards
Stefan Schmidt
* [net-next PATCH v1 0/2] stmmac: dwmac-meson8b: configurable RGMII TX delay
From: Martin Blumenstingl @ 2016-11-24 14:34 UTC (permalink / raw)
To: linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
devicetree-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
davem-fT/PcQaiUtIeIZ0/mPfg9Q, khilman-rdvid1DuHRBWk0Htik3J/w,
mark.rutland-5wv7dgnIgG8, robh+dt-DgEjT+Ai2ygdnm+yROfE0A
Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
alexandre.torgue-qxv4g6HH51o, peppe.cavallaro-qxv4g6HH51o,
carlo-KA+7E9HrN00dnm+yROfE0A, jbrunet-rdvid1DuHRBWk0Htik3J/w,
Martin Blumenstingl
Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4
cycle TX clock delay. This seems to work fine for many boards (for
example Odroid-C2 or Amlogic's reference boards) but there are some
others where TX traffic is simply broken.
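For a sense of scale, here is a rough sketch of what a 1/4 cycle delay
amounts to, assuming the delay is a fraction of the 125 MHz RGMII transmit
clock used at gigabit speed. The helper cycle_fraction_ps is hypothetical
and not part of the driver.

```c
#include <assert.h>

/*
 * Delay in picoseconds for a fraction (num/den) of one clock cycle.
 * Hypothetical helper for illustration; not part of the driver.
 */
static unsigned long cycle_fraction_ps(unsigned long clk_hz,
				       unsigned int num, unsigned int den)
{
	unsigned long long period_ps;

	assert(clk_hz > 0 && den > 0);
	period_ps = 1000000000000ULL / clk_hz;	/* one clock period in ps */
	return (unsigned long)(period_ps * num / den);
}
```

Under that assumption, cycle_fraction_ps(125000000UL, 1, 4) gives 2000 ps,
i.e. a 2 ns delay. If the PHY adds a delay of its own on top of that, the
clock edge can end up shifted by more than the link tolerates.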
There are probably multiple reasons why it's working on some boards
while it's broken on others:
- some of Amlogic's reference boards are using a Micrel PHY
- hardware circuit design
- maybe more...
This raises a question though:
Which device is supposed to enable the TX delay when both MAC and PHY
support it? And should we implement it for each PHY / MAC separately
or should we think about a more generic solution (currently it's not
possible to disable the TX delay generated by the RTL8211F PHY via
devicetree when using phy-mode "rgmii")?
iperf3 results on my Mecool BB2 board (Meson GXM, RTL8211F PHY) with
TX clock delay disabled on the MAC (as it's enabled in the PHY driver).
TX throughput was virtually zero before:
$ iperf3 -c 192.168.1.100 -R
Connecting to host 192.168.1.100, port 5201
Reverse mode, remote host 192.168.1.100 is sending
[ 4] local 192.168.1.206 port 52828 connected to 192.168.1.100 port 5201
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 108 MBytes 901 Mbits/sec
[ 4] 1.00-2.00 sec 94.2 MBytes 791 Mbits/sec
[ 4] 2.00-3.00 sec 96.5 MBytes 810 Mbits/sec
[ 4] 3.00-4.00 sec 96.2 MBytes 808 Mbits/sec
[ 4] 4.00-5.00 sec 96.6 MBytes 810 Mbits/sec
[ 4] 5.00-6.00 sec 96.5 MBytes 810 Mbits/sec
[ 4] 6.00-7.00 sec 96.6 MBytes 810 Mbits/sec
[ 4] 7.00-8.00 sec 96.5 MBytes 809 Mbits/sec
[ 4] 8.00-9.00 sec 105 MBytes 884 Mbits/sec
[ 4] 9.00-10.00 sec 111 MBytes 934 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1000 MBytes 839 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 998 MBytes 837 Mbits/sec receiver
iperf Done.
$ iperf3 -c 192.168.1.100
Connecting to host 192.168.1.100, port 5201
[ 4] local 192.168.1.206 port 52832 connected to 192.168.1.100 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.01 sec 99.5 MBytes 829 Mbits/sec 117 139 KBytes
[ 4] 1.01-2.00 sec 105 MBytes 884 Mbits/sec 129 70.7 KBytes
[ 4] 2.00-3.01 sec 107 MBytes 889 Mbits/sec 106 187 KBytes
[ 4] 3.01-4.01 sec 105 MBytes 878 Mbits/sec 92 143 KBytes
[ 4] 4.01-5.00 sec 105 MBytes 882 Mbits/sec 140 129 KBytes
[ 4] 5.00-6.01 sec 106 MBytes 883 Mbits/sec 115 195 KBytes
[ 4] 6.01-7.00 sec 102 MBytes 863 Mbits/sec 133 70.7 KBytes
[ 4] 7.00-8.01 sec 106 MBytes 884 Mbits/sec 143 97.6 KBytes
[ 4] 8.01-9.01 sec 104 MBytes 875 Mbits/sec 124 107 KBytes
[ 4] 9.01-10.01 sec 105 MBytes 876 Mbits/sec 90 139 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.01 sec 1.02 GBytes 874 Mbits/sec 1189 sender
[ 4] 0.00-10.01 sec 1.02 GBytes 873 Mbits/sec receiver
iperf Done.
Martin Blumenstingl (2):
net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
net: stmmac: dwmac-meson8b: make the RGMII TX delay configurable
Documentation/devicetree/bindings/net/meson-dwmac.txt | 11 +++++++++++
drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c | 16 +++++++++++-----
include/dt-bindings/net/dwmac-meson8b.h | 18 ++++++++++++++++++
3 files changed, 40 insertions(+), 5 deletions(-)
create mode 100644 include/dt-bindings/net/dwmac-meson8b.h
--
2.10.2
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [net-next PATCH v1 1/2] net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
From: Martin Blumenstingl @ 2016-11-24 14:34 UTC (permalink / raw)
To: linux-amlogic, devicetree, netdev, davem, khilman, mark.rutland,
robh+dt
Cc: linux-arm-kernel, alexandre.torgue, peppe.cavallaro, carlo,
jbrunet, Martin Blumenstingl
In-Reply-To: <20161124143417.10178-1-martin.blumenstingl@googlemail.com>
This allows configuring the RGMII TX clock delay. This clock is
generated by the Meson 8b / GXBB DWMAC glue. The configuration depends
on the actual hardware (no delay may be needed due to the design of the
actual circuit, the PHY might add this delay, etc.).
The configuration values are provided as preprocessor macros to make the
devicetree files easier to read.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
---
Documentation/devicetree/bindings/net/meson-dwmac.txt | 11 +++++++++++
include/dt-bindings/net/dwmac-meson8b.h | 18 ++++++++++++++++++
2 files changed, 29 insertions(+)
create mode 100644 include/dt-bindings/net/dwmac-meson8b.h
diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt b/Documentation/devicetree/bindings/net/meson-dwmac.txt
index 89e62dd..fe526d0 100644
--- a/Documentation/devicetree/bindings/net/meson-dwmac.txt
+++ b/Documentation/devicetree/bindings/net/meson-dwmac.txt
@@ -25,6 +25,17 @@ Required properties on Meson8b and newer:
- "clkin0" - first parent clock of the internal mux
- "clkin1" - second parent clock of the internal mux
+Optional properties on Meson8b and newer:
+- amlogic,tx-delay: The internal RGMII TX clock delay configuration.
+ Defaults to DWMAC_MESON8B_TXDLY_QUARTER_CYCLE
+ when not given. All possible values are defined
+ as preprocessor macros in
+ <dt-bindings/net/dwmac-meson8b.h>.
+ The delay is specified as a fraction of the
+ internal clock cycle. RGMII typically uses a
+ 125MHz clock (= 8ns per cycle), so setting
+ DWMAC_MESON8B_TXDLY_QUARTER_CYCLE
+ results in a TX delay of 8ns/4 = 2ns.
Example for Meson6:
diff --git a/include/dt-bindings/net/dwmac-meson8b.h b/include/dt-bindings/net/dwmac-meson8b.h
new file mode 100644
index 0000000..4fc149e
--- /dev/null
+++ b/include/dt-bindings/net/dwmac-meson8b.h
@@ -0,0 +1,18 @@
+/*
+ * Devicetree constants for the Amlogic Meson8b and GXBB DWMAC glue layer
+ *
+ * Copyright (C) 2016 Martin Blumenstingl <martin.blumenstingl@googlemail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* TX delay configuration */
+#define DWMAC_MESON8B_TXDLY_OFF 0x0
+#define DWMAC_MESON8B_TXDLY_QUARTER_CYCLE 0x1
+#define DWMAC_MESON8B_TXDLY_HALF_CYCLE 0x2
+#define DWMAC_MESON8B_TXDLY_THREE_QUARTER_CYCLE 0x3
--
2.10.2
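The 8ns/4 = 2ns arithmetic from the binding text can be checked with a
small userspace sketch. The helper name meson8b_txdly_ps() and the use of
picoseconds are assumptions made for illustration only; just the
DWMAC_MESON8B_TXDLY_* values are taken from the header in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from include/dt-bindings/net/dwmac-meson8b.h in this patch. */
#define DWMAC_MESON8B_TXDLY_OFF                 0x0
#define DWMAC_MESON8B_TXDLY_QUARTER_CYCLE       0x1
#define DWMAC_MESON8B_TXDLY_HALF_CYCLE          0x2
#define DWMAC_MESON8B_TXDLY_THREE_QUARTER_CYCLE 0x3

/*
 * Hypothetical helper (not part of the patch): convert a TX delay setting
 * into picoseconds for a given RGMII clock rate in Hz.
 */
static uint64_t meson8b_txdly_ps(uint32_t setting, uint64_t clk_hz)
{
	uint64_t cycle_ps;

	assert(clk_hz > 0);
	cycle_ps = 1000000000000ULL / clk_hz; /* 8000 ps at 125 MHz */

	/* Each step of the setting adds a quarter of one clock cycle. */
	return cycle_ps * setting / 4;
}
```

With a 125MHz clock this gives 2000 ps for the quarter-cycle setting and
0 ps for DWMAC_MESON8B_TXDLY_OFF, which is the value a board needs when
the PHY already adds the delay.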
^ permalink raw reply related
* [net-next PATCH v1 2/2] net: stmmac: dwmac-meson8b: make the RGMII TX delay configurable
From: Martin Blumenstingl @ 2016-11-24 14:34 UTC (permalink / raw)
To: linux-amlogic, devicetree, netdev, davem, khilman, mark.rutland,
robh+dt
Cc: linux-arm-kernel, alexandre.torgue, peppe.cavallaro, carlo,
jbrunet, Martin Blumenstingl
In-Reply-To: <20161124143417.10178-1-martin.blumenstingl@googlemail.com>
Prior to this patch we were using a hardcoded RGMII TX clock delay of
1/4 cycle (= 2ns). This value works for many boards, but unfortunately
not for all (due to the way the actual circuit is designed, sometimes
because the TX delay is enabled in the PHY, etc.).
Making the TX delay on the MAC side configurable allows us to support
all possible hardware combinations (which may or may not be out there).
This allows fixing a compatibility issue on some boards, where the
RTL8211F PHY is configured to generate the TX delay. We can now turn
off the TX delay in the MAC, because otherwise we would be applying the
delay twice (which results in non-working TX traffic).
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
---
drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index 250e4ce..1697d1a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -23,6 +23,8 @@
#include <linux/platform_device.h>
#include <linux/stmmac.h>
+#include <dt-bindings/net/dwmac-meson8b.h>
+
#include "stmmac_platform.h"
#define PRG_ETH0 0x0
@@ -35,10 +37,6 @@
#define PRG_ETH0_TXDLY_SHIFT 5
#define PRG_ETH0_TXDLY_MASK GENMASK(6, 5)
-#define PRG_ETH0_TXDLY_OFF (0x0 << PRG_ETH0_TXDLY_SHIFT)
-#define PRG_ETH0_TXDLY_QUARTER (0x1 << PRG_ETH0_TXDLY_SHIFT)
-#define PRG_ETH0_TXDLY_HALF (0x2 << PRG_ETH0_TXDLY_SHIFT)
-#define PRG_ETH0_TXDLY_THREE_QUARTERS (0x3 << PRG_ETH0_TXDLY_SHIFT)
/* divider for the result of m250_sel */
#define PRG_ETH0_CLK_M250_DIV_SHIFT 7
@@ -69,6 +67,8 @@ struct meson8b_dwmac {
struct clk_divider m25_div;
struct clk *m25_div_clk;
+
+ u32 tx_dly;
};
static void meson8b_dwmac_mask_bits(struct meson8b_dwmac *dwmac, u32 reg,
@@ -198,7 +198,7 @@ static int meson8b_init_prg_eth(struct meson8b_dwmac *dwmac)
/* TX clock delay - all known boards use a 1/4 cycle delay */
meson8b_dwmac_mask_bits(dwmac, PRG_ETH0, PRG_ETH0_TXDLY_MASK,
- PRG_ETH0_TXDLY_QUARTER);
+ dwmac->tx_dly << PRG_ETH0_TXDLY_SHIFT);
break;
case PHY_INTERFACE_MODE_RMII:
@@ -279,6 +279,12 @@ static int meson8b_dwmac_probe(struct platform_device *pdev)
return -EINVAL;
}
+ ret = of_property_read_u32(pdev->dev.of_node, "amlogic,tx-delay",
+ &dwmac->tx_dly);
+ if (ret)
+ /* default to 1/4 cycle (= 2ns for RGMII) */
+ dwmac->tx_dly = DWMAC_MESON8B_TXDLY_QUARTER_CYCLE;
+
ret = meson8b_init_clk(dwmac);
if (ret)
return ret;
--
2.10.2
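As a rough illustration of how the devicetree value ends up in PRG_ETH0,
here is a userspace model of the field update. It assumes that
meson8b_dwmac_mask_bits() is a conventional read-modify-write helper;
prg_eth0_set_txdly() is a hypothetical name and does not exist in the
driver.

```c
#include <assert.h>
#include <stdint.h>

#define PRG_ETH0_TXDLY_SHIFT 5
#define PRG_ETH0_TXDLY_MASK  (0x3u << PRG_ETH0_TXDLY_SHIFT) /* GENMASK(6, 5) */

/*
 * Hypothetical model of the register update: clear the TXDLY field, then
 * insert the new setting. All other bits of the register are preserved.
 */
static uint32_t prg_eth0_set_txdly(uint32_t reg, uint32_t tx_dly)
{
	/* The field is two bits wide, so only 0..3 can be represented. */
	assert(tx_dly <= 3);

	reg &= ~PRG_ETH0_TXDLY_MASK;
	reg |= (tx_dly << PRG_ETH0_TXDLY_SHIFT) & PRG_ETH0_TXDLY_MASK;
	return reg;
}
```

Note that the patch as posted stores the "amlogic,tx-delay" value without
a range check; the assert() above only marks where such a check could go
in this sketch.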
^ permalink raw reply related
* [PATCH] net: stmmac: enable tx queue 0 for gmac4 IPs synthesized with multiple TX queues
From: Niklas Cassel @ 2016-11-24 14:36 UTC (permalink / raw)
To: Giuseppe Cavallaro, Alexandre Torgue; +Cc: Niklas Cassel, netdev, linux-kernel
From: Niklas Cassel <niklas.cassel@axis.com>
The dwmac4 IP can be synthesized with 1 to 8 tx queues.
On an IP synthesized with DWC_EQOS_NUM_TXQ > 1, all txqueues are disabled
by default. For these IPs, the bitfield TXQEN is R/W.
Always enable tx queue 0. The write will have no effect on IPs synthesized
with DWC_EQOS_NUM_TXQ == 1.
The driver still does not utilize more than one tx queue in the IP.
Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
---
drivers/net/ethernet/stmicro/stmmac/dwmac4.h | 3 +++
drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 12 +++++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
index 6f4f5ce25114..3e8d4fefa5e0 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
@@ -155,8 +155,11 @@ enum power_event {
#define MTL_CHAN_RX_DEBUG(x) (MTL_CHANX_BASE_ADDR(x) + 0x38)
#define MTL_OP_MODE_RSF BIT(5)
+#define MTL_OP_MODE_TXQEN BIT(3)
#define MTL_OP_MODE_TSF BIT(1)
+#define MTL_OP_MODE_TQS_MASK GENMASK(24, 16)
+
#define MTL_OP_MODE_TTC_MASK 0x70
#define MTL_OP_MODE_TTC_SHIFT 4
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
index 116151cd6a95..577316de6ba8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -213,7 +213,17 @@ static void dwmac4_dma_chan_op_mode(void __iomem *ioaddr, int txmode,
else
mtl_tx_op |= MTL_OP_MODE_TTC_512;
}
-
+ /* For an IP with DWC_EQOS_NUM_TXQ == 1, the fields TXQEN and TQS are RO
+ * with reset values: TXQEN on, TQS == DWC_EQOS_TXFIFO_SIZE.
+ * For an IP with DWC_EQOS_NUM_TXQ > 1, the fields TXQEN and TQS are R/W
+ * with reset values: TXQEN off, TQS 256 bytes.
+ *
+ * Write the bits in both cases, since it will have no effect when RO.
+ * For DWC_EQOS_NUM_TXQ > 1, the top bits in MTL_OP_MODE_TQS_MASK might
+ * be RO, however, writing the whole TQS field will result in a value
+ * equal to DWC_EQOS_TXFIFO_SIZE, just like for DWC_EQOS_NUM_TXQ == 1.
+ */
+ mtl_tx_op |= MTL_OP_MODE_TXQEN | MTL_OP_MODE_TQS_MASK;
writel(mtl_tx_op, ioaddr + MTL_CHAN_TX_OP_MODE(channel));
mtl_rx_op = readl(ioaddr + MTL_CHAN_RX_OP_MODE(channel));
--
2.1.4
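The "write the bits in both cases" reasoning from the comment can be
modelled in userspace. This is only a sketch: model_reg_write() and its
rw_mask parameter are hypothetical and simply emulate a register in which
some fields are read-only; the two bit definitions are the ones added by
this patch.

```c
#include <stdint.h>

#define MTL_OP_MODE_TXQEN    (1u << 3)      /* BIT(3) */
#define MTL_OP_MODE_TQS_MASK (0x1ffu << 16) /* GENMASK(24, 16) */

/*
 * Hypothetical model of a register write: only bits set in rw_mask are
 * writable, all other bits keep their current (reset) value.
 */
static uint32_t model_reg_write(uint32_t current, uint32_t value,
				uint32_t rw_mask)
{
	return (current & ~rw_mask) | (value & rw_mask);
}

/* What the patch ORs into mtl_tx_op before the writel(). */
static uint32_t mtl_tx_op_enable_q0(uint32_t mtl_tx_op)
{
	return mtl_tx_op | MTL_OP_MODE_TXQEN | MTL_OP_MODE_TQS_MASK;
}
```

If TXQEN and TQS are excluded from rw_mask (the DWC_EQOS_NUM_TXQ == 1
case), the modelled write leaves their reset values untouched; if they
are included, queue 0 gets enabled with the whole TQS field set.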
^ permalink raw reply related
* Re: [RFC PATCH net v2 0/3] Fix OdroidC2 Gigabit Tx link issue
From: Martin Blumenstingl @ 2016-11-24 14:40 UTC (permalink / raw)
To: Jerome Brunet
Cc: netdev, devicetree, Florian Fainelli, Carlo Caione, Kevin Hilman,
Giuseppe Cavallaro, Alexandre TORGUE, Andre Roth, Neil Armstrong,
linux-amlogic, linux-arm-kernel, linux-kernel
In-Reply-To: <1479742524-30222-1-git-send-email-jbrunet@baylibre.com>
Hi Jerome,
On Mon, Nov 21, 2016 at 4:35 PM, Jerome Brunet <jbrunet@baylibre.com> wrote:
> This patchset fixes an issue with the OdroidC2 board (DWMAC + RTL8211F).
> Initially reported as a low Tx throughput issue at gigabit speed, the
> platform enters LPI too often. This eventually break the link (both Tx
> and Rx), and require to bring the interface down and up again to get the
> Rx path working again.
>
> The root cause of this issue is not fully understood yet but disabling EEE
> advertisement on the PHY prevent this feature to be negotiated.
> With this change, the link is stable and reliable, with the expected
> throughput performance.
I have just sent a series which allows configuring the TX delay on the
MAC (dwmac-meson8b glue) side: [0]
Disabling the TX delay generated by the MAC fixes TX throughput for
me, even when leaving EEE enabled in the RTL8211F PHY driver!
Unfortunately the RTL8211F PHY is a black-box for the community
because there is no public datasheet available.
*maybe* (pure speculation!) they're enabling the TX delay based on
some internal magic only when EEE is enabled.
Jerome, could you please re-test the behavior on your Odroid-C2 when
you have EEE still enabled but the TX-delay disabled?
In my case throughput is fine, and "$ ethtool -S eth0 | grep lpi" gives:
irq_tx_path_in_lpi_mode_n: 0
irq_tx_path_exit_lpi_mode_n: 0
irq_rx_path_in_lpi_mode_n: 0
irq_rx_path_exit_lpi_mode_n: 0
Regards,
Martin
[0] http://lists.infradead.org/pipermail/linux-amlogic/2016-November/001674.html
^ permalink raw reply
* Re: [PATCH net-next 1/4] net: mvneta: Convert to be 64 bits compatible
From: Gregory CLEMENT @ 2016-11-24 15:01 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linux-arm-kernel, Jisheng Zhang, Marcin Wojtas, Thomas Petazzoni,
Andrew Lunn, Jason Cooper, netdev, linux-kernel, David S. Miller,
Sebastian Hesselbarth
In-Reply-To: <21520380.oWTKcrq8DS@wuerfel>
Hi Arnd,
On jeu., nov. 24 2016, Arnd Bergmann <arnd@arndb.de> wrote:
> On Thursday, November 24, 2016 4:37:36 PM CET Jisheng Zhang wrote:
>> solB (a SW shadow cookie) perhaps gives a better performance: in hot path,
>> such as mvneta_rx(), the driver accesses buf_cookie and buf_phys_addr of
>> rx_desc which is allocated by dma_alloc_coherent, it's noncacheable if the
>> device isn't cache-coherent. I didn't measure the performance difference,
>> because in fact we take solA as well internally. From your experience,
>> can the performance gain deserve the complex code?
>
> Yes, a read from uncached memory is fairly slow, so if you have a chance
> to avoid that it will probably help. When adding complexity to the code,
> it probably makes sense to take a runtime profile anyway quantify how
> much it gains.
>
> On machines that have cache-coherent DMA, accessing the descriptor
> should be fine, as you already have to load the entire cache line
> to read the status field.
>
> Looking at this snippet:
>
> rx_status = rx_desc->status;
> rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
> data = (unsigned char *)rx_desc->buf_cookie;
> phys_addr = rx_desc->buf_phys_addr;
> pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
> bm_pool = &pp->bm_priv->bm_pools[pool_id];
>
> if (!mvneta_rxq_desc_is_first_last(rx_status) ||
> (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
> err_drop_frame_ret_pool:
> /* Return the buffer to the pool */
> mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
> rx_desc->buf_phys_addr);
> err_drop_frame:
>
>
> I think there is more room for optimizing if you start: you read
> the status field twice (the second one in MVNETA_RX_GET_BM_POOL_ID)
> and you can cache the buf_phys_addr along with the virtual address
> once you add that.
I agree we can optimize this code, but that is not related to the 64-bit
conversion. This part indeed runs when we use the HW buffer
management; however, it is currently not ready for 64 bits at all.
The virtual address is handled directly by the hardware, but the cookie
has only 32 bits to store it. So if we want to use the HWBM on
64 bits we need to redesign the code (maybe by storing the virtual
address in an array and passing the index in the cookie).
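To make the array idea a bit more concrete, here is a rough userspace
sketch. Nothing in it comes from the mvneta driver: the table size and
the names cookie_table, cookie_store() and cookie_take() are made up for
illustration, and a real implementation would need locking and a faster
free-slot search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COOKIE_TABLE_SIZE 256 /* hypothetical pool size */

/* Hypothetical table mapping 32-bit cookies to full-width virtual addresses. */
struct cookie_table {
	void *vaddr[COOKIE_TABLE_SIZE];
};

/* Store vaddr in the first free slot; return its index, or UINT32_MAX if full. */
static uint32_t cookie_store(struct cookie_table *t, void *vaddr)
{
	uint32_t i;

	assert(vaddr != NULL); /* NULL marks a free slot */
	for (i = 0; i < COOKIE_TABLE_SIZE; i++) {
		if (!t->vaddr[i]) {
			t->vaddr[i] = vaddr;
			return i;
		}
	}
	return UINT32_MAX;
}

/* Translate a cookie read back from a descriptor and free its slot. */
static void *cookie_take(struct cookie_table *t, uint32_t cookie)
{
	void *vaddr;

	if (cookie >= COOKIE_TABLE_SIZE)
		return NULL;
	vaddr = t->vaddr[cookie];
	t->vaddr[cookie] = NULL;
	return vaddr;
}
```

The index fits in the 32-bit cookie regardless of the pointer width, at
the cost of one extra table lookup per buffer.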
Gregory
>
> Generally speaking, I'd recommend using READ_ONCE()/WRITE_ONCE()
> to access the descriptor fields, to ensure the compiler doesn't
> add extra references as well as to annotate the expensive
> operations.
>
> Arnd
--
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
^ permalink raw reply
* Re: [PATCH iproute2 0/2] tc/cls_flower: Support for ip tunnel metadata set/release/classify
From: Amir Vadai @ 2016-11-24 15:06 UTC (permalink / raw)
To: Jiri Benc
Cc: Stephen Hemminger, David S. Miller, netdev, Or Gerlitz,
Hadar Har-Zion, Roi Dayan
In-Reply-To: <20161124143856.43fa54d6@griffin>
On Thu, Nov 24, 2016 at 02:38:56PM +0100, Jiri Benc wrote:
> On Mon, 21 Nov 2016 12:20:54 +0200, Amir Vadai wrote:
> > $ tc filter add dev vxlan0 protocol ip parent ffff: \
> > flower \
> > enc_src_ip 11.11.0.2 \
> > enc_dst_ip 11.11.0.1 \
> > enc_key_id 11 \
> > dst_ip 11.11.11.1 \
> > action tunnel_key release \
> > action mirred egress redirect dev vnet0
>
> I really hate the "action tunnel_key release". This just exposes the
> kernel internal implementation detail (dst_metadata) to the user. Why
> should the user care about explicit releasing of the tunnel key? This
> should happen automatically. Users do not care about our internal
> implementation.
I see.
So you mean to just unconditionally call skb_dst_drop() from
act_mirred()?
>
> > $ tc filter add dev net0 protocol ip parent ffff: \
> > flower \
> > ip_proto 1 \
> > dst_ip 11.11.11.2 \
> > action tunnel_key set \
> > src_ip 11.11.0.1 \
> > dst_ip 11.11.0.2 \
> > id 11 \
> > action mirred egress redirect dev vxlan0
>
> Do you see the asymmetry? This is not called "alloc tunnel_key", and
> rightly so. It's very reasonable to call this "set", as it is what the
> action looks like to the user.
>
> The only argument for the existence of an explicit "release" (we should
> rather call it "unset" in such case, though) is forwarding between two
> tunnels, where metadata from the first tunnel will be used for
> encapsulation done by the second tunnel. Or a similar case when there's
> classification based on the tunnel metadata done on the mirred
> interface. Somewhat corner cases, though. If we want to support them,
> then let's call the action "unset" and not "release". And in any case,
> it should not be mandatory to specify it, which should be made clear
> in the documentation (including examples where it is needed - basically
> only when forwarding between tunnels).
The use case we already have that uses the release action is the
hardware offload support, which is already in the kernel.
It is using the "tunnel_key release" action to signal the hardware to
strip off the ip tunnel headers.
I need to go over this again and see how we can make it work without the
release/unset action.
>
> Jiri
^ permalink raw reply
* [PATCH V3 net-next 01/15] net: introduce keepalive function in struct proto
From: Ursula Braun @ 2016-11-24 15:06 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, utz.bacher,
ubraun
In-Reply-To: <20161124150645.90881-1-ubraun@linux.vnet.ibm.com>
Calling tcp_set_keepalive() directly from the protocol-agnostic
sock_setsockopt() function in net/core/sock.c violates network
layering. In addition, the newly introduced protocol (SMC-R) will need
its own keepalive function. Therefore, add a "keepalive" function pointer
to "struct proto", and call it from sock_setsockopt() via this pointer.
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
---
include/net/sock.h | 1 +
net/core/sock.c | 7 ++-----
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/tcp_timer.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
5 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 442cbb1..cb4359b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -993,6 +993,7 @@ struct proto {
int (*getsockopt)(struct sock *sk, int level,
int optname, char __user *optval,
int __user *option);
+ void (*keepalive)(struct sock *sk, int valbool);
#ifdef CONFIG_COMPAT
int (*compat_setsockopt)(struct sock *sk,
int level,
diff --git a/net/core/sock.c b/net/core/sock.c
index 14e6145..ac8137d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -762,11 +762,8 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
goto set_rcvbuf;
case SO_KEEPALIVE:
-#ifdef CONFIG_INET
- if (sk->sk_protocol == IPPROTO_TCP &&
- sk->sk_type == SOCK_STREAM)
- tcp_set_keepalive(sk, valbool);
-#endif
+ if (sk->sk_prot->keepalive)
+ sk->sk_prot->keepalive(sk, valbool);
sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
break;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5555eb8..70f5524 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2375,6 +2375,7 @@ struct proto tcp_prot = {
.shutdown = tcp_shutdown,
.setsockopt = tcp_setsockopt,
.getsockopt = tcp_getsockopt,
+ .keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
.sendpage = tcp_sendpage,
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 3ea1cf8..9b1602a 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -617,6 +617,7 @@ void tcp_set_keepalive(struct sock *sk, int val)
else if (!val)
inet_csk_delete_keepalive_timer(sk);
}
+EXPORT_SYMBOL(tcp_set_keepalive);
static void tcp_keepalive_timer (unsigned long data)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 28ec0a2..5d4f58f 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1886,6 +1886,7 @@ struct proto tcpv6_prot = {
.shutdown = tcp_shutdown,
.setsockopt = tcp_setsockopt,
.getsockopt = tcp_getsockopt,
+ .keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
.sendpage = tcp_sendpage,
--
2.8.4
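The dispatch pattern this patch introduces can be shown in isolation with
reduced stand-in types. This is a userspace sketch only: proto_sketch,
sock_sketch, tcp_like_keepalive() and set_keepalive() are hypothetical
names, not kernel structures or functions.

```c
#include <stddef.h>

struct sock_sketch;

/* Reduced stand-in for "struct proto": only the new optional hook. */
struct proto_sketch {
	void (*keepalive)(struct sock_sketch *sk, int valbool);
};

struct sock_sketch {
	const struct proto_sketch *prot;
	int keepopen;    /* stands in for the SOCK_KEEPOPEN flag */
	int timer_armed; /* set by the protocol-specific hook */
};

/* Protocol-specific hook, playing the role of tcp_set_keepalive(). */
static void tcp_like_keepalive(struct sock_sketch *sk, int valbool)
{
	sk->timer_armed = valbool;
}

/* Protocol-agnostic path, mirroring the new SO_KEEPALIVE case. */
static void set_keepalive(struct sock_sketch *sk, int valbool)
{
	if (sk->prot->keepalive)
		sk->prot->keepalive(sk, valbool);
	sk->keepopen = valbool;
}
```

A protocol that leaves the pointer NULL still gets the generic flag
handling, which matches the NULL check in the sock_setsockopt() hunk.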
^ permalink raw reply related
* [PATCH V3 net-next 00/15] net/smc: Shared Memory Communications - RDMA
From: Ursula Braun @ 2016-11-24 15:06 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, utz.bacher,
ubraun
Dave,
here is now V3 of the SMC-R patches having processed your feedback from end
of September. The most important change is the replacement of procfs by a
netlink solution in patch 15 similar to sock_diag and inet_diag.
New checkpatch warnings are resolved.
V3 changes:
Patch 05: Remove unneeded DEFINE_WAIT
Patch 06: Improve synchronization of link group creation
Patch 07: Rename peer_rmbe_len into peer_rmbe_size to be more consistent
Patch 09: Avoid calls of ib_get_memory_region with IB_ACCESS_LOCAL_WRITE,
use new default local_dma_lkey from protection domain as lkey
instead.
Remove no longer needed function smc_ib_dereg_memory_region().
Patch 14: Switch to state ACTIVE only if still in state INIT.
Return 0 for recvmsg invoked in a socket closing state.
Allow getname call in state APPCLOSEWAIT1
Do not trigger destruction of a socket-in-error queued in accept
queue.
During cleanup of accept queue, make sure sockets are destructed,
and sockets in fallback mode are handled appropriately.
When freeing sndbufs/rmbs, remove them from their list and free
the entry.
Use add_wait_queue() and remove_wait_queue() in close wait
functions.
If actively closing a socket in state PEERFINCLOSEWAIT, keep
this state.
If passively closing a socket while bytes are to be received, move
to state APPCLOSEWAIT1.
If actively aborting a socket, skip sending the close_abort flag,
since RDMA communication is no longer possible.
When terminating a link group, do not schedule link group freeing a
2nd time, since already done when unregistering the last remaining
connection.
Patch 15: Introduce smc_diag module for monitoring SMC protocol sockets.
This replaces the old patch 0015 dealing with procfs.
V2 changes:
Patch 0002: Add SMC versions for family key strings in net/core/sock.c.
Patch 0006: initialize rb_tree.
Patch 0007: Get rid of unneeded use of xchg() in smc_sndbuf_unuse() and
smc_rmb_unuse().
Patch 0008: Correct error checking logic for ib_function calls.
Define struct smc_link field wr_tx_id as atomic_long_t.
Use "do_div" instead of "%" to be architecture-independent.
Patch 0009: Correct error checking logic for ib_function calls.
Patch 0011: Remove xchg() calls in cursor handling. Use atomic64_t for cursor
overlays on 64-bit architectures. If not available, use plain u64
and add locking for cursor reading and writing.
Implement smc_curs_add() without modulo operator "%".
Patch 0012: Remove xchg() calls in cursor handling.
Implement smc_tx_rdma_writes() without modulo operator "%".
Patch 0013: Remove xchg() calls in cursor handling.
Patch 0014: Return type bool in smc_wr_tx_has_pending().
Remove unneeded semicolon in smc_close_shutdown_write().
Call smc_close_active() in non-fallback case only.
Get rid of duplicate schedule of sock_put_work().
Take nested sock_lock in smc_listen_work().
Start close stream_wait in case of prepared sends only.
Patch 0015: Remove unneeded socket ref_count in smc_proc_seq_show().
Take lock before list_empty check in smc_proc_sock_list_del().
These patches are the initial part of the implementation of the
"Shared Memory Communications-RDMA" (SMC-R) protocol as defined in
RFC7609 [1]. While SMC-R does not aim to replace TCP,
it taps a wealth of existing data center TCP socket applications
to become more efficient without the need for rewriting them.
SMC-R uses RDMA over Converged Ethernet (RoCE) to save CPU consumption.
For instance, when running 10 parallel connections with uperf, we measured
a decrease of 60% in CPU consumption with SMC-R compared to TCP/IP
(with throughput and latency comparable;
measured on x86_64 with the same RoCE card and port).
SMC-R does not require an RDMA communication manager (RDMA CM).
SMC-R inherits TCP qualities such as reliable connections, host-based
firewall packet filtering (on connection establishment) and unmodified
application of communication encryption such as TLS (transport layer
security) or SSL (secure sockets layer). Since original TCP is used to
establish SMC-R connections, load balancers and packet inspection based
on TCP/IP connection establishment continue to work for SMC-R.
On the other hand, using SMC-R implies:
- either involving a preload library when invoking the unchanged TCP-application
or slightly modifying the source by simply changing the socket family in
the socket() call
- accepting extra overhead and latency in connection establishment due to
SMC Connection Layer Control (CLC) handshake
- explicit coupling of RoCE ports with Ethernet ports
- not routable as currently built on RoCE V1
- bypassing of packet-based networking features
- filtering (netfilter)
- sniffing (libpcap, packet sockets, (E)BPF)
- traffic control (scheduling, shaping)
- bypassing of IP-header based socket options
- bypassing of memory buffer (pressure) management
- unusable together with IPsec
Overview of the SMC-R Protocol described in informational RFC 7609
SMC-R is an open protocol that provides RDMA capabilities over RoCE
transparently for applications exploiting TCP sockets.
A new socket protocol family PF_SMC is introduced.
There are no changes required to applications using the sockets API for TCP
stream sockets other than the specification of the new socket family AF_SMC.
Unmodified applications can be used by means of a dynamic preload shared
library which rewrites the socket API call
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) into
socket(AF_SMC, SOCK_STREAM, IPPROTO_TCP).
SMC-R re-uses the address family AF_INET for all addressing purposes around
struct sockaddr.
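The decision such a preload library has to make could look roughly like
the sketch below. It is not taken from any existing library:
smc_pick_family() is a hypothetical helper, the AF_SMC fallback value is
an assumption for systems whose headers do not define it yet, and also
matching protocol 0 is an assumption beyond the rewrite described above.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef AF_SMC
#define AF_SMC 43 /* assumption: value used when the headers lack AF_SMC */
#endif

/*
 * Hypothetical decision helper for a preload library: return the address
 * family that should be passed on to the real socket() call.
 */
static int smc_pick_family(int domain, int type, int protocol)
{
	/* Only plain TCP stream sockets over AF_INET are redirected. */
	if (domain == AF_INET && type == SOCK_STREAM &&
	    (protocol == 0 || protocol == IPPROTO_TCP))
		return AF_SMC;

	/* Everything else is passed through unchanged. */
	return domain;
}
```

A real shim would wrap socket() itself and forward the call with the
returned family; that part is left out here on purpose.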
SMC-R system architecture layers:
+=============================================================================+
| | unmodified TCP application |
| native SMC application +--------------------------------------+
| | dynamic preload shared library |
+=============================================================================+
| SMC socket |
+-----------------------------------------------------------------------------+
| | TCP socket (for connection establishment and fallback) |
| IB verbs +--------------------------------------------------------+
| | IP |
+--------------------+--------------------------------------------------------+
| RoCE device driver | some network device driver |
+=============================================================================+
Terms:
A link group is determined by an ordered peer pair of TCP client and TCP server
(IP addresses and subnet). Reversed client/server roles result in a separate link group.
A link is a logical point-to-point connection based on an
infiniband reliable connected queue pair (RC-QP) between two RoCE ports
(MACs and GIDs) of a peer pair.
A link group can have 1..8 links for failover and load balancing.
This initial Linux implementation always has 1 link per link group.
Each link group on a peer can have 1..255 remote memory buffers (RMBs).
If more RMBs are needed, a peer can open another link group
(this initial Linux implementation) or fall back to TCP.
Each RMB has its own particular size and its own (R)DMA mapping and credentials
(rtoken consisting of rkey and RDMA "virtual address").
This initial Linux implementation uses physically contiguous memory for RMBs
but we are working towards scattered memory because of memory fragmentation.
Each RMB has 1..255 RMB elements (RMBEs) of equal size
to provide multiplexing of connections within an RMB.
An RMBE is the RDMA Write destination organized as wrapping ring buffer
for data transmit of a particular connection in one direction
(duplex by means of mirror symmetry as with TCP).
This initial Linux implementation always has 1 RMBE per RMB
and thus an individual RMB for each connection.
SMC-R connection establishment with subsequent data transfer:
CLIENT SERVER
TCP three-way handshake:
regular TCP SYN
-------------------------------------------------------->
regular TCP SYN ACK
<--------------------------------------------------------
regular TCP ACK
-------------------------------------------------------->
SMC Connection Layer Control (CLC) handshake
exchanges RDMA credentials between peers:
via above TCP connection: SMC CLC Proposal
-------------------------------------------------------->
via above TCP connection: SMC CLC Accept
<--------------------------------------------------------
via above TCP connection: SMC CLC Confirm
-------------------------------------------------------->
SMC Link Layer Control (LLC) (only once per link, i.e. 1st conn. of link group):
RoCE RC-QP: SMC LLC Confirm Link
<========================================================
RoCE RC-QP: SMC LLC Confirm Link response
========================================================>
SMC data transmission (incl. SMC Connection Data Control (CDC) message):
RoCE RC-QP: RDMA Write
========================================================>
RoCE RC-QP: SMC CDC message (flow control)
========================================================>
...
RoCE RC-QP: RDMA Write
<========================================================
RoCE RC-QP: SMC CDC message (flow control)
<========================================================
...
Data flow within an established connection:
+----------------------------------------------------------------------------
| SENDER
| sendmsg()
| |
| | produces into sndbuf [sender's process context]
| v
| +--------+
| | sndbuf | [ring buffer]
| +--------+
| |
| | consumes from sndbuf and produces into receiver's RMBE [any context]
| | by sending RDMA Write followed by SMC CDC message over RoCE RC-QP
| |
+----|-----------------------------------------------------------------------
|
+----|-----------------------------------------------------------------------
| v RECEIVER
| +------+
| | RMBE | [ring buffer, can have size different from sender's sndbuf]
| | | [RMBE represents rcvbuf, no further de-coupling as on sender side]
| +------+
| |
| | consumes from RMBE [receiver's process context]
| v
| recvmsg()
+----------------------------------------------------------------------------
Flow control ("cursor" updates) by means of SMC CDC messages:
SENDER RECEIVER
sends updates via CDC-------------+ sends updates via CDC
on consuming from sndbuf | on consuming from RMBE
and producing into RMBE | by means of recvmsg()
| |
| |
+-----------------------------------|------------+
| |
+--v-------------------------+ +--v-----------------------+
| receiver's consumer cursor | | sender's producer cursor----+
+----------------|-----------+ +--------------------------+ |
| |
| receiver's RMBE |
| +--------------------------+ |
| | | |
+--------------------------------+ | |
| | | |
| v | |
| +------------| |
|-------------+////////////| |
|//RDMA data written by////| |
|////sender that is////////| |
|/available to be consumed/| |
|///////// +---------------| |
|----------+^ | |
| | | |
| +-----------------+
| |
+--------------------------+
Sending updates of the producer cursor is immediate for low latency;
something like Nagle's algorithm (absence of TCP_NODELAY) is optional and
currently not part of this initial Linux implementation.
Sending updates of the consumer cursor is conditional to avoid the
silly window syndrome.
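The cursor handling on a wrapping ring buffer can be sketched as follows.
This is a simplified userspace model, not the code from the patches: the
struct layout and the names ring_cursor, cursor_add() and cursor_diff()
are assumptions, and it relies on the producer never being more than one
buffer size ahead of the consumer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cursor: byte offset into the ring plus a wrap counter. */
struct ring_cursor {
	uint16_t wrap;  /* number of times the ring has wrapped */
	uint32_t count; /* offset within the ring, always < size */
};

/* Advance a cursor by value bytes (value <= size) without using '%'. */
static void cursor_add(uint32_t size, struct ring_cursor *c, uint32_t value)
{
	assert(value <= size && c->count < size);
	c->count += value;
	if (c->count >= size) {
		c->wrap++;
		c->count -= size;
	}
}

/* Bytes produced but not yet consumed; prod is at most size ahead of cons. */
static uint32_t cursor_diff(uint32_t size, const struct ring_cursor *cons,
			    const struct ring_cursor *prod)
{
	if (prod->wrap != cons->wrap)
		return size - cons->count + prod->count;
	return prod->count - cons->count;
}
```

The difference between the producer and consumer cursors is what the
receiver may consume; the space left for the sender is the buffer size
minus that difference.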
Normal connection termination:
Normal connection termination starts transitioning from socket state
ACTIVE via either "Active Close" or "Passive Close".
shutdown rdwr +-----------------+
or close, +-------------->| INIT / CLOSED |<-------------+
send PeerCon|nClosed +-----------------+ | PeerConnClosed
| | | received
| connection | established |
| V |
+----------------+ +-----------------+ +----------------+
|AppFinCloseWait | | ACTIVE | |PeerFinCloseWait|
+----------------+ +-----------------+ +----------------+
| | | |
| Active Close: | |Passive Close: |
| close or | |PeerConnClosed or |
| shutdown wr or| |PeerDoneWriting |
| shutdown rdwr | |received |
| V V |
PeerConnClo|sed +--------------+ +-------------+ | close or
received +--<----|PeerCloseWait1| |AppCloseWait1|--->----+ shutdown rdwr,
| +--------------+ +-------------+ | send
| PeerDoneWri|ting | shutdown wr, | PeerConnClosed
| received | send Pee|rDoneWriting |
| V V |
| +--------------+ +-------------+ |
+--<----|PeerCloseWait2| |AppCloseWait2|--->----+
+--------------+ +-------------+
In state CLOSED, the socket can only be destructed once the application has
issued a close().
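One reading of the normal-termination diagram above, written as a transition
function. This is a simplified sketch, not the implementation in this series:
the enum and function names are invented for illustration, and the sending of
PeerDoneWriting / PeerConnClosed indicators is left out.

```c
#include <assert.h>

enum close_state {
	ST_ACTIVE,
	ST_PEERCLOSEWAIT1,
	ST_PEERCLOSEWAIT2,
	ST_APPCLOSEWAIT1,
	ST_APPCLOSEWAIT2,
	ST_APPFINCLOSEWAIT,
	ST_PEERFINCLOSEWAIT,
	ST_CLOSED,
};

enum close_event {
	EV_APP_SHUT_WR,		/* local shutdown wr */
	EV_APP_CLOSE,		/* local close or shutdown rdwr */
	EV_PEER_DONE_WRITING,	/* PeerDoneWriting received */
	EV_PEER_CONN_CLOSED,	/* PeerConnClosed received */
};

/* Return the state that follows `s` on event `ev`; unknown pairs keep `s`. */
static enum close_state close_next(enum close_state s, enum close_event ev)
{
	assert(s <= ST_CLOSED);
	switch (s) {
	case ST_ACTIVE:
		/* local events start an Active Close, peer events a Passive Close */
		if (ev == EV_APP_SHUT_WR || ev == EV_APP_CLOSE)
			return ST_PEERCLOSEWAIT1;
		return ST_APPCLOSEWAIT1;
	case ST_PEERCLOSEWAIT1:
		if (ev == EV_PEER_DONE_WRITING)
			return ST_PEERCLOSEWAIT2;
		if (ev == EV_PEER_CONN_CLOSED)
			return ST_APPFINCLOSEWAIT;
		break;
	case ST_PEERCLOSEWAIT2:
		if (ev == EV_PEER_CONN_CLOSED)
			return ST_APPFINCLOSEWAIT;
		break;
	case ST_APPCLOSEWAIT1:
		if (ev == EV_APP_SHUT_WR)
			return ST_APPCLOSEWAIT2;
		if (ev == EV_APP_CLOSE)
			return ST_PEERFINCLOSEWAIT;
		break;
	case ST_APPCLOSEWAIT2:
		if (ev == EV_APP_CLOSE)
			return ST_PEERFINCLOSEWAIT;
		break;
	case ST_APPFINCLOSEWAIT:
		if (ev == EV_APP_CLOSE)
			return ST_CLOSED;
		break;
	case ST_PEERFINCLOSEWAIT:
		if (ev == EV_PEER_CONN_CLOSED)
			return ST_CLOSED;
		break;
	case ST_CLOSED:
		break;
	}
	return s;
}
```

For example, shutdown wr followed by PeerDoneWriting, PeerConnClosed and a
final close() walks ACTIVE -> PeerCloseWait1 -> PeerCloseWait2 ->
AppFinCloseWait -> CLOSED.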
Abnormal connection termination:
+-----------------+
+-------------->| INIT / CLOSED |<-------------+
| +-----------------+ |
| |
| +-----------------------+ |
| | Any state | |
PeerConnAbo|rt | (before setting | | send
received | | PeerConnClosed | | PeerConnAbort
| | indicator in | |
| | peer's RMBE) | |
| +-----------------------+ |
| | | |
| Active Abort: | | Passive Abort: |
| problem, | | PeerConnAbort |
| send | | received, |
| PeerConnAbort,| | ECONNRESET |
| ECONNABORTED | | |
| V V |
| +--------------+ +--------------+ |
+-------|PeerAbortWait | | ProcessAbort |------+
+--------------+ +--------------+
Implementation notes beyond RFC 7609:
A PNET table in sysfs provides the mapping between network device names and
RoCE Infiniband device names for the transparent switch of data communication.
A PNET table can contain an arbitrary number of PNETIDs.
Each PNETID contains exactly one (Ethernet) network device name
and one or more RoCE Infiniband device names.
Each device name can only exist in at most one PNETID (no overlapping).
This initial Linux implementation allows at most one RoCE Infiniband device
name per PNETID.
After a new TCP connection is established, the name of the network device
used for egress traffic with the connection's local source IP address
serves as the key to look up the unique PNETID. The RoCE Infiniband device
of this PNETID is then used to switch data communication from TCP to RDMA
during the SMC CLC handshake.
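The lookup described above can be pictured with a small user-space sketch.
It assumes a flat array as the table representation; the struct layout, the
function name pnet_find_ibdev() and the device names in the usage note are
hypothetical and do not reflect the sysfs interface or kernel data
structures of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one PNET table entry. */
struct pnet_entry {
	const char *pnetid;	/* PNETID label */
	const char *ethdev;	/* the one Ethernet device of this PNETID */
	const char *ibdev;	/* the one RoCE device (initial implementation) */
};

/* Return the RoCE device name for the egress network device,
 * or NULL if no PNETID lists that device.
 */
static const char *pnet_find_ibdev(const struct pnet_entry *tbl, size_t n,
				   const char *ethdev)
{
	size_t i;

	assert(tbl != NULL && ethdev != NULL);
	for (i = 0; i < n; i++)
		if (strcmp(tbl[i].ethdev, ethdev) == 0)
			return tbl[i].ibdev;
	return NULL;
}
```

Because each device name may appear in at most one PNETID, the first match
is the only match. With entries { "NET1", "eth0", "mlx4_0" } and
{ "NET2", "eth1", "mlx4_1" }, a lookup for "eth1" returns "mlx4_1"; a NULL
result would presumably leave the connection on TCP.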
Problem determination:
A protocol dissector is available with upstream wireshark for formatting
SMC-R related RoCE LAN traffic.
[https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-smcr.c]
We are working on enhancing the Linux implementation to cover:
- Improve default socket closing asynchronicity
- Address corner cases with many parallel connections
- Tracing
- Integrated load balancing and fail-over within a link group
- Splice and sendpage support
- IPv6 addressing support
- Keepalive, Cork
- Namespaces support
- Urgent data
- More socket options
- Diagnostics
- Statistics support
- SNMP support
References:
[1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
Thomas Richter (1):
smc: establish pnet table management
Ursula Braun (14):
net: introduce keepalive function in struct proto
smc: establish new socket family
smc: introduce SMC as an IB-client
smc: CLC handshake (incl. preparation steps)
smc: connection and link group creation
smc: remote memory buffers (RMBs)
smc: work request (WR) base for use by LLC and CDC
smc: initialize IB transport incl. PD, MR, QP, CQ, event, WR
smc: link layer control (LLC)
smc: connection data control (CDC)
smc: send data (through RDMA)
smc: receive data from RMBE
smc: socket closing and linkgroup cleanup
smc: netlink interface for SMC sockets
MAINTAINERS | 7 +
include/linux/socket.h | 7 +-
include/net/smc.h | 20 +
include/net/sock.h | 4 +
include/uapi/linux/netlink.h | 1 +
include/uapi/linux/smc_diag.h | 85 +++
net/Kconfig | 1 +
net/Makefile | 1 +
net/core/sock.c | 13 +-
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/tcp_timer.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
net/smc/Kconfig | 20 +
net/smc/Makefile | 4 +
net/smc/af_smc.c | 1417 +++++++++++++++++++++++++++++++++++++++++
net/smc/smc.h | 272 ++++++++
net/smc/smc_cdc.c | 302 +++++++++
net/smc/smc_cdc.h | 218 +++++++
net/smc/smc_clc.c | 281 ++++++++
net/smc/smc_clc.h | 116 ++++
net/smc/smc_close.c | 442 +++++++++++++
net/smc/smc_close.h | 28 +
net/smc/smc_core.c | 675 ++++++++++++++++++++
net/smc/smc_core.h | 179 ++++++
net/smc/smc_diag.c | 215 +++++++
net/smc/smc_ib.c | 479 ++++++++++++++
net/smc/smc_ib.h | 69 ++
net/smc/smc_llc.c | 158 +++++
net/smc/smc_llc.h | 63 ++
net/smc/smc_pnet.c | 611 ++++++++++++++++++
net/smc/smc_pnet.h | 27 +
net/smc/smc_rx.c | 217 +++++++
net/smc/smc_rx.h | 23 +
net/smc/smc_tx.c | 483 ++++++++++++++
net/smc/smc_tx.h | 35 +
net/smc/smc_wr.c | 614 ++++++++++++++++++
net/smc/smc_wr.h | 106 +++
37 files changed, 7187 insertions(+), 9 deletions(-)
create mode 100644 include/net/smc.h
create mode 100644 include/uapi/linux/smc_diag.h
create mode 100644 net/smc/Kconfig
create mode 100644 net/smc/Makefile
create mode 100644 net/smc/af_smc.c
create mode 100644 net/smc/smc.h
create mode 100644 net/smc/smc_cdc.c
create mode 100644 net/smc/smc_cdc.h
create mode 100644 net/smc/smc_clc.c
create mode 100644 net/smc/smc_clc.h
create mode 100644 net/smc/smc_close.c
create mode 100644 net/smc/smc_close.h
create mode 100644 net/smc/smc_core.c
create mode 100644 net/smc/smc_core.h
create mode 100644 net/smc/smc_diag.c
create mode 100644 net/smc/smc_ib.c
create mode 100644 net/smc/smc_ib.h
create mode 100644 net/smc/smc_llc.c
create mode 100644 net/smc/smc_llc.h
create mode 100644 net/smc/smc_pnet.c
create mode 100644 net/smc/smc_pnet.h
create mode 100644 net/smc/smc_rx.c
create mode 100644 net/smc/smc_rx.h
create mode 100644 net/smc/smc_tx.c
create mode 100644 net/smc/smc_tx.h
create mode 100644 net/smc/smc_wr.c
create mode 100644 net/smc/smc_wr.h
--
2.8.4
^ permalink raw reply
* [PATCH V3 net-next 02/15] smc: establish new socket family
From: Ursula Braun @ 2016-11-24 15:06 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, utz.bacher,
ubraun
In-Reply-To: <20161124150645.90881-1-ubraun@linux.vnet.ibm.com>
* enable smc module loading and unloading
* register new socket family
* basic smc socket creation and deletion
* use backing TCP socket to run CLC (Connection Layer Control)
handshake of SMC protocol
* Setup for infiniband traffic is implemented in follow-on patches.
For now fallback to TCP socket is always used.
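
From user space, the new family would be used much like a TCP socket. The
following is a minimal sketch under the assumptions visible in this patch
(AF_SMC is 43, only SOCK_STREAM is accepted, and addresses stay of type
AF_INET); the helper names are made up for this example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef AF_SMC
#define AF_SMC 43	/* value reserved by this patch in linux/socket.h */
#endif

/* Fill an AF_INET address; returns 0 on success, -1 on a malformed IP. */
static int smc_fill_addr(struct sockaddr_in *sa, const char *ip, uint16_t port)
{
	assert(sa != NULL && ip != NULL);
	memset(sa, 0, sizeof(*sa));
	sa->sin_family = AF_INET;	/* smc_connect() rejects other families */
	sa->sin_port = htons(port);
	return inet_pton(AF_INET, ip, &sa->sin_addr) == 1 ? 0 : -1;
}

/* Try an SMC stream socket first; use plain TCP if that fails,
 * e.g. because the smc module is not available.
 */
static int smc_or_tcp_socket(void)
{
	int fd = socket(AF_SMC, SOCK_STREAM, 0);

	if (fd < 0)
		fd = socket(AF_INET, SOCK_STREAM, 0);
	return fd;
}
```

The returned descriptor can then be passed to connect() or bind() with the
sockaddr_in filled in by smc_fill_addr(), exactly as for a TCP socket.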
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
---
MAINTAINERS | 7 +
include/linux/socket.h | 7 +-
net/Kconfig | 1 +
net/Makefile | 1 +
net/core/sock.c | 6 +-
net/smc/Kconfig | 11 +
net/smc/Makefile | 2 +
net/smc/af_smc.c | 622 +++++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc.h | 37 +++
9 files changed, 690 insertions(+), 4 deletions(-)
create mode 100644 net/smc/Kconfig
create mode 100644 net/smc/Makefile
create mode 100644 net/smc/af_smc.c
create mode 100644 net/smc/smc.h
diff --git a/MAINTAINERS b/MAINTAINERS
index e589ae6..927e6f1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10612,6 +10612,13 @@ S: Maintained
F: drivers/staging/media/st-cec/
F: Documentation/devicetree/bindings/media/stih-cec.txt
+SHARED MEMORY COMMUNICATIONS (SMC) SOCKETS
+M: Ursula Braun <ubraun@linux.vnet.ibm.com>
+L: linux-s390@vger.kernel.org
+W: http://www.ibm.com/developerworks/linux/linux390/
+S: Supported
+F: net/smc/
+
SYNOPSYS DESIGNWARE DMAC DRIVER
M: Viresh Kumar <vireshk@kernel.org>
M: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
diff --git a/include/linux/socket.h b/include/linux/socket.h
index b5cc5a6..a4a1cc7 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -202,8 +202,12 @@ struct ucred {
#define AF_VSOCK 40 /* vSockets */
#define AF_KCM 41 /* Kernel Connection Multiplexor*/
#define AF_QIPCRTR 42 /* Qualcomm IPC Router */
+#define AF_SMC 43 /* smc sockets: reserve number for
+ * PF_SMC protocol family that
+ * reuses AF_INET address family
+ */
-#define AF_MAX 43 /* For now.. */
+#define AF_MAX 44 /* For now.. */
/* Protocol families, same as address families. */
#define PF_UNSPEC AF_UNSPEC
@@ -251,6 +255,7 @@ struct ucred {
#define PF_VSOCK AF_VSOCK
#define PF_KCM AF_KCM
#define PF_QIPCRTR AF_QIPCRTR
+#define PF_SMC AF_SMC
#define PF_MAX AF_MAX
/* Maximum queue length specifiable by listen. */
diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..c6f611e 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -57,6 +57,7 @@ source "net/packet/Kconfig"
source "net/unix/Kconfig"
source "net/xfrm/Kconfig"
source "net/iucv/Kconfig"
+source "net/smc/Kconfig"
config INET
bool "TCP/IP networking"
diff --git a/net/Makefile b/net/Makefile
index 4cafaa2..5d6e0e5f 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_MAC80211) += mac80211/
obj-$(CONFIG_TIPC) += tipc/
obj-$(CONFIG_NETLABEL) += netlabel/
obj-$(CONFIG_IUCV) += iucv/
+obj-$(CONFIG_SMC) += smc/
obj-$(CONFIG_RFKILL) += rfkill/
obj-$(CONFIG_NET_9P) += 9p/
obj-$(CONFIG_CAIF) += caif/
diff --git a/net/core/sock.c b/net/core/sock.c
index ac8137d..401e78f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -222,7 +222,7 @@ static const char *const af_family_key_strings[AF_MAX+1] = {
"sk_lock-AF_RXRPC" , "sk_lock-AF_ISDN" , "sk_lock-AF_PHONET" ,
"sk_lock-AF_IEEE802154", "sk_lock-AF_CAIF" , "sk_lock-AF_ALG" ,
"sk_lock-AF_NFC" , "sk_lock-AF_VSOCK" , "sk_lock-AF_KCM" ,
- "sk_lock-AF_MAX"
+ "sk_lock-AF_SMC" , "sk_lock-AF_MAX"
};
static const char *const af_family_slock_key_strings[AF_MAX+1] = {
"slock-AF_UNSPEC", "slock-AF_UNIX" , "slock-AF_INET" ,
@@ -239,7 +239,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
"slock-AF_RXRPC" , "slock-AF_ISDN" , "slock-AF_PHONET" ,
"slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
"slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_KCM" ,
- "slock-AF_MAX"
+ "slock-AF_SMC" , "slock-AF_MAX"
};
static const char *const af_family_clock_key_strings[AF_MAX+1] = {
"clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
@@ -256,7 +256,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
"clock-AF_RXRPC" , "clock-AF_ISDN" , "clock-AF_PHONET" ,
"clock-AF_IEEE802154", "clock-AF_CAIF" , "clock-AF_ALG" ,
"clock-AF_NFC" , "clock-AF_VSOCK" , "clock-AF_KCM" ,
- "clock-AF_MAX"
+ "clock-AF_SMC" , "clock-AF_MAX"
};
/*
diff --git a/net/smc/Kconfig b/net/smc/Kconfig
new file mode 100644
index 0000000..bc02980
--- /dev/null
+++ b/net/smc/Kconfig
@@ -0,0 +1,11 @@
+config SMC
+ tristate "SMC socket protocol family"
+ depends on INET && INFINIBAND
+ ---help---
+ SMC-R provides a "sockets over RDMA" solution making use of
+ RDMA over Converged Ethernet (RoCE) technology to upgrade
+ AF_INET TCP connections transparently.
+ The Linux implementation of the SMC-R solution is designed as
+ a separate socket family SMC.
+
+ Select this option if you want to run SMC socket applications
diff --git a/net/smc/Makefile b/net/smc/Makefile
new file mode 100644
index 0000000..c285c86
--- /dev/null
+++ b/net/smc/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_SMC) += smc.o
+smc-y := af_smc.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
new file mode 100644
index 0000000..c4f0c41
--- /dev/null
+++ b/net/smc/af_smc.c
@@ -0,0 +1,622 @@
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ * AF_SMC protocol family socket handler keeping the AF_INET sock address type
+ * applies to SOCK_STREAM sockets only
+ * offers an alternative communication option for TCP-protocol sockets
+ * applicable with RoCE-cards only
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
+ * based on prototype from Frank Blaschka
+ */
+
+#define KMSG_COMPONENT "smc"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/socket.h>
+#include <net/sock.h>
+
+#include "smc.h"
+
+static void smc_set_keepalive(struct sock *sk, int val)
+{
+ struct smc_sock *smc = smc_sk(sk);
+
+ smc->clcsock->sk->sk_prot->keepalive(smc->clcsock->sk, val);
+}
+
+static struct proto smc_proto = {
+ .name = "SMC",
+ .owner = THIS_MODULE,
+ .keepalive = smc_set_keepalive,
+ .obj_size = sizeof(struct smc_sock),
+ .slab_flags = SLAB_DESTROY_BY_RCU,
+};
+
+static int smc_release(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+
+ if (!sk)
+ goto out;
+
+ smc = smc_sk(sk);
+ lock_sock(sk);
+
+ sk->sk_state = SMC_CLOSED;
+ if (smc->clcsock) {
+ sock_release(smc->clcsock);
+ smc->clcsock = NULL;
+ }
+
+ /* detach socket */
+ sock_orphan(sk);
+ sock->sk = NULL;
+ release_sock(sk);
+
+ sock_put(sk);
+out:
+ return 0;
+}
+
+static void smc_destruct(struct sock *sk)
+{
+ if (sk->sk_state != SMC_CLOSED)
+ return;
+ if (!sock_flag(sk, SOCK_DEAD))
+ return;
+
+ sk_refcnt_debug_dec(sk);
+}
+
+static struct sock *smc_sock_alloc(struct net *net, struct socket *sock)
+{
+ struct smc_sock *smc;
+ struct sock *sk;
+
+ sk = sk_alloc(net, PF_SMC, GFP_KERNEL, &smc_proto, 0);
+ if (!sk)
+ return NULL;
+
+ sock_init_data(sock, sk); /* sets sk_refcnt to 1 */
+ sk->sk_state = SMC_INIT;
+ sk->sk_destruct = smc_destruct;
+ sk->sk_protocol = SMCPROTO_SMC;
+ sk_refcnt_debug_inc(sk);
+
+ smc = smc_sk(sk);
+ smc->clcsock = NULL;
+ smc->use_fallback = 0;
+
+ return sk;
+}
+
+static int smc_bind(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len)
+{
+ struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+ int rc;
+
+ smc = smc_sk(sk);
+
+ /* replicate tests from inet_bind(), to be safe wrt. future changes */
+ rc = -EINVAL;
+ if (addr_len < sizeof(struct sockaddr_in))
+ goto out;
+
+ rc = -EAFNOSUPPORT;
+ /* accept AF_UNSPEC (mapped to AF_INET) only if s_addr is INADDR_ANY */
+ if ((addr->sin_family != AF_INET) &&
+ ((addr->sin_family != AF_UNSPEC) ||
+ (addr->sin_addr.s_addr != htonl(INADDR_ANY))))
+ goto out;
+
+ lock_sock(sk);
+
+ /* Check if socket is already active */
+ rc = -EINVAL;
+ if (sk->sk_state != SMC_INIT)
+ goto out_rel;
+
+ smc->clcsock->sk->sk_reuse = sk->sk_reuse;
+ rc = kernel_bind(smc->clcsock, uaddr, addr_len);
+
+out_rel:
+ release_sock(sk);
+out:
+ return rc;
+}
+
+static void smc_copy_sock_settings(struct sock *nsk, struct sock *osk,
+ unsigned long mask)
+{
+ /* options we don't get control of via setsockopt */
+ nsk->sk_type = osk->sk_type;
+ nsk->sk_sndbuf = osk->sk_sndbuf;
+ nsk->sk_rcvbuf = osk->sk_rcvbuf;
+ nsk->sk_sndtimeo = osk->sk_sndtimeo;
+ nsk->sk_rcvtimeo = osk->sk_rcvtimeo;
+ nsk->sk_mark = osk->sk_mark;
+ nsk->sk_priority = osk->sk_priority;
+ nsk->sk_rcvlowat = osk->sk_rcvlowat;
+ nsk->sk_bound_dev_if = osk->sk_bound_dev_if;
+ nsk->sk_err = osk->sk_err;
+
+ nsk->sk_flags &= ~mask;
+ nsk->sk_flags |= osk->sk_flags & mask;
+}
+
+#define SK_FLAGS_SMC_TO_CLC ((1UL << SOCK_URGINLINE) | \
+ (1UL << SOCK_KEEPOPEN) | \
+ (1UL << SOCK_LINGER) | \
+ (1UL << SOCK_BROADCAST) | \
+ (1UL << SOCK_TIMESTAMP) | \
+ (1UL << SOCK_DBG) | \
+ (1UL << SOCK_RCVTSTAMP) | \
+ (1UL << SOCK_RCVTSTAMPNS) | \
+ (1UL << SOCK_LOCALROUTE) | \
+ (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE) | \
+ (1UL << SOCK_RXQ_OVFL) | \
+ (1UL << SOCK_WIFI_STATUS) | \
+ (1UL << SOCK_NOFCS) | \
+ (1UL << SOCK_FILTER_LOCKED))
+/* copy only relevant settings and flags of SOL_SOCKET level from smc to
+ * clc socket (since smc is not called for these options from net/core)
+ */
+static void smc_copy_sock_settings_to_clc(struct smc_sock *smc)
+{
+ smc_copy_sock_settings(smc->clcsock->sk, &smc->sk, SK_FLAGS_SMC_TO_CLC);
+}
+
+#define SK_FLAGS_CLC_TO_SMC ((1UL << SOCK_URGINLINE) | \
+ (1UL << SOCK_KEEPOPEN) | \
+ (1UL << SOCK_LINGER) | \
+ (1UL << SOCK_DBG))
+/* copy only settings and flags relevant for smc from clc to smc socket */
+static void smc_copy_sock_settings_to_smc(struct smc_sock *smc)
+{
+ smc_copy_sock_settings(&smc->sk, smc->clcsock->sk, SK_FLAGS_CLC_TO_SMC);
+}
+
+static int smc_connect(struct socket *sock, struct sockaddr *addr,
+ int alen, int flags)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+ int rc = -EINVAL;
+
+ smc = smc_sk(sk);
+
+ /* separate smc parameter checking to be safe */
+ if (alen < sizeof(addr->sa_family))
+ goto out_err;
+ if (addr->sa_family != AF_INET)
+ goto out_err;
+
+ lock_sock(sk);
+ switch (sk->sk_state) {
+ default:
+ goto out;
+ case SMC_ACTIVE:
+ rc = -EISCONN;
+ goto out;
+ case SMC_INIT:
+ rc = 0;
+ break;
+ }
+
+ smc_copy_sock_settings_to_clc(smc);
+ rc = kernel_connect(smc->clcsock, addr, alen, flags);
+ if (rc)
+ goto out;
+
+ sk->sk_state = SMC_ACTIVE;
+
+ /* always use TCP fallback as transport mechanism for now;
+ * This will change once RDMA transport is implemented
+ */
+ smc->use_fallback = 1;
+
+out:
+ release_sock(sk);
+out_err:
+ return rc;
+}
+
+static int smc_clcsock_accept(struct smc_sock *lsmc, struct smc_sock **new_smc)
+{
+ struct sock *sk = &lsmc->sk;
+ struct socket *new_clcsock;
+ struct sock *new_sk;
+ int rc;
+
+ new_sk = smc_sock_alloc(sock_net(sk), NULL);
+ if (!new_sk) {
+ rc = -ENOMEM;
+ lsmc->sk.sk_err = ENOMEM;
+ *new_smc = NULL;
+ goto out;
+ }
+ *new_smc = smc_sk(new_sk);
+
+ rc = kernel_accept(lsmc->clcsock, &new_clcsock, 0);
+ if (rc) {
+ sock_put(new_sk);
+ *new_smc = NULL;
+ goto out;
+ }
+
+ (*new_smc)->clcsock = new_clcsock;
+out:
+ return rc;
+}
+
+static int smc_listen(struct socket *sock, int backlog)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+ int rc;
+
+ smc = smc_sk(sk);
+ lock_sock(sk);
+
+ rc = -EINVAL;
+ if ((sk->sk_state != SMC_INIT) && (sk->sk_state != SMC_LISTEN))
+ goto out;
+
+ rc = 0;
+ if (sk->sk_state == SMC_LISTEN) {
+ sk->sk_max_ack_backlog = backlog;
+ goto out;
+ }
+ /* some socket options are handled in core, so we could not apply
+ * them to the clc socket -- copy smc socket options to clc socket
+ */
+ smc_copy_sock_settings_to_clc(smc);
+
+ rc = kernel_listen(smc->clcsock, backlog);
+ if (rc)
+ goto out;
+ sk->sk_max_ack_backlog = backlog;
+ sk->sk_ack_backlog = 0;
+ sk->sk_state = SMC_LISTEN;
+
+out:
+ release_sock(sk);
+ return rc;
+}
+
+static int smc_accept(struct socket *sock, struct socket *new_sock,
+ int flags)
+{
+ struct smc_sock *new_smc;
+ struct sock *sk = sock->sk;
+ struct smc_sock *lsmc;
+ int rc;
+
+ lsmc = smc_sk(sk);
+ lock_sock(sk);
+
+ if (lsmc->sk.sk_state != SMC_LISTEN) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ rc = smc_clcsock_accept(lsmc, &new_smc);
+ if (rc)
+ goto out;
+ sock_graft(&new_smc->sk, new_sock);
+ new_smc->sk.sk_state = SMC_ACTIVE;
+
+ smc_copy_sock_settings_to_smc(new_smc);
+
+ /* always use TCP fallback as transport mechanism for now;
+ * This will change once RDMA transport is implemented
+ */
+ new_smc->use_fallback = 1;
+
+out:
+ release_sock(sk);
+ return rc;
+}
+
+static int smc_getname(struct socket *sock, struct sockaddr *addr,
+ int *len, int peer)
+{
+ struct smc_sock *smc;
+
+ if (peer && (sock->sk->sk_state != SMC_ACTIVE))
+ return -ENOTCONN;
+
+ smc = smc_sk(sock->sk);
+
+ return smc->clcsock->ops->getname(smc->clcsock, addr, len, peer);
+}
+
+static int smc_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+ int rc = -EPIPE;
+
+ smc = smc_sk(sk);
+ lock_sock(sk);
+ if (sk->sk_state != SMC_ACTIVE)
+ goto out;
+ if (smc->use_fallback)
+ rc = smc->clcsock->ops->sendmsg(smc->clcsock, msg, len);
+ else
+ rc = sock_no_sendmsg(sock, msg, len);
+out:
+ release_sock(sk);
+ return rc;
+}
+
+static int smc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
+ int flags)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+ int rc = -ENOTCONN;
+
+ smc = smc_sk(sk);
+ lock_sock(sk);
+ if ((sk->sk_state != SMC_ACTIVE) && (sk->sk_state != SMC_CLOSED))
+ goto out;
+
+ if (smc->use_fallback)
+ rc = smc->clcsock->ops->recvmsg(smc->clcsock, msg, len, flags);
+ else
+ rc = sock_no_recvmsg(sock, msg, len, flags);
+out:
+ release_sock(sk);
+ return rc;
+}
+
+static unsigned int smc_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
+{
+ struct sock *sk = sock->sk;
+ unsigned int mask = 0;
+ struct smc_sock *smc;
+
+ smc = smc_sk(sock->sk);
+ if ((sk->sk_state == SMC_INIT) || (sk->sk_state == SMC_LISTEN) ||
+ smc->use_fallback) {
+ mask = smc->clcsock->ops->poll(file, smc->clcsock, wait);
+ /* if non-blocking connect finished ... */
+ lock_sock(sk);
+ if ((sk->sk_state == SMC_INIT) && (mask & POLLOUT)) {
+ sk->sk_state = SMC_ACTIVE;
+ /* always use TCP fallback as transport mechanism;
+ * This will change once RDMA transport is implemented
+ */
+ smc->use_fallback = 1;
+ }
+ release_sock(sk);
+ } else {
+ mask = sock_no_poll(file, sock, wait);
+ }
+
+ return mask;
+}
+
+static int smc_shutdown(struct socket *sock, int how)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+ int rc = -EINVAL;
+
+ smc = smc_sk(sk);
+
+ if ((how < SHUT_RD) || (how > SHUT_RDWR))
+ goto out_err;
+
+ lock_sock(sk);
+
+ rc = -ENOTCONN;
+ if (sk->sk_state == SMC_CLOSED)
+ goto out;
+ if (smc->use_fallback) {
+ rc = kernel_sock_shutdown(smc->clcsock, how);
+ sk->sk_shutdown = smc->clcsock->sk->sk_shutdown;
+ if (sk->sk_shutdown == SHUTDOWN_MASK)
+ sk->sk_state = SMC_CLOSED;
+ } else {
+ rc = sock_no_shutdown(sock, how);
+ }
+
+out:
+ release_sock(sk);
+
+out_err:
+ return rc;
+}
+
+static int smc_setsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, unsigned int optlen)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+
+ smc = smc_sk(sk);
+
+ /* generic setsockopts reaching us here always apply to the
+ * CLC socket
+ */
+ return smc->clcsock->ops->setsockopt(smc->clcsock, level, optname,
+ optval, optlen);
+}
+
+static int smc_getsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, int __user *optlen)
+{
+ struct smc_sock *smc;
+
+ smc = smc_sk(sock->sk);
+ /* socket options apply to the CLC socket */
+ return smc->clcsock->ops->getsockopt(smc->clcsock, level, optname,
+ optval, optlen);
+}
+
+static int smc_ioctl(struct socket *sock, unsigned int cmd,
+ unsigned long arg)
+{
+ struct smc_sock *smc;
+
+ smc = smc_sk(sock->sk);
+ if (smc->use_fallback)
+ return smc->clcsock->ops->ioctl(smc->clcsock, cmd, arg);
+ else
+ return sock_no_ioctl(sock, cmd, arg);
+}
+
+static ssize_t smc_sendpage(struct socket *sock, struct page *page,
+ int offset, size_t size, int flags)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+ int rc = -EPIPE;
+
+ smc = smc_sk(sk);
+ lock_sock(sk);
+ if (sk->sk_state != SMC_ACTIVE)
+ goto out;
+ if (smc->use_fallback)
+ rc = kernel_sendpage(smc->clcsock, page, offset,
+ size, flags);
+ else
+ rc = sock_no_sendpage(sock, page, offset, size, flags);
+
+out:
+ release_sock(sk);
+ return rc;
+}
+
+static ssize_t smc_splice_read(struct socket *sock, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags)
+{
+ struct sock *sk = sock->sk;
+ struct smc_sock *smc;
+ int rc = -ENOTCONN;
+
+ smc = smc_sk(sk);
+ lock_sock(sk);
+ if ((sk->sk_state != SMC_ACTIVE) && (sk->sk_state != SMC_CLOSED))
+ goto out;
+ if (smc->use_fallback) {
+ rc = smc->clcsock->ops->splice_read(smc->clcsock, ppos,
+ pipe, len, flags);
+ } else {
+ rc = -EOPNOTSUPP;
+ }
+out:
+ release_sock(sk);
+ return rc;
+}
+
+/* must look like tcp */
+static const struct proto_ops smc_sock_ops = {
+ .family = PF_SMC,
+ .owner = THIS_MODULE,
+ .release = smc_release,
+ .bind = smc_bind,
+ .connect = smc_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = smc_accept,
+ .getname = smc_getname,
+ .poll = smc_poll,
+ .ioctl = smc_ioctl,
+ .listen = smc_listen,
+ .shutdown = smc_shutdown,
+ .setsockopt = smc_setsockopt,
+ .getsockopt = smc_getsockopt,
+ .sendmsg = smc_sendmsg,
+ .recvmsg = smc_recvmsg,
+ .mmap = sock_no_mmap,
+ .sendpage = smc_sendpage,
+ .splice_read = smc_splice_read,
+};
+
+static int smc_create(struct net *net, struct socket *sock, int protocol,
+ int kern)
+{
+ struct smc_sock *smc;
+ struct sock *sk;
+ int rc;
+
+ rc = -ESOCKTNOSUPPORT;
+ if (sock->type != SOCK_STREAM)
+ goto out;
+
+ rc = -EPROTONOSUPPORT;
+ if ((protocol != IPPROTO_IP) && (protocol != IPPROTO_TCP))
+ goto out;
+
+ rc = -ENOBUFS;
+ sock->ops = &smc_sock_ops;
+ sk = smc_sock_alloc(net, sock);
+ if (!sk)
+ goto out;
+
+ /* create internal TCP socket for CLC handshake and fallback */
+ smc = smc_sk(sk);
+ rc = sock_create_kern(net, PF_INET, SOCK_STREAM,
+ IPPROTO_TCP, &smc->clcsock);
+ if (rc)
+ sk_common_release(sk);
+
+out:
+ return rc;
+}
+
+static const struct net_proto_family smc_sock_family_ops = {
+ .family = PF_SMC,
+ .owner = THIS_MODULE,
+ .create = smc_create,
+};
+
+static int __init smc_init(void)
+{
+ int rc;
+
+ rc = proto_register(&smc_proto, 1);
+ if (rc) {
+ pr_err("%s: proto_register fails with %d\n", __func__, rc);
+ goto out;
+ }
+
+ rc = sock_register(&smc_sock_family_ops);
+ if (rc) {
+ pr_err("%s: sock_register fails with %d\n", __func__, rc);
+ goto out_proto;
+ }
+
+ return 0;
+
+out_proto:
+ proto_unregister(&smc_proto);
+out:
+ return rc;
+}
+
+static void __exit smc_exit(void)
+{
+ sock_unregister(PF_SMC);
+ proto_unregister(&smc_proto);
+}
+
+module_init(smc_init);
+module_exit(smc_exit);
+
+MODULE_AUTHOR("Ursula Braun <ubraun@linux.vnet.ibm.com>");
+MODULE_DESCRIPTION("smc socket address family");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_SMC);
diff --git a/net/smc/smc.h b/net/smc/smc.h
new file mode 100644
index 0000000..508f639
--- /dev/null
+++ b/net/smc/smc.h
@@ -0,0 +1,37 @@
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ * Definitions for the SMC module (socket related)
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
+ */
+#ifndef __SMC_H
+#define __SMC_H
+
+#include <linux/socket.h>
+#include <linux/types.h>
+#include <net/sock.h>
+
+#define SMCPROTO_SMC 0 /* SMC protocol */
+
+enum smc_state { /* possible states of an SMC socket */
+ SMC_ACTIVE = 1,
+ SMC_INIT = 2,
+ SMC_CLOSED = 7,
+ SMC_LISTEN = 10,
+};
+
+struct smc_sock { /* smc sock container */
+ struct sock sk;
+ struct socket *clcsock; /* internal tcp socket */
+ u8 use_fallback : 1; /* fallback to tcp */
+};
+
+static inline struct smc_sock *smc_sk(const struct sock *sk)
+{
+ return (struct smc_sock *)sk;
+}
+
+#endif /* __SMC_H */
--
2.8.4
^ permalink raw reply related
* [PATCH V3 net-next 04/15] smc: introduce SMC as an IB-client
From: Ursula Braun @ 2016-11-24 15:06 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, utz.bacher,
ubraun
In-Reply-To: <20161124150645.90881-1-ubraun@linux.vnet.ibm.com>
* create a list of SMC IB-devices (IB-devices mentioned in PNET table)
* determine RoCE device and port belonging to used internal TCP interface
according to the PNET table definitions
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
---
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 10 ++++
net/smc/smc.h | 4 ++
net/smc/smc_ib.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_ib.h | 40 ++++++++++++++
net/smc/smc_pnet.c | 98 +++++++++++++++++++++++++++++++++
net/smc/smc_pnet.h | 8 +++
7 files changed, 318 insertions(+), 1 deletion(-)
create mode 100644 net/smc/smc_ib.c
create mode 100644 net/smc/smc_ib.h
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 64dab53..50f39ff 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -1,2 +1,2 @@
obj-$(CONFIG_SMC) += smc.o
-smc-y := af_smc.o smc_pnet.o
+smc-y := af_smc.o smc_pnet.o smc_ib.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index a58d613..bb80e3a 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -20,6 +20,7 @@
#include <net/sock.h>
#include "smc.h"
+#include "smc_ib.h"
#include "smc_pnet.h"
static void smc_set_keepalive(struct sock *sk, int val)
@@ -604,8 +605,16 @@ static int __init smc_init(void)
goto out_proto;
}
+ rc = smc_ib_register_client();
+ if (rc) {
+ pr_err("%s: ib_register fails with %d\n", __func__, rc);
+ goto out_sock;
+ }
+
return 0;
+out_sock:
+ sock_unregister(PF_SMC);
out_proto:
proto_unregister(&smc_proto);
out_pnet:
@@ -615,6 +624,7 @@ static int __init smc_init(void)
static void __exit smc_exit(void)
{
+ smc_ib_unregister_client();
sock_unregister(PF_SMC);
proto_unregister(&smc_proto);
smc_pnet_exit();
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 508f639..7e6b5b4 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -34,4 +34,8 @@ static inline struct smc_sock *smc_sk(const struct sock *sk)
return (struct smc_sock *)sk;
}
+#define SMC_SYSTEMID_LEN 8
+
+extern u8 local_systemid[SMC_SYSTEMID_LEN]; /* unique system identifier */
+
#endif /* __SMC_H */
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
new file mode 100644
index 0000000..8b6bb50
--- /dev/null
+++ b/net/smc/smc_ib.c
@@ -0,0 +1,157 @@
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ * IB infrastructure:
+ * Establish SMC-R as an Infiniband Client to be notified about added and
+ * removed IB devices of type RDMA.
+ * Determine device and port characteristics for these IB devices.
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
+ */
+
+#include <linux/random.h>
+#include <rdma/ib_verbs.h>
+
+#include "smc_pnet.h"
+#include "smc_ib.h"
+#include "smc.h"
+
+struct smc_ib_devices smc_ib_devices = { /* smc-registered ib devices */
+ .lock = __SPIN_LOCK_UNLOCKED(smc_ib_devices.lock),
+ .list = LIST_HEAD_INIT(smc_ib_devices.list),
+};
+
+#define SMC_LOCAL_SYSTEMID_RESET "%%%%%%%"
+
+u8 local_systemid[SMC_SYSTEMID_LEN] = SMC_LOCAL_SYSTEMID_RESET; /* unique system
+ * identifier
+ */
+
+static int smc_ib_fill_gid_and_mac(struct smc_ib_device *smcibdev, u8 ibport)
+{
+ struct net_device *ndev;
+ int rc;
+
+ rc = ib_query_gid(smcibdev->ibdev, ibport, 0,
+ &smcibdev->gid[ibport - 1], NULL);
+ /* the SMC protocol requires specification of the roce MAC address;
+ * if net_device cannot be determined, it can be derived from gid 0
+ */
+ ndev = smcibdev->ibdev->get_netdev(smcibdev->ibdev, ibport);
+ if (ndev) {
+ memcpy(&smcibdev->mac[ibport - 1], ndev->dev_addr, ETH_ALEN);
+ } else if (!rc) {
+ memcpy(&smcibdev->mac[ibport - 1][0],
+ &smcibdev->gid[ibport - 1].raw[8], 3);
+ memcpy(&smcibdev->mac[ibport - 1][3],
+ &smcibdev->gid[ibport - 1].raw[13], 3);
+ smcibdev->mac[ibport - 1][0] &= ~0x02;
+ }
+ return rc;
+}
+
+/* Create an identifier unique for this instance of SMC-R.
+ * The MAC-address of the first active registered IB device
+ * plus a random 2-byte number is used to create this identifier.
+ * This name is delivered to the peer during connection initialization.
+ */
+static inline void smc_ib_define_local_systemid(struct smc_ib_device *smcibdev,
+ u8 ibport)
+{
+ memcpy(&local_systemid[2], &smcibdev->mac[ibport - 1],
+ sizeof(smcibdev->mac[ibport - 1]));
+ get_random_bytes(&local_systemid[0], 2);
+}
+
+bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport)
+{
+ return smcibdev->pattr[ibport - 1].state == IB_PORT_ACTIVE;
+}
+
+int smc_ib_remember_port_attr(struct smc_ib_device *smcibdev, u8 ibport)
+{
+ int rc;
+
+ memset(&smcibdev->pattr[ibport - 1], 0,
+ sizeof(smcibdev->pattr[ibport - 1]));
+ rc = ib_query_port(smcibdev->ibdev, ibport,
+ &smcibdev->pattr[ibport - 1]);
+ if (rc)
+ goto out;
+ rc = smc_ib_fill_gid_and_mac(smcibdev, ibport);
+ if (rc)
+ goto out;
+ if (!strncmp(local_systemid, SMC_LOCAL_SYSTEMID_RESET,
+ sizeof(local_systemid)) &&
+ smc_ib_port_active(smcibdev, ibport))
+ /* create unique system identifier */
+ smc_ib_define_local_systemid(smcibdev, ibport);
+out:
+ return rc;
+}
+
+static struct ib_client smc_ib_client;
+
+/* callback function for ib_register_client() */
+static void smc_ib_add_dev(struct ib_device *ibdev)
+{
+ struct smc_ib_device *smcibdev;
+ int i;
+
+ if (ibdev->node_type != RDMA_NODE_IB_CA)
+ return;
+
+ smcibdev = kzalloc(sizeof(*smcibdev), GFP_KERNEL);
+ if (!smcibdev)
+ return;
+
+ smcibdev->ibdev = ibdev;
+
+ for (i = 1; i <= SMC_MAX_PORTS; i++) {
+ if (smc_pnet_exists_in_table(smcibdev, i) &&
+ !smcibdev->initialized) {
+ /* dev hotplug: ib device and port is in pnet table */
+ if (smc_ib_remember_port_attr(smcibdev, i)) {
+ kfree(smcibdev);
+ return;
+ }
+ smcibdev->initialized = 1;
+ break;
+ }
+ }
+ spin_lock(&smc_ib_devices.lock);
+ list_add_tail(&smcibdev->list, &smc_ib_devices.list);
+ spin_unlock(&smc_ib_devices.lock);
+ ib_set_client_data(ibdev, &smc_ib_client, smcibdev);
+}
+
+/* callback function for ib_register_client() */
+static void smc_ib_remove_dev(struct ib_device *ibdev, void *client_data)
+{
+ struct smc_ib_device *smcibdev;
+
+ smcibdev = ib_get_client_data(ibdev, &smc_ib_client);
+ ib_set_client_data(ibdev, &smc_ib_client, NULL);
+ spin_lock(&smc_ib_devices.lock);
+ list_del_init(&smcibdev->list); /* remove from smc_ib_devices */
+ spin_unlock(&smc_ib_devices.lock);
+ kfree(smcibdev);
+}
+
+static struct ib_client smc_ib_client = {
+ .name = "smc_ib",
+ .add = smc_ib_add_dev,
+ .remove = smc_ib_remove_dev,
+};
+
+int __init smc_ib_register_client(void)
+{
+ return ib_register_client(&smc_ib_client);
+}
+
+void smc_ib_unregister_client(void)
+{
+ ib_unregister_client(&smc_ib_client);
+}
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
new file mode 100644
index 0000000..63613e7
--- /dev/null
+++ b/net/smc/smc_ib.h
@@ -0,0 +1,40 @@
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ * Definitions for IB environment
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * Author(s): Ursula Braun <Ursula Braun@linux.vnet.ibm.com>
+ */
+
+#ifndef _SMC_IB_H
+#define _SMC_IB_H
+
+#include <rdma/ib_verbs.h>
+
+#define SMC_MAX_PORTS 2 /* Max # of ports */
+#define SMC_GID_SIZE sizeof(union ib_gid)
+
+struct smc_ib_devices { /* list of smc ib devices definition */
+ struct list_head list;
+ spinlock_t lock; /* protects list of smc ib devices */
+};
+
+extern struct smc_ib_devices smc_ib_devices; /* list of smc ib devices */
+
+struct smc_ib_device { /* ib-device infos for smc */
+ struct list_head list;
+ struct ib_device *ibdev;
+ struct ib_port_attr pattr[SMC_MAX_PORTS]; /* ib dev. port attrs */
+ char mac[SMC_MAX_PORTS][6]; /* mac address per port */
+ union ib_gid gid[SMC_MAX_PORTS]; /* gid per port */
+ u8 initialized : 1; /* ib dev CQ, evthdl done */
+};
+
+int smc_ib_register_client(void) __init;
+void smc_ib_unregister_client(void);
+bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport);
+int smc_ib_remember_port_attr(struct smc_ib_device *smcibdev, u8 ibport);
+
+#endif
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index 4512a87..e007137 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -18,6 +18,7 @@
#include <rdma/ib_verbs.h>
+#include "smc_ib.h"
#include "smc_pnet.h"
#define SMC_MAX_PNET_ID_LEN 16 /* Max. length of PNET id */
@@ -185,6 +186,8 @@ static bool smc_pnet_same_ibname(struct smc_pnetentry *a, char *name, u8 ibport)
static int smc_pnet_add_ib(struct smc_pnetentry *pnetelem, char *name,
u8 ibport)
{
+ struct smc_ib_device *smcibdev = NULL;
+ struct smc_ib_device *dev;
struct smc_pnetentry *p;
int rc = -EEXIST;
@@ -196,10 +199,32 @@ static int smc_pnet_add_ib(struct smc_pnetentry *pnetelem, char *name,
if (pnetelem->ib_name[0] == '\0') {
strncpy(pnetelem->ib_name, name, sizeof(pnetelem->ib_name));
pnetelem->ib_port = ibport;
+ spin_lock(&smc_ib_devices.lock);
+ /* using string ib_name, search smcibdev in global list */
+ list_for_each_entry(dev, &smc_ib_devices.list, list) {
+ if (!strncmp(dev->ibdev->name, pnetelem->ib_name,
+ sizeof(pnetelem->ib_name))) {
+ smcibdev = dev;
+ break;
+ }
+ }
+ spin_unlock(&smc_ib_devices.lock);
rc = 0;
}
out:
write_unlock(&smc_pnettable.lock);
+ if (smcibdev && !smcibdev->initialized) {
+ /* ib dev already existed [dev coldplug].
+ * Complements: smc_ib_add_dev() [dev hotplug],
+ * smc_ib_global_event_handler() [port hotplug].
+ * Function call chain can sleep so outside of our locks.
+ */
+ rc = smc_ib_remember_port_attr(smcibdev,
+ pnetelem->ib_port);
+ if (rc)
+ return rc;
+ smcibdev->initialized = 1;
+ }
return rc;
}
@@ -508,3 +533,76 @@ int __init smc_pnet_init(void)
bad0:
return rc;
}
+
+/* Scan the pnet table and find an IB device given the pnetid entry.
+ * Return infiniband device and port number if an active port is found.
+ * This function is called under smc_pnettable.lock.
+ */
+static void smc_pnet_ib_dev_by_pnet(struct smc_pnetentry *pnetelem,
+ struct smc_ib_device **smcibdev, u8 *ibport)
+{
+ struct smc_ib_device *dev;
+
+ *smcibdev = NULL;
+ *ibport = 0;
+ spin_lock(&smc_ib_devices.lock);
+ /* using string pnetelem->ib_name, search smcibdev in global list */
+ list_for_each_entry(dev, &smc_ib_devices.list, list) {
+ if (!strncmp(dev->ibdev->name, pnetelem->ib_name,
+ sizeof(pnetelem->ib_name)) &&
+ smc_ib_port_active(dev, pnetelem->ib_port)) {
+ *smcibdev = dev;
+ *ibport = pnetelem->ib_port;
+ break;
+ }
+ }
+ spin_unlock(&smc_ib_devices.lock);
+}
+
+/* PNET table analysis for a given sock:
+ * determine ib_device and port belonging to used internal TCP socket
+ * ethernet interface.
+ */
+void smc_pnet_find_roce_resource(struct sock *sk,
+ struct smc_ib_device **smcibdev, u8 *ibport)
+{
+ struct dst_entry *dst = sk_dst_get(sk);
+ struct smc_pnetentry *pnetelem;
+
+ *smcibdev = NULL;
+ *ibport = 0;
+
+ if (!dst)
+ return;
+ if (!dst->dev)
+ goto out_rel;
+ read_lock(&smc_pnettable.lock);
+ list_for_each_entry(pnetelem, &smc_pnettable.pnetlist, list) {
+ if (!strncmp(dst->dev->name, pnetelem->if_name, IFNAMSIZ)) {
+ smc_pnet_ib_dev_by_pnet(pnetelem, smcibdev, ibport);
+ break;
+ }
+ }
+ read_unlock(&smc_pnettable.lock);
+out_rel:
+ dst_release(dst);
+}
+
+/* Returns true if a specific ib_device and port is in the PNET table. */
+bool smc_pnet_exists_in_table(struct smc_ib_device *smcibdev, u8 ibport)
+{
+ struct smc_pnetentry *pnetelem;
+ bool rc = false;
+
+ read_lock(&smc_pnettable.lock);
+ list_for_each_entry(pnetelem, &smc_pnettable.pnetlist, list) {
+ if (!strncmp(smcibdev->ibdev->name, pnetelem->ib_name,
+ IB_DEVICE_NAME_MAX) &&
+ ibport == pnetelem->ib_port) {
+ rc = true;
+ break;
+ }
+ }
+ read_unlock(&smc_pnettable.lock);
+ return rc;
+}
diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h
index 34f85f6..06dc307 100644
--- a/net/smc/smc_pnet.h
+++ b/net/smc/smc_pnet.h
@@ -13,6 +13,14 @@
#define SMC_MAX_PORTS 2 /* Max # of ports */
+#include <net/sock.h>
+
+struct smc_ib_device;
+
+bool smc_pnet_exists_in_table(struct smc_ib_device *smcibdev, u8 ibport);
+void smc_pnet_find_roce_resource(struct sock *sk,
+ struct smc_ib_device **smcibdev, u8 *ibport);
+
int smc_pnet_init(void) __init;
void smc_pnet_exit(void);
--
2.8.4
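
Not part of the patch: a minimal userspace sketch of the identifier layout
that smc_ib_define_local_systemid() in the patch above builds, namely two
random bytes followed by the 6-byte port MAC. The names SYSTEMID_LEN, MAC_LEN
and define_local_systemid() are hypothetical, the 8-byte total is inferred
from the offsets used in the patch, and the caller-supplied rnd bytes stand in
for the kernel's get_random_bytes().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SYSTEMID_LEN 8 /* assumed: 2 random bytes + 6 MAC bytes */
#define MAC_LEN 6

/* Hypothetical helper mirroring the layout used in the patch:
 * bytes 0..1 hold a random value, bytes 2..7 hold the port MAC.
 * rnd replaces get_random_bytes() so the result is reproducible.
 */
static void define_local_systemid(uint8_t id[SYSTEMID_LEN],
                                  const uint8_t mac[MAC_LEN],
                                  const uint8_t rnd[2])
{
	assert(id && mac && rnd);
	memcpy(&id[2], mac, MAC_LEN); /* MAC goes into the last six bytes */
	memcpy(&id[0], rnd, 2);       /* random prefix goes into the first two */
}
```

The random prefix presumably lets separate SMC-R instances that see the same
MAC still end up with different identifiers.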
^ permalink raw reply related