Netdev List


* [PATCH RFC] ipv6: Implement limits on hop by hop and destination options
From: Tom Herbert @ 2017-04-27 20:57 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

RFC 2460 (IPv6) defines hop-by-hop options and destination options
extension headers. Both carry a list of TLVs that is limited only by
the maximum length of the extension header (2048 bytes). By the spec
a host must process all the TLVs in these options; however, these
could be used for a fairly obvious denial-of-service attack. I think
this could in fact be a significant DoS vector on the Internet. One
mitigating factor might be that many firewalls drop all packets with
extension headers (and obviously this affects only IPv6), so an
Internet-wide attack might not be so effective (yet!).
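
For reference, the 2048-byte ceiling follows from the one-byte Hdr Ext Len
field, which counts 8-octet units beyond the first 8 octets. A minimal
user-space sketch of that arithmetic (the helper name ext_hdr_len is made
up for illustration; the patch below open-codes the same expression):

```c
#include <stdint.h>

/* Length in bytes of a hop-by-hop or destination options header,
 * given its Hdr Ext Len field (units of 8 octets, not counting the
 * first 8 octets). Hypothetical helper, for illustration only.
 */
static inline unsigned int ext_hdr_len(uint8_t hdr_ext_len)
{
	return ((unsigned int)hdr_ext_len + 1) << 3;
}
```

With the field at its maximum of 255 this gives (255 + 1) * 8 = 2048 bytes.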

By my calculation, the worst-case packet with TLVs in a standard
1500 byte MTU packet that would be processed by the stack contains
1282 individual TLVs (including pad TLVs) or 724 two-byte TLVs. I
wrote a quick test program that floods a host with these packets,
and sure enough there is substantial time spent in ip6_parse_tlv.
These packets contain nothing but unknown TLVs (that are ignored),
TLV padding, and a bogus UDP header with zero payload length.

  25.38%  [kernel]                    [k] __fib6_clean_all
  21.63%  [kernel]                    [k] ip6_parse_tlv
   4.21%  [kernel]                    [k] __local_bh_enable_ip
   2.18%  [kernel]                    [k] ip6_pol_route.isra.39
   1.98%  [kernel]                    [k] fib6_walk_continue
   1.88%  [kernel]                    [k] _raw_write_lock_bh
   1.65%  [kernel]                    [k] dst_release
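
To make the per-packet cost more concrete, here is a simplified
user-space sketch of the kind of loop a TLV parser has to run. It is
not the kernel's ip6_parse_tlv() (there are no per-type handlers and no
padding validation), and count_tlvs and the OPT_* names are invented for
this illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_PAD1 0	/* one-byte padding option, no length field */
#define OPT_PADN 1	/* multi-byte padding option */

/* Walk the options area of an extension header (the bytes after the
 * Next Header and Hdr Ext Len fields) and count non-padding TLVs.
 * Returns -1 if a TLV runs past the end of the buffer.
 */
static int count_tlvs(const uint8_t *opts, size_t len)
{
	size_t off = 0;
	int count = 0;

	while (off < len) {
		if (opts[off] == OPT_PAD1) {
			off++;
			continue;
		}
		if (len - off < 2)
			return -1;	/* no room for type + length */
		size_t optlen = (size_t)opts[off + 1] + 2;
		if (optlen > len - off)
			return -1;	/* option overruns the header */
		if (opts[off] != OPT_PADN)
			count++;
		off += optlen;
	}
	return count;
}
```

Each two-byte TLV costs one pass through a loop like this, and the real
parser additionally scans its handler table for every non-padding
option, which is presumably where the ip6_parse_tlv time in the profile
comes from.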

This patch adds configurable limits to destination and hop by hop
options. There are three limits that may be set:
  - Limit the number of non-padding TLVs that may be in an extension header
  - Limit the length of a hop by hop or destination options extension header
  - Disallow unknown options

The limits are set in corresponding sysctls:

  ipv6.sysctl.max_dst_opts_cnt
  ipv6.sysctl.max_hbh_opts_cnt
  ipv6.sysctl.max_dst_opts_len
  ipv6.sysctl.max_hbh_opts_len

If a max_*_opts_cnt is less than zero then unknown TLVs are disallowed.
The number of known TLVs that are allowed is the absolute value of
this number.

If a limit is exceeded when processing an extension header the packet is
dropped.
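
The sign convention for max_*_opts_cnt can be sketched in isolation as
follows. This is an illustrative user-space model, not the patch code:
struct tlv_limit, decode_opts_cnt and tlv_allowed are hypothetical names,
and the model ignores the existing handling of unknown options whose type
bits already request a drop.

```c
#include <stdbool.h>

struct tlv_limit {
	int max_count;		/* maximum non-padding TLVs accepted */
	bool disallow_unknowns;	/* drop packets carrying unknown TLVs */
};

/* Decode a max_*_opts_cnt value: a negative value means "unknown TLVs
 * are not allowed" and its absolute value is the count limit.
 * Assumes sysctl_val != INT_MIN, which cannot be negated.
 */
static struct tlv_limit decode_opts_cnt(int sysctl_val)
{
	struct tlv_limit lim = { sysctl_val, false };

	if (sysctl_val < 0) {
		lim.disallow_unknowns = true;
		lim.max_count = -sysctl_val;
	}
	return lim;
}

/* Return true if the n-th non-padding TLV (1-based) may be accepted */
static bool tlv_allowed(const struct tlv_limit *lim, int n, bool known)
{
	if (n > lim->max_count)
		return false;
	if (!known && lim->disallow_unknowns)
		return false;
	return true;
}
```

For example, a setting of 8 accepts up to eight TLVs of any kind, while
-3 accepts at most three TLVs and only if all of them are known.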

Default values are set to 8 for options counts, and set to INT_MAX
for maximum length. Note the choice to limit options to 8 is an
arbitrary guess (roughly based on the fact that the stack supports
three HBH options and just one destination option).

Tested: I've only compiled this code; I'm working on getting a test
environment set up, which is why this is an RFC. If anyone has resources
and time to do some testing or development, let me know!
---
 Documentation/networking/ip-sysctl.txt | 22 +++++++++++++++++
 include/net/ipv6.h                     | 33 +++++++++++++++++++++++++
 include/net/netns/ipv6.h               |  4 ++++
 net/ipv6/af_inet6.c                    |  4 ++++
 net/ipv6/exthdrs.c                     | 44 ++++++++++++++++++++++++++++++----
 net/ipv6/sysctl_net_ipv6.c             | 32 +++++++++++++++++++++++++
 6 files changed, 134 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 974ab47..476a5c5 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1379,6 +1379,28 @@ mld_qrv - INTEGER
 	Default: 2 (as specified by RFC3810 9.1)
 	Minimum: 1 (as specified by RFC6636 4.5)
 
+max_dst_opts_cnt - INTEGER
+	Maximum number of non-padding TLVs allowed in a destination
+	options extension header. If this value is less than zero
+	then unknown options are disallowed and the number of known
+	TLVs allowed is the absolute value of this number.
+
+	Default: 8
+
+max_hbh_opts_cnt - INTEGER
+	Maximum number of non-padding TLVs allowed in a hop by hop
+	options extension header. If this value is less than zero
+	then unknown options are disallowed and the number of known
+	TLVs allowed is the absolute value of this number.
+
+max_dst_opts_len - INTEGER
+	Maximum length allowed for a destination options extension
+	header.
+
+max_hbh_opts_len - INTEGER
+	Maximum length allowed for a hop by hop options extension
+	header.
+
 IPv6 Fragmentation:
 
 ip6frag_high_thresh - INTEGER
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index dbf0abb..9f724ae 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -50,6 +50,39 @@
 #define IPV6_DEFAULT_HOPLIMIT   64
 #define IPV6_DEFAULT_MCASTHOPS	1
 
+/* Limits on hop by hop and destination options.
+ *
+ * Per RFC2460 there is no limit on the maximum number or lengths of TLVs in
+ * hop by hop or destination options other than the packet must fit in an MTU.
+ * We allow configurable limits in order to mitigate potential denial of
+ * service attacks.
+ *
+ * There are three limits that may be set:
+ *   - Limit the number of non-padding TLVs that may be in an extension header
+ *   - Limit the length of a hop by hop or destination options extension header
+ *   - Disallow unknown options
+ *
+ * The limits are set in corresponding sysctls:
+ *
+ * ipv6.sysctl.max_dst_opts_cnt
+ * ipv6.sysctl.max_hbh_opts_cnt
+ * ipv6.sysctl.max_dst_opts_len
+ * ipv6.sysctl.max_hbh_opts_len
+ *
+ * If a max_*_opts_cnt is less than zero then unknown TLVs are disallowed.
+ * The number of known TLVs that are allowed is the absolute value of
+ * this number.
+ *
+ * If a limit is exceeded when processing an extension header the packet is
+ * dropped.
+ */
+
+/* Default limits for hop by hop and destination options */
+#define IP6_DEFAULT_MAX_DST_OPTS_CNT	8
+#define IP6_DEFAULT_MAX_HBH_OPTS_CNT	8
+#define IP6_DEFAULT_MAX_DST_OPTS_LEN	INT_MAX /* No limit */
+#define IP6_DEFAULT_MAX_HBH_OPTS_LEN	INT_MAX /* No limit */
+
 /*
  *	Addr type
  *	
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index de7745e..655bd236 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -36,6 +36,10 @@ struct netns_sysctl_ipv6 {
 	int idgen_retries;
 	int idgen_delay;
 	int flowlabel_state_ranges;
+	int max_dst_opts_cnt;
+	int max_hbh_opts_cnt;
+	int max_dst_opts_len;
+	int max_hbh_opts_len;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index a88b5b5..38e1079 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -807,6 +807,10 @@ static int __net_init inet6_net_init(struct net *net)
 	net->ipv6.sysctl.idgen_retries = 3;
 	net->ipv6.sysctl.idgen_delay = 1 * HZ;
 	net->ipv6.sysctl.flowlabel_state_ranges = 0;
+	net->ipv6.sysctl.max_dst_opts_cnt = IP6_DEFAULT_MAX_DST_OPTS_CNT;
+	net->ipv6.sysctl.max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT;
+	net->ipv6.sysctl.max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN;
+	net->ipv6.sysctl.max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN;
 	atomic_set(&net->ipv6.fib6_sernum, 1);
 
 	err = ipv6_init_mibs(net);
diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index d32e211..d86aebf 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -100,13 +100,22 @@ static bool ip6_tlvopt_unknown(struct sk_buff *skb, int optoff)
 
 /* Parse tlv encoded option header (hop-by-hop or destination) */
 
-static bool ip6_parse_tlv(const struct tlvtype_proc *procs, struct sk_buff *skb)
+static bool ip6_parse_tlv(const struct tlvtype_proc *procs,
+			  struct sk_buff *skb,
+			  int max_count)
 {
 	const struct tlvtype_proc *curr;
 	const unsigned char *nh = skb_network_header(skb);
 	int off = skb_network_header_len(skb);
 	int len = (skb_transport_header(skb)[1] + 1) << 3;
 	int padlen = 0;
+	int tlv_count = 0;
+	bool disallow_unknowns = false;
+
+	if (unlikely(max_count < 0)) {
+		disallow_unknowns = true;
+		max_count = -max_count;
+	}
 
 	if (skb_transport_offset(skb) + len > skb_headlen(skb))
 		goto bad;
@@ -148,6 +157,11 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, struct sk_buff *skb)
 		default: /* Other TLV code so scan list */
 			if (optlen > len)
 				goto bad;
+
+			tlv_count++;
+			if (tlv_count > max_count)
+				goto bad;
+
 			for (curr = procs; curr->type >= 0; curr++) {
 				if (curr->type == nh[off]) {
 					/* type specific length/alignment
@@ -161,7 +175,10 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, struct sk_buff *skb)
 			if (curr->type < 0) {
 				if (ip6_tlvopt_unknown(skb, off) == 0)
 					return false;
+				if (disallow_unknowns)
+					goto bad;
 			}
+
 			padlen = 0;
 			break;
 		}
@@ -260,23 +277,31 @@ static int ipv6_destopt_rcv(struct sk_buff *skb)
 	__u16 dstbuf;
 #endif
 	struct dst_entry *dst = skb_dst(skb);
+	struct net *net = dev_net(skb->dev);
+	int extlen;
 
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + 8) ||
 	    !pskb_may_pull(skb, (skb_transport_offset(skb) +
 				 ((skb_transport_header(skb)[1] + 1) << 3)))) {
+fail_and_free:
 		__IP6_INC_STATS(dev_net(dst->dev), ip6_dst_idev(dst),
 				IPSTATS_MIB_INHDRERRORS);
 		kfree_skb(skb);
 		return -1;
 	}
 
+	extlen = (skb_transport_header(skb)[1] + 1) << 3;
+	if (extlen > net->ipv6.sysctl.max_dst_opts_len)
+		goto fail_and_free;
+
 	opt->lastopt = opt->dst1 = skb_network_header_len(skb);
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 	dstbuf = opt->dst1;
 #endif
 
-	if (ip6_parse_tlv(tlvprocdestopt_lst, skb)) {
-		skb->transport_header += (skb_transport_header(skb)[1] + 1) << 3;
+	if (ip6_parse_tlv(tlvprocdestopt_lst, skb,
+			  net->ipv6.sysctl.max_dst_opts_cnt)) {
+		skb->transport_header += extlen;
 		opt = IP6CB(skb);
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 		opt->nhoff = dstbuf;
@@ -804,6 +829,8 @@ static const struct tlvtype_proc tlvprochopopt_lst[] = {
 int ipv6_parse_hopopts(struct sk_buff *skb)
 {
 	struct inet6_skb_parm *opt = IP6CB(skb);
+	struct net *net = dev_net(skb->dev);
+	int extlen;
 
 	/*
 	 * skb_network_header(skb) is equal to skb->data, and
@@ -818,9 +845,16 @@ int ipv6_parse_hopopts(struct sk_buff *skb)
 		return -1;
 	}
 
+	extlen = (skb_transport_header(skb)[1] + 1) << 3;
+	if (extlen > net->ipv6.sysctl.max_hbh_opts_len) {
+		kfree_skb(skb);
+		return -1;
+	}
+
 	opt->flags |= IP6SKB_HOPBYHOP;
-	if (ip6_parse_tlv(tlvprochopopt_lst, skb)) {
-		skb->transport_header += (skb_transport_header(skb)[1] + 1) << 3;
+	if (ip6_parse_tlv(tlvprochopopt_lst, skb,
+			  net->ipv6.sysctl.max_hbh_opts_cnt)) {
+		skb->transport_header += extlen;
 		opt = IP6CB(skb);
 		opt->nhoff = sizeof(struct ipv6hdr);
 		return 1;
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 69c50e7..054cabe 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -90,6 +90,34 @@ static struct ctl_table ipv6_table_template[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "max_dst_opts_number",
+		.data		= &init_net.ipv6.sysctl.max_dst_opts_cnt,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
+		.procname	= "max_hbh_opts_number",
+		.data		= &init_net.ipv6.sysctl.max_hbh_opts_cnt,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
+		.procname	= "max_dst_opts_length",
+		.data		= &init_net.ipv6.sysctl.max_dst_opts_len,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
+		.procname	= "max_hbh_length",
+		.data		= &init_net.ipv6.sysctl.max_hbh_opts_len,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
@@ -149,6 +177,10 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
 	ipv6_table[6].data = &net->ipv6.sysctl.idgen_delay;
 	ipv6_table[7].data = &net->ipv6.sysctl.flowlabel_state_ranges;
 	ipv6_table[8].data = &net->ipv6.sysctl.ip_nonlocal_bind;
+	ipv6_table[9].data = &net->ipv6.sysctl.max_dst_opts_cnt;
+	ipv6_table[10].data = &net->ipv6.sysctl.max_hbh_opts_cnt;
+	ipv6_table[11].data = &net->ipv6.sysctl.max_dst_opts_len;
+	ipv6_table[12].data = &net->ipv6.sysctl.max_hbh_opts_len;
 
 	ipv6_route_table = ipv6_route_sysctl_init(net);
 	if (!ipv6_route_table)
-- 
2.7.4

* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Jason Gunthorpe @ 2017-04-27 20:53 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Boris Ostrovsky, linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	linux-media-u79uwXL29TY76Z2rM5mHXA, Juergen Gross, Julien Grall,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
	Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <df6586e2-7d45-6b0b-facb-4dea882df06e-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>

On Thu, Apr 27, 2017 at 02:19:24PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 26/04/17 01:37 AM, Roger Pau Monné wrote:
> > On Tue, Apr 25, 2017 at 12:21:02PM -0600, Logan Gunthorpe wrote:
> >> Straightforward conversion to the new helper, except due to the lack
> >> of error path, we have to use SG_MAP_MUST_NOT_FAIL which may BUG_ON in
> >> certain cases in the future.
> >>
> >> Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
> >> Cc: Boris Ostrovsky <boris.ostrovsky-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> >> Cc: Juergen Gross <jgross-IBi9RG/b67k@public.gmane.org>
> >> Cc: Konrad Rzeszutek Wilk <konrad.wilk-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> >> Cc: "Roger Pau Monné" <roger.pau-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> >> ---
> >>  drivers/block/xen-blkfront.c | 20 +++++++++++---------
> >>  1 file changed, 11 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> >> index 3945963..ed62175 100644
> >> --- a/drivers/block/xen-blkfront.c
> >> +++ b/drivers/block/xen-blkfront.c
> >> @@ -816,8 +816,9 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
> >>  		BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> >>  
> >>  		if (setup.need_copy) {
> >> -			setup.bvec_off = sg->offset;
> >> -			setup.bvec_data = kmap_atomic(sg_page(sg));
> >> +			setup.bvec_off = 0;
> >> +			setup.bvec_data = sg_map(sg, 0, SG_KMAP_ATOMIC |
> >> +						 SG_MAP_MUST_NOT_FAIL);
> > 
> > I assume that sg_map already adds sg->offset to the address?
> 
> Correct.
> 
> > Also wondering whether we can get rid of bvec_off and just increment bvec_data,
> > adding Julien who IIRC added this code.
> 
> bvec_off is used to keep track of the offset within the current mapping
> so it's not a great idea given that you'd want to kunmap_atomic the
> original address and not something with an offset. It would be nice if
> this could be converted to use the sg_miter interface but that's a much
> more invasive change that would require someone who knows this code and
> can properly test it. I'd be very grateful if someone actually took that on.

blkfront is one of the drivers I looked at, and it appears to only be
memcpying with the bvec_data pointer, so I wonder why it does not use
sg_copy_X_buffer instead..

Jason

* Re: [PATCH net-next 0/5] qed*: PTP enhancements.
From: David Miller @ 2017-04-27 20:52 UTC (permalink / raw)
  To: sudarsana.kalluru; +Cc: richardcochran, netdev, Yuval.Mintz
In-Reply-To: <20170426160053.8356-1-sudarsana.kalluru@cavium.com>

From: Sudarsana Reddy Kalluru <sudarsana.kalluru@cavium.com>
Date: Wed, 26 Apr 2017 09:00:48 -0700

> From: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
> 
> The patch series contains a set of enhancements for the qed/qede PTP
> implementation.
> Patches (1)-(3) add a resource locking implementation to allow
> PTP functionality only on the first detected ethernet PF of the port.
> The change is required as the adapter currently supports only one
> instance of the PTP client on a given port.
> Patch (4) removes an unneeded header file.
> Patch (5) moves the ptt-lock get/release logic to the PTP-specific
> code.
> 
> Please consider applying this series to "net-next" branch.

Series applied, thanks!

* Re: [PATCH net-next] net: vrf: Do not allow looback to be moved to a VRF
From: David Miller @ 2017-04-27 20:50 UTC (permalink / raw)
  To: dsa; +Cc: netdev, rshearma
In-Reply-To: <1493218702-10906-1-git-send-email-dsa@cumulusnetworks.com>

From: David Ahern <dsa@cumulusnetworks.com>
Date: Wed, 26 Apr 2017 07:58:22 -0700

> Moving the loopback into a VRF breaks networking for the default VRF.
> Since the VRF device is the loopback for VRF domains, there is no
> reason to move the loopback. Given the repercussions, block attempts
> to set lo into a VRF.
> 
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

Applied, thanks David.

* assembler mnemonics for call/tailcall plus maps...
From: David Miller @ 2017-04-27 20:42 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, xdp-newbies

Can you guys give me some kind of idea of how it might be nice to
represent calls and tailcalls in assembler files?

And also the emission of maps.

Right now I just have the assembler looking for 32-bit immediate
values for call and tailcall instructions.

Looking at samples/bpf/sockex3_kern.c we have:

struct bpf_map_def SEC("maps") jmp_table = {
	.type = BPF_MAP_TYPE_PROG_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(u32),
	.max_entries = 8,
};

#define PARSE_VLAN 1
#define PARSE_MPLS 2
#define PARSE_IP 3
#define PARSE_IPV6 4

/* protocol dispatch routine.
 * It tail-calls next BPF program depending on eth proto
 * Note, we could have used:
 * bpf_tail_call(skb, &jmp_table, proto);
 * but it would need large prog_array
 */
static inline void parse_eth_proto(struct __sk_buff *skb, u32 proto)
{
	switch (proto) {
	case ETH_P_8021Q:
	case ETH_P_8021AD:
		bpf_tail_call(skb, &jmp_table, PARSE_VLAN);
		break;
	case ETH_P_MPLS_UC:
	case ETH_P_MPLS_MC:
		bpf_tail_call(skb, &jmp_table, PARSE_MPLS);
		break;
	case ETH_P_IP:
		bpf_tail_call(skb, &jmp_table, PARSE_IP);
		break;
	case ETH_P_IPV6:
		bpf_tail_call(skb, &jmp_table, PARSE_IPV6);
		break;
	}
}

and these bpf_tail_call() invocations seem to expand to something like:

	call	1 ! PARSE_VLAN
	call	2 ! PARSE_MPLS
	call	3 ! PARSE_IP
	call	4 ! PARSE_IPV6

in the resultant ELF file.
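
For context, a call in eBPF is a single 8-byte instruction whose 32-bit
immediate selects what is called, so an assembler mostly has to fill in
the opcode and the immediate. A rough sketch of that encoding follows.
The struct layout and opcode constants are redefined locally to mirror
the UAPI definitions rather than included from them, and emit_call is an
invented name, not part of any existing assembler:

```c
#include <stdint.h>

/* Local mirror of the eBPF instruction layout (compare struct bpf_insn
 * in the kernel UAPI headers); redefined so the sketch stands alone.
 */
struct insn {
	uint8_t  code;		/* opcode */
	uint8_t  dst_reg:4;	/* destination register */
	uint8_t  src_reg:4;	/* source register */
	int16_t  off;		/* signed offset */
	int32_t  imm;		/* signed immediate */
};

#define CLS_JMP	0x05	/* BPF_JMP instruction class */
#define OP_CALL	0x80	/* BPF_CALL operation */

/* Hypothetical assembler helper: encode "call <imm>" */
static struct insn emit_call(int32_t imm)
{
	struct insn i = { .code = CLS_JMP | OP_CALL, .imm = imm };
	return i;
}
```

Whatever mnemonic syntax is chosen, a line such as "call 1" would then
reduce to emit_call(1), with all other fields left at zero.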

Thanks.

* Re: [PATCH net-next] fib_rules: fix error return code
From: David Miller @ 2017-04-27 20:36 UTC (permalink / raw)
  To: weiyj.lk
  Cc: lorenzo, idosch, dsa, mateusz.bajorski, johannes.berg,
	weiyongjun1, netdev
In-Reply-To: <20170426140350.25451-1-weiyj.lk@gmail.com>

From: Wei Yongjun <weiyj.lk@gmail.com>
Date: Wed, 26 Apr 2017 14:03:50 +0000

> From: Wei Yongjun <weiyongjun1@huawei.com>
> 
> Fix to return error code -EINVAL from the error handling
> case instead of 0, as done elsewhere in this function.
> 
> Fixes: 622ec2c9d524 ("net: core: add UID to flows, rules, and routes")
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>

Applied, thank you.

* Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER
From: Casey Leedom @ 2017-04-27 20:34 UTC (permalink / raw)
  To: Bjorn Helgaas, Alexander Duyck
  Cc: Ding Tianhong, Mark Rutland, Amir Ancel, Gabriele Paoloni,
	linux-pci@vger.kernel.org, Catalin Marinas, Will Deacon, LinuxArm,
	David Laight, jeffrey.t.kirsher@intel.com, netdev@vger.kernel.org,
	Robin Murphy, davem@davemloft.net,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <20170427171938.GA10705@bhelgaas-glaptop.roam.corp.google.com>

| From: Bjorn Helgaas <helgaas@kernel.org>
| Sent: Thursday, April 27, 2017 10:19 AM
|
| Are you hinting that the PCI core or arch code could actually *enable*
| Relaxed Ordering without the driver doing anything?  Is it safe to do that?
| Is there such a thing as a device that is capable of using RO, but where the
| driver must be aware of it being enabled, so it programs the device
| appropriately?

  I forgot to reply to this portion of Bjorn's email.

  The PCI Configuration Space PCI Capability Device Control[Enable Relaxed
Ordering] bit governs enabling the _ability_ for the PCIe Device to send
TLPs with the Relaxed Ordering Attribute set.  It does not _cause_ RO to be
set on TLPs.  Doing that would almost certainly cause Data Corruption Bugs
since you only want a subset of TLPs to have RO set.

  For instance, we typically use RO for Ingress Packet Data delivery but
non-RO for messages notifying the Host that an Ingress Packet has been
delivered.  This ensures that the "Ingress Packet Delivered" non-RO TLP is
processed _after_ any preceding RO TLPs delivering the actual Ingress Packet
Data.

  In the above scenario, if one were to turn off Enable Relaxed Ordering via
the PCIe Capability, then the on-chip PCIe engine would simply never send a
TLP with the Relaxed Ordering Attribute set, regardless of any other chip
programming.

  And finally, just to be absolutely clear, using Relaxed Ordering isn't an
"Architecture Thing".  It's a PCIe Fabric End Point Thing.  Many End Points
simply ignore the Relaxed Ordering Attribute (except to reflect it back in
Response TLPs).  In this sense, Relaxed Ordering simply provides
potentially useful optimization information to the PCIe End Point.
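
  The "enable is not the same as use" point can be modelled in a few lines.
This is only an illustrative sketch: tlp_sets_ro and the wants_ro flag are
invented names standing in for a device's internal per-TLP decision, and
the bit value matches PCI_EXP_DEVCTL_RELAX_EN in the Linux headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Enable Relaxed Ordering bit in the PCIe Device Control register
 * (same value as PCI_EXP_DEVCTL_RELAX_EN in the Linux headers).
 */
#define DEVCTL_RELAX_EN 0x0010

/* Decide whether a TLP goes out with the RO attribute set.
 * wants_ro is the device's own per-TLP choice (e.g. true for packet
 * data, false for "packet delivered" notifications); the config-space
 * bit can only veto that choice, never force RO on.
 */
static bool tlp_sets_ro(uint16_t devctl, bool wants_ro)
{
	return wants_ro && (devctl & DEVCTL_RELAX_EN);
}
```

  With the enable bit set, data TLPs may carry RO while notification TLPs
still do not; with the bit clear, nothing carries RO.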

Casey

* Re: [PATCH v2 net-next] bridge: add per-port broadcast flood flag
From: David Miller @ 2017-04-27 20:34 UTC (permalink / raw)
  To: mmanning; +Cc: netdev, nikolay
In-Reply-To: <1493214489-9921-1-git-send-email-mmanning@brocade.com>

From: Mike Manning <mmanning@brocade.com>
Date: Wed, 26 Apr 2017 14:48:09 +0100

> Support for l2 multicast flood control was added in commit b6cb5ac8331b
> ("net: bridge: add per-port multicast flood flag"). It allows broadcast
> as it was introduced specifically for unknown multicast flood control.
> But as broadcast is a special case of multicast, this may also need to
> be disabled. For this purpose, introduce a flag to disable the flooding
> of received l2 broadcasts. This approach is backwards compatible and
> provides flexibility in filtering for the desired packet types.
> 
> Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Signed-off-by: Mike Manning <mmanning@brocade.com>

Applied, thanks for following up on this.

* Re: [PATCH] net: fib: Decrease one unnecessary rt cache flush in fib_disable_ip
From: David Miller @ 2017-04-27 20:33 UTC (permalink / raw)
  To: gfree.wind; +Cc: netdev, fgao
In-Reply-To: <1493204644-88477-1-git-send-email-gfree.wind@foxmail.com>

From: gfree.wind@foxmail.com
Date: Wed, 26 Apr 2017 19:04:04 +0800

> From: Gao Feng <fgao@ikuai8.com>
> 
> The func fib_flush already flushes the rt cache if necessary, so it
> is not necessary to invoke rt_cache_flush again in fib_disable_ip.
> 
> Signed-off-by: Gao Feng <fgao@ikuai8.com>

Looks good, applied to net-next, thanks!

* Re: [PATCH net-next] l2tp: remove useless device duplication test in l2tp_eth_create()
From: David Miller @ 2017-04-27 20:32 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <95d793ebf07f9ec7aefa496ed2b1432aa9dc8aa8.1493199657.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Wed, 26 Apr 2017 11:54:47 +0200

> There's no need to verify that cfg->ifname is unique at this point.
> register_netdev() will return -EEXIST if asked to create a device with
> a name that's already in use.
> 
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>

Yep that's right, applied, thanks!

* Re: [net-next] net: remove unnecessary carrier status check
From: David Miller @ 2017-04-27 20:31 UTC (permalink / raw)
  To: zhangshengju; +Cc: netdev
In-Reply-To: <1493200178-6413-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Wed, 26 Apr 2017 17:49:38 +0800

> Since netif_carrier_on() does nothing if the device's carrier is already
> on, it's unnecessary to check the carrier status first.
> 
> It's the same for netif_carrier_off().
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied, thank you.
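
The reason the caller-side check is redundant is that the "on" operation
is already a test-and-clear on a no-carrier flag. A user-space model of
that idempotent pattern, with struct fake_dev and carrier_on as stand-in
names rather than the kernel's own definitions:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct fake_dev {
	atomic_bool no_carrier;	/* stand-in for the no-carrier state bit */
	int events;		/* link-change events fired */
};

/* Turn the carrier on; only the call that actually clears the flag
 * fires a link-change event, so repeated calls are harmless.
 */
static void carrier_on(struct fake_dev *d)
{
	/* atomic_exchange returns the previous value of the flag */
	if (atomic_exchange(&d->no_carrier, false))
		d->events++;
}
```

Calling carrier_on() twice in a row fires a single event, so wrapping the
call in an "is the carrier already on?" test adds nothing.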

* Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
From: kbuild test robot @ 2017-04-27 20:30 UTC (permalink / raw)
  To: Marco Chiappero
  Cc: kbuild-all, netdev, David S . Miller, Jeff Kirsher,
	Alexander Duyck, Sainath Grandhi, Mahesh Bandewar,
	Marco Chiappero
In-Reply-To: <20170427145142.15830-10-marco.chiappero@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

Hi Marco,

[auto build test ERROR on net/master]
[also build test ERROR on v4.11-rc8]
[cannot apply to net-next/master next-20170427]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Marco-Chiappero/support-unique-MAC-addresses-for-slave-devices/20170428-022313
config: tile-allmodconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 4.6.2
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=tile 

All errors (new ones prefixed by >>):

   drivers/net//ipvlan/ipvlan_core.c: In function 'ipvlan_proxy_l2_update_icmp6':
>> drivers/net//ipvlan/ipvlan_core.c:246:5: error: implicit declaration of function 'csum_ipv6_magic'
   cc1: some warnings being treated as errors

vim +/csum_ipv6_magic +246 drivers/net//ipvlan/ipvlan_core.c

   240				struct icmp6hdr *icmph = icmp6_hdr(skb);
   241				u32 len = ntohs(ip6h->payload_len);
   242	
   243				memcpy(nd_opt + 1, master->dev_addr, master->addr_len);
   244				icmph->icmp6_cksum = 0;
   245				icmph->icmp6_cksum =
 > 246					csum_ipv6_magic(&ip6h->saddr,
   247							&ip6h->daddr, len,
   248							IPPROTO_ICMPV6,
   249							csum_partial(icmph, len, 0));

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 48036 bytes --]

* Re: [net-next] net: update comment for netif_dormant() function
From: David Miller @ 2017-04-27 20:23 UTC (permalink / raw)
  To: zhangshengju; +Cc: netdev
In-Reply-To: <1493175912-3391-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Wed, 26 Apr 2017 11:05:12 +0800

> This patch updates the comment for netif_dormant() function to reflect
> the intended usage.
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied, thank you.

* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Logan Gunthorpe @ 2017-04-27 20:19 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Boris Ostrovsky, dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	linux-media-u79uwXL29TY76Z2rM5mHXA, Juergen Gross, Julien Grall,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
	Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <20170426073720.okv33ly2ldepilti-aUbyMND+kyB2Oba8jWPag5QscXo+jHNAQQ4Iyu8u01E@public.gmane.org>



On 26/04/17 01:37 AM, Roger Pau Monné wrote:
> On Tue, Apr 25, 2017 at 12:21:02PM -0600, Logan Gunthorpe wrote:
>> Straightforward conversion to the new helper, except due to the lack
>> of error path, we have to use SG_MAP_MUST_NOT_FAIL which may BUG_ON in
>> certain cases in the future.
>>
>> Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
>> Cc: Boris Ostrovsky <boris.ostrovsky-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>> Cc: Juergen Gross <jgross-IBi9RG/b67k@public.gmane.org>
>> Cc: Konrad Rzeszutek Wilk <konrad.wilk-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>> Cc: "Roger Pau Monné" <roger.pau-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
>> ---
>>  drivers/block/xen-blkfront.c | 20 +++++++++++---------
>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 3945963..ed62175 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -816,8 +816,9 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
>>  		BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>>  
>>  		if (setup.need_copy) {
>> -			setup.bvec_off = sg->offset;
>> -			setup.bvec_data = kmap_atomic(sg_page(sg));
>> +			setup.bvec_off = 0;
>> +			setup.bvec_data = sg_map(sg, 0, SG_KMAP_ATOMIC |
>> +						 SG_MAP_MUST_NOT_FAIL);
> 
> I assume that sg_map already adds sg->offset to the address?

Correct.

> Also wondering whether we can get rid of bvec_off and just increment bvec_data,
> adding Julien who IIRC added this code.

bvec_off is used to keep track of the offset within the current mapping
so it's not a great idea given that you'd want to kunmap_atomic the
original address and not something with an offset. It would be nice if
this could be converted to use the sg_miter interface but that's a much
more invasive change that would require someone who knows this code and
can properly test it. I'd be very grateful if someone actually took that on.
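
To illustrate the point about keeping the original address around, here
is a small user-space analogy. struct map_cursor and cursor_read are
hypothetical names and plain memory stands in for an atomic kmap; the
only thing being shown is "fixed base plus moving offset":

```c
#include <stddef.h>
#include <string.h>

/* Track a mapping as an unchanging base plus a moving offset, so the
 * base can later be handed back to the unmap routine unmodified.
 */
struct map_cursor {
	unsigned char *base;	/* address returned by the map call */
	size_t off;		/* bytes already consumed */
};

/* Copy len bytes out of the mapping and advance the cursor */
static void cursor_read(struct map_cursor *c, void *dst, size_t len)
{
	memcpy(dst, c->base + c->off, len);
	c->off += len;
}
```

If the code instead incremented the data pointer itself, the address
needed for the unmap would have to be recomputed or stored separately
anyway.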

Logan

* Re: [PATCH net-next 11/18] net: dsa: mv88e6xxx: get STU entry on VTU GetNext
From: Andrew Lunn @ 2017-04-27 20:17 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20170426155336.5937-12-vivien.didelot@savoirfairelinux.com>

On Wed, Apr 26, 2017 at 11:53:29AM -0400, Vivien Didelot wrote:
> Now that the code reads both VTU and STU data on VTU GetNext operation,
> fetch the STU entry data of a VTU entry at the same time.
> 
> The STU data bits are masked with the VTU data bits and they are now all
> read at the same time a VTU GetNext operation is issued.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

* Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions
From: Logan Gunthorpe @ 2017-04-27 20:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	linux-media-u79uwXL29TY76Z2rM5mHXA,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
	Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <20170426074416.GA7936-jcswGhMUV9g@public.gmane.org>


On 26/04/17 01:44 AM, Christoph Hellwig wrote:
> I think we'll at least need a draft of those to make sense of these
> patches.  Otherwise they just look very clumsy.

Ok, what follows is a draft patch attempting to show where I'm thinking
of going with this. Obviously it will not compile because it assumes
the users throughout the kernel are a bit different than they are today.
Notably, there is no sg_page anymore.

There's also likely a ton of issues and arguments to have over a bunch
of the specifics below and I'd expect the concept to evolve more
as cleanup occurs. This itself is an evolution of the draft I posted
replying to you in my last RFC thread.

Also, before any of this is truly useful to us, pfn_t would have to
infect a few other places in the kernel.

Thanks,

Logan


diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index fad170b..85ef928 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -6,13 +6,14 @@
 #include <linux/bug.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
+#include <linux/pfn_t.h>
 #include <asm/io.h>

 struct scatterlist {
 #ifdef CONFIG_DEBUG_SG
 	unsigned long	sg_magic;
 #endif
-	unsigned long	page_link;
+	pfn_t  		pfn;
 	unsigned int	offset;
 	unsigned int	length;
 	dma_addr_t	dma_address;
@@ -60,15 +61,68 @@ struct sg_table {

 #define SG_MAGIC	0x87654321

-/*
- * We overload the LSB of the page pointer to indicate whether it's
- * a valid sg entry, or whether it points to the start of a new scatterlist.
- * Those low bits are there for everyone! (thanks mason :-)
- */
-#define sg_is_chain(sg)		((sg)->page_link & 0x01)
-#define sg_is_last(sg)		((sg)->page_link & 0x02)
-#define sg_chain_ptr(sg)	\
-	((struct scatterlist *) ((sg)->page_link & ~0x03))
+static inline bool sg_is_chain(struct scatterlist *sg)
+{
+	return sg->pfn.val & PFN_SG_CHAIN;
+}
+
+static inline bool sg_is_last(struct scatterlist *sg)
+{
+	return sg->pfn.val & PFN_SG_LAST;
+}
+
+static inline struct scatterlist *sg_chain_ptr(struct scatterlist *sg)
+{
+	unsigned long sgl = pfn_t_to_pfn(sg->pfn);
+	return (struct scatterlist *)(sgl << PAGE_SHIFT);
+}
+
+static inline bool sg_is_iomem(struct scatterlist *sg)
+{
+	return pfn_t_is_iomem(sg->pfn);
+}
+
+/**
+ * sg_assign_pfn - Assign a given pfn_t to an SG entry
+ * @sg:		    SG entry
+ * @pfn:	    The pfn
+ *
+ * Description:
+ *   Assign a pfn to sg entry. Also see sg_set_pfn(), the most commonly used
+ *   variant.
+ *
+ **/
+static inline void sg_assign_pfn(struct scatterlist *sg, pfn_t pfn)
+{
+#ifdef CONFIG_DEBUG_SG
+	BUG_ON(sg->sg_magic != SG_MAGIC);
+	BUG_ON(sg_is_chain(sg));
+	BUG_ON(pfn.val & (PFN_SG_CHAIN | PFN_SG_LAST));
+#endif
+
+	sg->pfn = pfn;
+}
+
+/**
+ * sg_set_pfn - Set sg entry to point at given pfn
+ * @sg:		 SG entry
+ * @pfn:	 The pfn
+ * @len:	 Length of data
+ * @offset:	 Offset into page
+ *
+ * Description:
+ *   Use this function to set an sg entry pointing at a pfn, never assign
+ *   the page directly. We encode sg table information in the lower bits
+ *   of the page pointer. See sg_pfn_t for looking up the pfn_t belonging
+ *   to an sg entry.
+ **/
+static inline void sg_set_pfn(struct scatterlist *sg, pfn_t pfn,
+			      unsigned int len, unsigned int offset)
+{
+	sg_assign_pfn(sg, pfn);
+	sg->offset = offset;
+	sg->length = len;
+}

 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -82,18 +136,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-	unsigned long page_link = sg->page_link & 0x3;
+	if (!page) {
+		pfn_t null_pfn = {0};
+		sg_assign_pfn(sg, null_pfn);
+		return;
+	}

-	/*
-	 * In order for the low bit stealing approach to work, pages
-	 * must be aligned at a 32-bit boundary as a minimum.
-	 */
-	BUG_ON((unsigned long) page & 0x03);
-#ifdef CONFIG_DEBUG_SG
-	BUG_ON(sg->sg_magic != SG_MAGIC);
-	BUG_ON(sg_is_chain(sg));
-#endif
-	sg->page_link = page_link | (unsigned long) page;
+	sg_assign_pfn(sg, page_to_pfn_t(page));
 }

 /**
@@ -106,8 +155,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
  * Description:
  *   Use this function to set an sg entry pointing at a page, never assign
  *   the page directly. We encode sg table information in the lower bits
- *   of the page pointer. See sg_page() for looking up the page belonging
- *   to an sg entry.
+ *   of the page pointer.
  *
  **/
 static inline void sg_set_page(struct scatterlist *sg, struct page *page,
@@ -118,13 +166,53 @@ static inline void sg_set_page(struct scatterlist *sg, struct page *page,
 	sg->length = len;
 }

-static inline struct page *sg_page(struct scatterlist *sg)
+/**
+ * sg_pfn_t - Return the pfn_t for the sg
+ * @sg:		 SG entry
+ *
+ **/
+static inline pfn_t sg_pfn_t(struct scatterlist *sg)
 {
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
 #endif
-	return (struct page *)((sg)->page_link & ~0x3);
+
+	return sg->pfn;
+}
+
+/**
+ * sg_to_mappable_page - Try to return a struct page safe for general
+ *	use in the kernel
+ * @sg:		 SG entry
+ * @ret:	 A pointer to the returned page
+ *
+ * Description:
+ *   If possible, return a mappable page that's safe for use around the
+ *   kernel. Should only be used in legacy situations. sg_pfn_t() is a
+ *   better choice for new code. This is deliberately more awkward than
+ *   the old sg_page to enforce the __must_check rule and discourage future
+ *   use.
+ *
+ *   An example where this is required is in nvme-fabrics: a page from an
+ *   sgl is placed into a bio. This function would be required until we can
+ *   convert bios to use pfn_t as well. Similar issues with skbs, etc.
+ **/
+static inline __must_check int sg_to_mappable_page(struct scatterlist *sg,
+						   struct page **ret)
+{
+	struct page *pg;
+
+	if (unlikely(sg_is_iomem(sg)))
+		return -EFAULT;
+
+	pg = pfn_t_to_page(sg->pfn);
+	if (unlikely(!pg))
+		return -EFAULT;
+
+	*ret = pg;
+
+	return 0;
 }

 #define SG_KMAP		     (1 << 0)	/* create a mapping with kmap */
@@ -167,8 +255,19 @@ static inline void *sg_map(struct scatterlist *sg, size_t offset, int flags)
 	unsigned int pg_off;
 	void *ret;

+	if (unlikely(sg_is_iomem(sg))) {
+		ret = ERR_PTR(-EFAULT);
+		goto out;
+	}
+
+	pg = pfn_t_to_page(sg->pfn);
+	if (unlikely(!pg)) {
+		ret = ERR_PTR(-EFAULT);
+		goto out;
+	}
+
 	offset += sg->offset;
-	pg = nth_page(sg_page(sg), offset >> PAGE_SHIFT);
+	pg = nth_page(pg, offset >> PAGE_SHIFT);
 	pg_off = offset_in_page(offset);

 	if (flags & SG_KMAP_ATOMIC)
@@ -178,12 +277,7 @@ static inline void *sg_map(struct scatterlist *sg, size_t offset, int flags)
 	else
 		ret = ERR_PTR(-EINVAL);

-	/*
-	 * In theory, this can't happen yet. Once we start adding
-	 * unmapable memory, it also shouldn't happen unless developers
-	 * start putting unmappable struct pages in sgls and passing
-	 * it to code that doesn't support it.
-	 */
+out:
 	BUG_ON(flags & SG_MAP_MUST_NOT_FAIL && IS_ERR(ret));

 	return ret;
@@ -202,9 +296,15 @@ static inline void *sg_map(struct scatterlist *sg, size_t offset, int flags)
 static inline void sg_unmap(struct scatterlist *sg, void *addr,
 			    size_t offset, int flags)
 {
-	struct page *pg = nth_page(sg_page(sg), offset >> PAGE_SHIFT);
+	struct page *pg;
 	unsigned int pg_off = offset_in_page(offset);

+	pg = pfn_t_to_page(sg->pfn);
+	if (unlikely(!pg))
+		return;
+
+	pg = nth_page(pg, offset >> PAGE_SHIFT);
+
 	if (flags & SG_KMAP_ATOMIC)
 		kunmap_atomic(addr - sg->offset - pg_off);
 	else if (flags & SG_KMAP)
@@ -246,17 +346,18 @@ static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
 static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
 			    struct scatterlist *sgl)
 {
+	pfn_t pfn;
+	unsigned long _sgl = (unsigned long) sgl;
+
 	/*
 	 * offset and length are unused for chain entry.  Clear them.
 	 */
 	prv[prv_nents - 1].offset = 0;
 	prv[prv_nents - 1].length = 0;

-	/*
-	 * Set lowest bit to indicate a link pointer, and make sure to clear
-	 * the termination bit if it happens to be set.
-	 */
-	prv[prv_nents - 1].page_link = ((unsigned long) sgl | 0x01) & ~0x02;
+	BUG_ON(_sgl & ~PAGE_MASK);
+	pfn = __pfn_to_pfn_t(_sgl >> PAGE_SHIFT, PFN_SG_CHAIN);
+	prv[prv_nents - 1].pfn = pfn;
 }

 /**
@@ -276,8 +377,8 @@ static inline void sg_mark_end(struct scatterlist *sg)
 	/*
 	 * Set termination bit, clear potential chain bit
 	 */
-	sg->page_link |= 0x02;
-	sg->page_link &= ~0x01;
+	sg->pfn.val |= PFN_SG_LAST;
+	sg->pfn.val &= ~PFN_SG_CHAIN;
 }

 /**
@@ -293,7 +394,7 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 #endif
-	sg->page_link &= ~0x02;
+	sg->pfn.val &= ~PFN_SG_LAST;
 }

 /**
@@ -301,14 +402,13 @@ static inline void sg_unmark_end(struct scatterlist *sg)
  * @sg:	     SG entry
  *
  * Description:
- *   This calls page_to_phys() on the page in this sg entry, and adds the
- *   sg offset. The caller must know that it is legal to call page_to_phys()
- *   on the sg page.
+ *   This calls pfn_t_to_phys() on the pfn in this sg entry, and adds the
+ *   sg offset.
  *
  **/
 static inline dma_addr_t sg_phys(struct scatterlist *sg)
 {
-	return page_to_phys(sg_page(sg)) + sg->offset;
+	return pfn_t_to_phys(sg->pfn) + sg->offset;
 }

 /**
@@ -323,7 +423,12 @@ static inline dma_addr_t sg_phys(struct scatterlist *sg)
  **/
 static inline void *sg_virt(struct scatterlist *sg)
 {
-	return page_address(sg_page(sg)) + sg->offset;
+	struct page *pg = pfn_t_to_page(sg->pfn);
+
+	BUG_ON(sg_is_iomem(sg));
+	BUG_ON(!pg);
+
+	return page_address(pg) + sg->offset;
 }

 int sg_nents(struct scatterlist *sg);
@@ -422,10 +527,18 @@ void __sg_page_iter_start(struct sg_page_iter *piter,
 /**
  * sg_page_iter_page - get the current page held by the page iterator
  * @piter:	page iterator holding the page
+ *
+ * This function will require some cleanup. Some users simply mark
+ * attributes of the pages which are fine, others actually map it and
+ * will require some safety there.
  */
 static inline struct page *sg_page_iter_page(struct sg_page_iter *piter)
 {
-	return nth_page(sg_page(piter->sg), piter->sg_pgoffset);
+	struct page *pg = pfn_t_to_page(piter->sg->pfn);
+	if (!pg)
+		return NULL;
+
+	return nth_page(pg, piter->sg_pgoffset);
 }

 /**
@@ -468,11 +581,13 @@ static inline dma_addr_t sg_page_iter_dma_address(struct sg_page_iter *piter)
 #define SG_MITER_ATOMIC		(1 << 0)	 /* use kmap_atomic */
 #define SG_MITER_TO_SG		(1 << 1)	/* flush back to phys on unmap */
 #define SG_MITER_FROM_SG	(1 << 2)	/* nop */
+#define SG_MITER_SUPPORTS_IOMEM (1 << 3)        /* iteratee supports iomem */

 struct sg_mapping_iter {
 	/* the following three fields can be accessed directly */
 	struct page		*page;		/* currently mapped page */
 	void			*addr;		/* pointer to the mapped area */
+	void __iomem            *ioaddr;        /* pointer to iomem */
 	size_t			length;		/* length of the mapped area */
 	size_t			consumed;	/* number of consumed bytes */
 	struct sg_page_iter	piter;		/* page iterator */
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..2d1c58c 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -571,6 +571,8 @@ EXPORT_SYMBOL(sg_miter_skip);
  */
 bool sg_miter_next(struct sg_mapping_iter *miter)
 {
+	void *addr;
+
 	sg_miter_stop(miter);

 	/*
@@ -580,13 +582,25 @@ bool sg_miter_next(struct sg_mapping_iter *miter)
 	if (!sg_miter_get_next_page(miter))
 		return false;

+	if (sg_is_iomem(miter->piter.sg) &&
+	    !(miter->__flags & SG_MITER_SUPPORTS_IOMEM))
+		return false;
+
 	miter->page = sg_page_iter_page(&miter->piter);
 	miter->consumed = miter->length = miter->__remaining;

 	if (miter->__flags & SG_MITER_ATOMIC)
-		miter->addr = kmap_atomic(miter->page) + miter->__offset;
+		addr = kmap_atomic(miter->page) + miter->__offset;
 	else
-		miter->addr = kmap(miter->page) + miter->__offset;
+		addr = kmap(miter->page) + miter->__offset;
+
+	if (sg_is_iomem(miter->piter.sg)) {
+		miter->addr = NULL;
+		miter->ioaddr = (void __iomem *) addr;
+	} else {
+		miter->addr = addr;
+		miter->ioaddr = NULL;
+	}

 	return true;
 }
@@ -651,7 +665,7 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 {
 	unsigned int offset = 0;
 	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC;
+	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_SUPPORTS_IOMEM;

 	if (to_buffer)
 		sg_flags |= SG_MITER_FROM_SG;
@@ -668,10 +682,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,

 		len = min(miter.length, buflen - offset);

-		if (to_buffer)
-			memcpy(buf + offset, miter.addr, len);
-		else
-			memcpy(miter.addr, buf + offset, len);
+		if (miter.addr) {
+			if (to_buffer)
+				memcpy(buf + offset, miter.addr, len);
+			else
+				memcpy(miter.addr, buf + offset, len);
+		} else if (miter.ioaddr) {
+			if (to_buffer)
+				memcpy_fromio(buf + offset, miter.ioaddr, len);
+			else
+				memcpy_toio(miter.ioaddr, buf + offset, len);
+		}

 		offset += len;
 	}

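For readers who want to poke at the encoding outside the kernel, here is
a rough user-space model of the idea in the draft above: the sg entry
holds a frame number plus a couple of flag bits, and the helpers mask the
flags off before using the number. Everything prefixed demo_/DEMO_ is
made up for illustration; the flag positions, the 4K page size and the
helper names are assumptions, not what linux/pfn_t.h or the draft
actually defines.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(1ULL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK	(~(DEMO_PAGE_SIZE - 1))

/* Flag positions are assumptions for this model only. */
#define DEMO_SG_CHAIN	(1ULL << 63)
#define DEMO_SG_LAST	(1ULL << 62)
#define DEMO_FLAGS	(DEMO_SG_CHAIN | DEMO_SG_LAST)

typedef struct { uint64_t val; } demo_pfn_t;

struct demo_sg {
	demo_pfn_t	pfn;
	unsigned int	offset;
	unsigned int	length;
};

static inline bool demo_sg_is_chain(const struct demo_sg *sg)
{
	return (sg->pfn.val & DEMO_SG_CHAIN) != 0;
}

static inline bool demo_sg_is_last(const struct demo_sg *sg)
{
	return (sg->pfn.val & DEMO_SG_LAST) != 0;
}

/* Frame number with the flag bits stripped. */
static inline uint64_t demo_sg_pfn(const struct demo_sg *sg)
{
	return sg->pfn.val & ~DEMO_FLAGS;
}

static inline void demo_sg_set_pfn(struct demo_sg *sg, uint64_t pfn,
				   unsigned int len, unsigned int offset)
{
	sg->pfn.val = pfn & ~DEMO_FLAGS;
	sg->offset = offset;
	sg->length = len;
}

/* Set the termination bit and clear a possible chain bit. */
static inline void demo_sg_mark_end(struct demo_sg *sg)
{
	sg->pfn.val |= DEMO_SG_LAST;
	sg->pfn.val &= ~DEMO_SG_CHAIN;
}

/* Physical address: frame number shifted up, plus the sg offset. */
static inline uint64_t demo_sg_phys(const struct demo_sg *sg)
{
	return (demo_sg_pfn(sg) << DEMO_PAGE_SHIFT) + sg->offset;
}

/*
 * Turn an entry into a chain link. The next table must be page aligned
 * because only the frame number is stored; the low bits would be lost.
 */
static inline bool demo_sg_chain(struct demo_sg *sg, uint64_t next)
{
	if (next & ~DEMO_PAGE_MASK)
		return false;
	sg->offset = 0;
	sg->length = 0;
	sg->pfn.val = (next >> DEMO_PAGE_SHIFT) | DEMO_SG_CHAIN;
	return true;
}

static inline uint64_t demo_sg_chain_ptr(const struct demo_sg *sg)
{
	return demo_sg_pfn(sg) << DEMO_PAGE_SHIFT;
}
```

The point of the model is only that storing a frame number instead of a
pointer moves the alignment requirement for chained tables from 4 bytes
to a full page, which is the trade-off the draft makes.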
^ permalink raw reply related

* Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages
From: David Ahern @ 2017-04-27 19:59 UTC (permalink / raw)
  To: vyasevic, Vladislav Yasevich, netdev; +Cc: roopa, Jiri Pirko
In-Reply-To: <7b5396ae-0cf3-2a1e-9c49-5d6f031adf58@redhat.com>

On 4/27/17 1:43 PM, Vlad Yasevich wrote:
>> For example, NETDEV_CHANGEINFODATA is only for bonds though nothing
>> about the name suggests it is a bonding notification. This one was added
>> specifically to notify userspace (d4261e5650004), yet seems to happen
>> only during a changelink and that already generates a RTM_NEWLINK
>> message via do_setlink. Since the rtnetlink_event message does not
>> contain anything "NETDEV_CHANGEINFODATA" related what purpose does it
>> really serve besides duplicating netlink messages to userspace.
>>
> 
> I am not sure about this one, but if you have an app trying to monitor
> for this event, it can't really since there is no info in the netlink message.

I cc'ed Jiri on this thread hoping he would explain the intent.

I propose it gets removed.

^ permalink raw reply

* Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages
From: Vlad Yasevich @ 2017-04-27 19:51 UTC (permalink / raw)
  To: Roopa Prabhu, David Ahern
  Cc: Vladislav Yasevich, netdev@vger.kernel.org, Jiri Pirko
In-Reply-To: <CAJieiUi6Uu-=QBMfBXOjrEDXLCZx=mFo9SyLhgb+CcG2=uiyRA@mail.gmail.com>

On 04/24/2017 11:14 AM, Roopa Prabhu wrote:
> On Sun, Apr 23, 2017 at 6:07 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>>
>> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
>>> @@ -1276,9 +1277,40 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
>>>       return err;
>>>  }
>>>
>>> +static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>>> +{
>>> +     u32 rtnl_event;
>>> +
>>> +     switch (event) {
>>> +     case NETDEV_REBOOT:
>>> +             rtnl_event = IFLA_EVENT_REBOOT;
>>> +             break;
>>> +     case NETDEV_FEAT_CHANGE:
>>> +             rtnl_event = IFLA_EVENT_FEAT_CHANGE;
>>> +             break;
>>> +     case NETDEV_BONDING_FAILOVER:
>>> +             rtnl_event = IFLA_EVENT_BONDING_FAILOVER;
>>> +             break;
>>> +     case NETDEV_NOTIFY_PEERS:
>>> +             rtnl_event = IFLA_EVENT_NOTIFY_PEERS;
>>> +             break;
>>> +     case NETDEV_RESEND_IGMP:
>>> +             rtnl_event = IFLA_EVENT_RESEND_IGMP;
>>> +             break;
>>> +     case NETDEV_CHANGEINFODATA:
>>> +             rtnl_event = IFLA_EVENT_CHANGE_INFO_DATA;
>>> +             break;
>>> +     default:
>>> +             return 0;
>>> +     }
>>> +
>>> +     return nla_put_u32(skb, IFLA_EVENT, rtnl_event);
>>> +}
>>> +
>>
>> I still have doubts about encoding kernel events into a uapi.
> 
> agree. I don't see why user-space will need NETDEV_CHANGEINFODATA and
> others david listed.
> 

Well, I am not sure about CHANGEINFODATA either, but I can see use
cases for others.

> My other concerns are, once we have this exposed to user-space and
> user-space starts relying on it, it will need accurate information and
> will expect to have this event information all the time.
> IIUC, we cannot cover multiple events in a single notification and not
> all link notifications will contain an IFLA_EVENT attribute.

Uhm...  If the rtnetlink message was a result of an event, it will have
an IFLA_EVENT.  If a message is something else, then it will not have
an event.  That's the point.  Not all netlink attributes are in every
netlink message.

> In other
> words, we will be telling user-space to not expect that the kernel
> will send IFLA_EVENT every time.
> 

No, we are telling the user that if it is interested in a specific event
(let's say NOTIFY_PEERS or RESEND_IGMP), then it now can monitor netlink
traffic for those events.
As things stand right now, that's not possible.
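
To make the "attribute may or may not be there" point concrete, here is
a small self-contained sketch of how a listener might look for an
optional u32 attribute in a buffer of length/type/value records laid out
the way netlink attributes are (4 byte header, payload, padding to a
4 byte boundary). It does not use the real netlink headers; struct
demo_attr, DEMO_ATTR_EVENT and its number are hypothetical stand-ins
chosen for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ALIGN4(n)	(((n) + 3u) & ~3u)
#define DEMO_ATTR_EVENT	100	/* hypothetical attribute type number */

struct demo_attr {
	uint16_t len;	/* header plus payload, without padding */
	uint16_t type;
};

/* Append one attribute at pos; returns the position of the next one. */
static inline size_t demo_put(unsigned char *buf, size_t pos, uint16_t type,
			      const void *data, uint16_t dlen)
{
	struct demo_attr hdr = { (uint16_t)(sizeof(hdr) + dlen), type };

	memcpy(buf + pos, &hdr, sizeof(hdr));
	memcpy(buf + pos + sizeof(hdr), data, dlen);
	return pos + DEMO_ALIGN4(hdr.len);
}

/*
 * Walk the attributes. Returns true and fills *event only if the event
 * attribute is present; a message without it is simply "not an event".
 */
static inline bool demo_find_event(const unsigned char *buf, size_t buflen,
				   uint32_t *event)
{
	size_t pos = 0;

	while (pos <= buflen && buflen - pos >= sizeof(struct demo_attr)) {
		struct demo_attr hdr;

		memcpy(&hdr, buf + pos, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > buflen - pos)
			return false;	/* malformed record, stop */

		if (hdr.type == DEMO_ATTR_EVENT &&
		    hdr.len == sizeof(hdr) + sizeof(*event)) {
			memcpy(event, buf + pos + sizeof(hdr), sizeof(*event));
			return true;
		}
		pos += DEMO_ALIGN4(hdr.len);
	}
	return false;
}
```

A monitor built this way treats a link message without the attribute
exactly as it does today and only reacts when the attribute shows up.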

I've done this specifically for all events for which we currently generate
a netlink message.

The only concern I have is that if in the future we remove a certain netdev
event, it may impact applications.  But we may be doing it right now as well,
only silently, and the apps may have to find some ways to work around it.

There is also a potential to improve libnl caching and not invalidate the
cached data for certain events.

-vlad
> 
> 
>>
>> For example, NETDEV_CHANGEINFODATA is only for bonds though nothing
>> about the name suggests it is a bonding notification. This one was added
>> specifically to notify userspace (d4261e5650004), yet seems to happen
>> only during a changelink and that already generates a RTM_NEWLINK
>> message via do_setlink. Since the rtnetlink_event message does not
>> contain anything "NETDEV_CHANGEINFODATA" related what purpose does it
>> really serve besides duplicating netlink messages to userspace.
>>
>> The REBOOT, IGMP, FEAT_CHANGE and BONDING_FAILOVER seem to be unique
>> messages (code analysis only) which I get for notifying userspace.
>>
>> NETDEV_NOTIFY_PEERS is not so clear in how often it duplicates other
>> messages.

^ permalink raw reply

* Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages
From: Vlad Yasevich @ 2017-04-27 19:43 UTC (permalink / raw)
  To: David Ahern, Vladislav Yasevich, netdev; +Cc: roopa, Jiri Pirko
In-Reply-To: <877efb54-2aef-4d1e-c0b4-2ce6aa6562df@cumulusnetworks.com>

On 04/23/2017 09:07 PM, David Ahern wrote:
> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
>> @@ -1276,9 +1277,40 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
>>  	return err;
>>  }
>>  
>> +static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>> +{
>> +	u32 rtnl_event;
>> +
>> +	switch (event) {
>> +	case NETDEV_REBOOT:
>> +		rtnl_event = IFLA_EVENT_REBOOT;
>> +		break;
>> +	case NETDEV_FEAT_CHANGE:
>> +		rtnl_event = IFLA_EVENT_FEAT_CHANGE;
>> +		break;
>> +	case NETDEV_BONDING_FAILOVER:
>> +		rtnl_event = IFLA_EVENT_BONDING_FAILOVER;
>> +		break;
>> +	case NETDEV_NOTIFY_PEERS:
>> +		rtnl_event = IFLA_EVENT_NOTIFY_PEERS;
>> +		break;
>> +	case NETDEV_RESEND_IGMP:
>> +		rtnl_event = IFLA_EVENT_RESEND_IGMP;
>> +		break;
>> +	case NETDEV_CHANGEINFODATA:
>> +		rtnl_event = IFLA_EVENT_CHANGE_INFO_DATA;
>> +		break;
>> +	default:
>> +		return 0;
>> +	}
>> +
>> +	return nla_put_u32(skb, IFLA_EVENT, rtnl_event);
>> +}
>> +
> 
> I still have doubts about encoding kernel events into a uapi.
> 
> For example, NETDEV_CHANGEINFODATA is only for bonds though nothing
> about the name suggests it is a bonding notification. This one was added
> specifically to notify userspace (d4261e5650004), yet seems to happen
> only during a changelink and that already generates a RTM_NEWLINK
> message via do_setlink. Since the rtnetlink_event message does not
> contain anything "NETDEV_CHANGEINFODATA" related what purpose does it
> really serve besides duplicating netlink messages to userspace.
> 

I am not sure about this one, but if you have an app trying to monitor
for this event, it can't really since there is no info in the netlink message.

> The REBOOT, IGMP, FEAT_CHANGE and BONDING_FAILOVER seem to be unique
> messages (code analysis only) which I get for notifying userspace.
> 
> NETDEV_NOTIFY_PEERS is not so clear in how often it duplicates other
> messages.
> 

This one sometimes happens in addition to bonding failover, but not always
(it depends on bonding mode).
For me, having access to this particular event is important as it will
be used to trigger guest announcements.

-vlad

^ permalink raw reply

* Re: [PATCH net-next 10/18] net: dsa: mv88e6xxx: move STU GetNext operation
From: Andrew Lunn @ 2017-04-27 19:43 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20170426155336.5937-11-vivien.didelot@savoirfairelinux.com>

On Wed, Apr 26, 2017 at 11:53:28AM -0400, Vivien Didelot wrote:
> Extract the generic portion of code to issue an STU GetNext operation,
> which will be used in other implementations.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* [PATCH net 2/2] vxlan: do not output confusing error message
From: Jiri Benc @ 2017-04-27 19:24 UTC (permalink / raw)
  To: netdev; +Cc: Marcelo Ricardo Leitner
In-Reply-To: <cover.1493320999.git.jbenc@redhat.com>

The message "Cannot bind port X, err=Y" creates only confusion. In metadata
based mode, failure of IPv6 socket creation is okay if IPv6 is disabled and
no error message should be printed. But when IPv6 tunnel was requested, such
failure is fatal. The vxlan_socket_create does not know when the error is
harmless and when it's not.

Instead of passing such information down to vxlan_socket_create, remove the
message completely. It's not useful. We propagate the error code up to the
user space and the port number comes from the user space. There's nothing in
the message that the process creating vxlan interface does not know.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 drivers/net/vxlan.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 118e508f1889..27cba699d83a 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2754,8 +2754,6 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,

 	sock = vxlan_create_sock(net, ipv6, port, flags);
 	if (IS_ERR(sock)) {
-		pr_info("Cannot bind port %d, err=%ld\n", ntohs(port),
-			PTR_ERR(sock));
 		kfree(vs);
 		return ERR_CAST(sock);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net 1/2] vxlan: correctly handle ipv6.disable module parameter
From: Jiri Benc @ 2017-04-27 19:24 UTC (permalink / raw)
  To: netdev; +Cc: Marcelo Ricardo Leitner
In-Reply-To: <cover.1493320999.git.jbenc@redhat.com>

When IPv6 is compiled but disabled at runtime, __vxlan_sock_add returns
-EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
operation of bringing up the tunnel.

Ignore failure of IPv6 socket creation for metadata based tunnels caused by
IPv6 not being available.

Fixes: b1be00a6c39f ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 drivers/net/vxlan.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index bdb6ae16d4a8..118e508f1889 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2818,17 +2818,21 @@ static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
 
 static int vxlan_sock_add(struct vxlan_dev *vxlan)
 {
-	bool ipv6 = vxlan->flags & VXLAN_F_IPV6;
 	bool metadata = vxlan->flags & VXLAN_F_COLLECT_METADATA;
+	bool ipv6 = vxlan->flags & VXLAN_F_IPV6 || metadata;
+	bool ipv4 = !ipv6 || metadata;
 	int ret = 0;
 
 	RCU_INIT_POINTER(vxlan->vn4_sock, NULL);
 #if IS_ENABLED(CONFIG_IPV6)
 	RCU_INIT_POINTER(vxlan->vn6_sock, NULL);
-	if (ipv6 || metadata)
+	if (ipv6) {
 		ret = __vxlan_sock_add(vxlan, true);
+		if (ret < 0 && ret != -EAFNOSUPPORT)
+			ipv4 = false;
+	}
 #endif
-	if (!ret && (!ipv6 || metadata))
+	if (ipv4)
 		ret = __vxlan_sock_add(vxlan, false);
 	if (ret < 0)
 		vxlan_sock_release(vxlan);
-- 
1.8.3.1
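
For anyone following the control flow of the change above, the decision
logic can be modelled in user space roughly like this. It is a sketch
only: it assumes CONFIG_IPV6 is enabled, and demo_sock_add(), struct
demo_result and the v6_ret/v4_ret parameters (which stand in for what
__vxlan_sock_add() would return) are invented for the example.

```c
#include <errno.h>
#include <stdbool.h>

struct demo_result {
	int	ret;		/* final return value */
	bool	tried_v6;	/* IPv6 socket creation was attempted */
	bool	tried_v4;	/* IPv4 socket creation was attempted */
};

/*
 * want_ipv6: the tunnel was configured as an IPv6 tunnel
 * metadata:  the tunnel is metadata based and wants both families
 * v6_ret, v4_ret: assumed outcome of creating each socket
 */
static inline struct demo_result demo_sock_add(bool want_ipv6, bool metadata,
					       int v6_ret, int v4_ret)
{
	bool ipv6 = want_ipv6 || metadata;
	bool ipv4 = !ipv6 || metadata;
	struct demo_result r = { 0, false, false };

	if (ipv6) {
		r.tried_v6 = true;
		r.ret = v6_ret;
		/* Any failure other than "IPv6 unavailable" is fatal. */
		if (r.ret < 0 && r.ret != -EAFNOSUPPORT)
			ipv4 = false;
	}
	if (ipv4) {
		r.tried_v4 = true;
		r.ret = v4_ret;	/* overrides a tolerated IPv6 failure */
	}
	return r;
}
```

With this model, a metadata based tunnel on a host booted with
ipv6.disable=1 ends up with only the IPv4 result, while an explicitly
IPv6 tunnel still fails with -EAFNOSUPPORT because the IPv4 path is
never taken.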

^ permalink raw reply related

* [PATCH net 0/2] vxlan: do not error out on disabled IPv6
From: Jiri Benc @ 2017-04-27 19:24 UTC (permalink / raw)
  To: netdev; +Cc: Marcelo Ricardo Leitner

This patchset fixes a bug with metadata based tunnels when booted with
ipv6.disable=1.


Jiri Benc (2):
  vxlan: correctly handle ipv6.disable module parameter
  vxlan: do not output confusing error message

 drivers/net/vxlan.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [PATCH net-next 1/2] rtnetlink: Disable notification for NETDEV_NAMECHANGE event
From: Vlad Yasevich @ 2017-04-27 19:26 UTC (permalink / raw)
  To: David Ahern, Vladislav Yasevich, netdev
In-Reply-To: <becfb3d4-5e9d-800b-fc67-1e99f4b14db9@cumulusnetworks.com>

On 04/21/2017 02:08 PM, David Ahern wrote:
> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
>> The data signaling name change is already provided at
>> the end of do_setlink().  This event handler just generates
>> a duplicate announcement.  Disable it.
>>
>> CC: David Ahern <dsa@cumulusnetworks.com>
>> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
>> ---
>>  net/core/rtnetlink.c | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index 0ee5479..e8e6816 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -4123,7 +4123,6 @@ static int rtnetlink_event(struct notifier_block *this, unsigned long event, voi
>>  
>>  	switch (event) {
>>  	case NETDEV_REBOOT:
>> -	case NETDEV_CHANGENAME:
>>  	case NETDEV_FEAT_CHANGE:
>>  	case NETDEV_BONDING_FAILOVER:
>>  	case NETDEV_NOTIFY_PEERS:
>>
> 
> 
> I only see one using the ip monitor.
> 
> $ ip li set foobar name fubar
> 
> generates these 3 messages:
> 
> [LINK]12: fubar: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN
> group default
>     link/ether 76:cd:72:dd:2a:cb brd ff:ff:ff:ff:ff:ff
> Unknown message: type=0x00000051(81) flags=0x00000000(0)len=0x0000001c(28)
> [NETCONF]ipv4 dev dummy2 forwarding on rp_filter off mc_forwarding off
> proxy_neigh off ignore_routes_with_linkdown off
> Unknown message: type=0x00000051(81) flags=0x00000000(0)len=0x0000001c(28)
> [NETCONF]ipv6 dev dummy2 forwarding on mc_forwarding off proxy_neigh off
> ignore_routes_with_linkdown off
> 
> do_setlink only sets DO_SETLINK_MODIFIED so a name change alone will not
> generate 2 messages.
> 

Actually, it has nothing to do with above flag.  Setting DO_SETLINK_MODIFIED
will still generate notifications, but only if the device is UP.  However,
it looks like link name change can only be done when the link is down.  As
a result, netdev_state_change will not report it, so we only see the 'event'
one.

So this patch isn't needed, but only as a kind of side effect.

-vlad

^ permalink raw reply

* Re: [PATCH net-next 0/9] support unique MAC addresses for slave devices
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-04-27 19:21 UTC (permalink / raw)
  To: Marco Chiappero
  Cc: linux-netdev, David S . Miller, Jeff Kirsher, Alexander Duyck,
	Sainath Grandhi
In-Reply-To: <20170427145142.15830-1-marco.chiappero@intel.com>

On Thu, Apr 27, 2017 at 7:51 AM, Marco Chiappero
<marco.chiappero@intel.com> wrote:
> Currently every slave device gets assigned the same MAC address, by
> having it copied from the master interface. Since some code paths
> depend on this identity, changing the MAC address on slave interfaces
> is not supported. However identical MAC addresses can pose problems to
> management and orchestration software that correctly expect network
> interfaces on the same segment to have unique addresses.
>
Please understand that there are two distinct drivers IPvlan and
MACvlan. They both exist together for good reasons and are trying to
cater for different needs. I would love to combine them together if we
don't mess / miss the goodies each of them have to offer... otherwise
*NO*! Having said that if management / orchestration software has
problems then clearly you should not use IPvlan for that use case.

> Patches 1-8 include style fixes and refactoring (patch 9 depends upon)
> that improve the overall quality and make the introduction of the
> feature straightforward.
>
Lots of this falls into I-say-potato-you-say-... category. My way of
thinking / organizing code is different than yours and you don't have
to like mine and I don't have to like yours.

> Patch 9 enables slave devices to own unique MAC addresses and change
> such addresses live, fixing lack of support and a related bug, as
> MAC address changes on master were not propagated to slave devices.
> In order to preserve the main peculiarity of this driver, that is
> exposing only a single MAC address for outbound traffic, frames
> egressing from master are now effectively masqueraded when working in
> L2 mode.
>
This enhancement is, however, coming via packet-header rewrite for
every Tx/Rx packet which defeats the purpose. The only good thing that
came to light is the mac-addr change propagation from master issue;
but if the fix is coming as a side-effect of header rewrite then it's
not an acceptable fix either. This can be simply fixed by changing a
line in ipvlan_hard_header().

> Marco Chiappero (9):
>   ipvlan: fix coding style for the ipvlan tree
>   ipvlan: refactor ipvlan_process_multicast for readability
>   ipvlan: replace ipvlan_rcv_frame
>   ipvlan: rework the IP lookup function
>   ipvlan: improve and uniform naming
>   ipvlan: reposition three functions
>   ipvlan: relocate ipvlan_skb_crossing_ns calls
>   ipvlan: improve compiler hints
>   ipvlan: introduce individual MAC addresses
>
>  drivers/net/ipvlan/ipvlan.h      |   2 +-
>  drivers/net/ipvlan/ipvlan_core.c | 592 ++++++++++++++++++++-------------------
>  drivers/net/ipvlan/ipvlan_main.c |  49 ++--
>  drivers/net/ipvlan/ipvtap.c      |   1 +
>  4 files changed, 333 insertions(+), 311 deletions(-)
>
> --
> 2.9.3

^ permalink raw reply
