Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCHv2 2/2] Don't compute checksum value for SCTP skb with, CHECKSUM_PARTIAL set
From: Fan Du @ 2013-10-14  7:16 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: Neil Horman, steffen.klassert, davem, netdev
In-Reply-To: <b4c6105a-057c-4fd7-9b49-99df902447d8@email.android.com>



On 2013年10月12日 21:06, Vlad Yasevich wrote:
>
>
> Fan Du<fan.du@windriver.com>  wrote:
>
>>
>>
>> On 2013年10月11日 22:25, Vlad Yasevich wrote:
>>> On 10/11/2013 03:08 AM, Fan Du wrote:
>>>>
>>>>
>>>> On 2013年10月10日 21:11, Neil Horman wrote:
>>>>> On Thu, Oct 10, 2013 at 01:51:36PM +0800, Fan Du wrote:
>>>>>> igb/ixgbe have hardware sctp checksum support, when this feature
>> is
>>>>>> enabled
>>>>>> and also IPsec is armed to protect sctp traffic, ugly things
>> happened as
>>>>>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum
>>>>>> every thing
>>>>>> up and pack the 16bits result in the checksum field). The result
>> is fail
>>>>>> establishment of sctp communication.
>>>>>>
>>>>> Shouldn't this be fixed in the xfrm code then? E.g. check the
>> device
>>>>> features
>>>>> for SCTP checksum offloading and and skip the checksum during xfrm
>>>>> output if its
>>>>> available?
>>>>>
>>>>> Or am I missing something?
>>>>> Neil
>>>>>
>>>>>
>>>>
>>>>
>>>>  From cff25810910603ff991f0c56441d6de8dc33a822 Mon Sep 17 00:00:00
>> 2001
>>>> From: Fan Du<fan.du@windriver.com>
>>>> Date: Fri, 11 Oct 2013 14:31:57 +0800
>>>> Subject: [PATCH 2/2] Don't compute checksum value for SCTP skb with
>>>> CHECKSUM_PARTIAL set
>>>> MIME-Version: 1.0
>>>> Content-Type: text/plain; charset=UTF-8
>>>> Content-Transfer-Encoding: 8bit
>>>>
>>>> IPsec is not enabled in this scenario:
>>>>
>>>> SCTP skb set CHECKSUM_PARTIAL to indicate hardware is capable of
>> doing
>>>> SCTP checksum(crc32-c) scoping the whole SCTP packet range. However
>> when
>>>> such kind of skb is delivered through IPv4/v6 output handler,
>> IPv4/v6 stack
>>>> interpret CHECKSUM_PARTIAL by calling skb_checksum_help to compute
>> 16bits
>>>> checksum value by summing everything up, the whole SCTP packet in
>> software
>>>> manner！ After this skb reach NIC, after hardware doing its SCTP
>> checking
>>>> business, a crc32-c value will overwrite the value IPv4/v6 stack
>> computed
>>>> before.
>>>>
>>>> This patch solves this by introducing skb_is_sctpv4/6 to optimize
>> such
>>>> case.
>>>>
>>>> Signed-off-by: Fan Du<fan.du@windriver.com>
>>>> ---
>>>> v2:
>>>> Rework this problem by introducing skb_is_scktv4/6
>>>>
>>>> ---
>>>> include/linux/ip.h | 5 +++++
>>>> include/linux/ipv6.h | 6 ++++++
>>>> include/linux/skbuff.h | 1 -
>>>> net/core/skbuff.c | 1 +
>>>> net/ipv4/ip_output.c | 4 +++-
>>>> net/ipv6/ip6_output.c | 1 +
>>>> net/xfrm/xfrm_output.c | 20 +++++++++++++++-----
>>>> 7 files changed, 31 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/ip.h b/include/linux/ip.h
>>>> index 492bc65..f556292 100644
>>>> --- a/include/linux/ip.h
>>>> +++ b/include/linux/ip.h
>>>> @@ -19,6 +19,7 @@
>>>>
>>>> #include<linux/skbuff.h>
>>>> #include<uapi/linux/ip.h>
>>>> +#include<uapi/linux/in.h>
>>>>
>>>> static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
>>>> {
>>>> @@ -34,4 +35,8 @@ static inline struct iphdr *ipip_hdr(const struct
>>>> sk_buff *skb)
>>>> {
>>>> return (struct iphdr *)skb_transport_header(skb);
>>>> }
>>>> +static inline int skb_is_sctpv4(const struct sk_buff *skb)
>>>> +{
>>>> + return ip_hdr(skb)->protocol == IPPROTO_SCTP;
>>>> +}
>>>> #endif /* _LINUX_IP_H */
>>>> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
>>>> index 28ea384..6e17c04 100644
>>>> --- a/include/linux/ipv6.h
>>>> +++ b/include/linux/ipv6.h
>>>> @@ -2,6 +2,7 @@
>>>> #define _IPV6_H
>>>>
>>>> #include<uapi/linux/ipv6.h>
>>>> +#include<uapi/linux/in.h>
>>>>
>>>> #define ipv6_optlen(p) (((p)->hdrlen+1)<<  3)
>>>> /*
>>>> @@ -387,4 +388,9 @@ static inline struct raw6_sock *raw6_sk(const
>> struct
>>>> sock *sk)
>>>> ((__sk)->sk_bound_dev_if == (__dif)))&&  \
>>>> net_eq(sock_net(__sk), (__net)))
>>>>
>>>> +static inline int skb_is_sctpv6(const struct sk_buff *skb)
>>>> +{
>>>> + return ipv6_hdr(skb)->nexthdr == IPPROTO_SCTP;
>>>> +}
>>>> +
>>>> #endif /* _IPV6_H */
>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>> index 2ddb48d..b36d0cc 100644
>>>> --- a/include/linux/skbuff.h
>>>> +++ b/include/linux/skbuff.h
>>>> @@ -2393,7 +2393,6 @@ extern void skb_split(struct sk_buff *skb,
>>>> extern int skb_shift(struct sk_buff *tgt, struct sk_buff *skb,
>>>> int shiftlen);
>>>> extern void skb_scrub_packet(struct sk_buff *skb, bool xnet);
>>>> -
>>>> extern struct sk_buff *skb_segment(struct sk_buff *skb,
>>>> netdev_features_t features);
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index d81cff1..54d6172 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -3526,3 +3526,4 @@ void skb_scrub_packet(struct sk_buff *skb,
>> bool xnet)
>>>> nf_reset_trace(skb);
>>>> }
>>>> EXPORT_SYMBOL_GPL(skb_scrub_packet);
>>>> +
>>>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>>>> index a04d872..8676b2c 100644
>>>> --- a/net/ipv4/ip_output.c
>>>> +++ b/net/ipv4/ip_output.c
>>>> @@ -587,7 +587,9 @@ slow_path_clean:
>>>>
>>>> slow_path:
>>>> /* for offloaded checksums cleanup checksum before fragmentation */
>>>> - if ((skb->ip_summed == CHECKSUM_PARTIAL)&&
>> skb_checksum_help(skb))
>>>> + if ((skb->ip_summed == CHECKSUM_PARTIAL)&&
>>>> + !skb_is_sctpv4(skb)&&
>>>> + skb_checksum_help(skb))
>>>> goto fail;
>>>> iph = ip_hdr(skb);
>>>>
>>>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>>>> index 3a692d5..9b27d95 100644
>>>> --- a/net/ipv6/ip6_output.c
>>>> +++ b/net/ipv6/ip6_output.c
>>>> @@ -671,6 +671,7 @@ slow_path_clean:
>>>>
>>>> slow_path:
>>>> if ((skb->ip_summed == CHECKSUM_PARTIAL)&&
>>>> + !skb_is_sctpv6(skb)&&
>>>> skb_checksum_help(skb))
>>>> goto fail;
>>>>
>>>
>>> No, this isn't right. This is a case where IP fragmentation is
>>> required, and the above code will cause SCTP checksum to not be
>>> computed.
>>
>> Ok, I got my ball, ten bucks bet this correct ;)
>>
>> IPv4:
>> after skb reach fragmentation part, CHECKSUM_PARTIAL denotes
>> L4 layer protocol need to be checksummed, then IPv4 checksum is
>> recomputed for each fragmented IPv4 packet.
>>
>> IPv6:
>> Here IPv6 doesn't need checksum for its header, again
>> CHECKSUM_PARTIAL denotes L4 layer protocol need to be checksummed.
>>
>> So all in all, this is the right place to distinguish SCTP skb out,
>> and skip checksum operation as hw does it thereafter.
>>
>
>   How does HW compute SCTP checksum when the data is split between skb?
> Each skb will be submitted separately to the HW.
> I think we need to fall back to SW checksum when packer will be fragmented.

Hi, Vlad

I understand your argument now, I finally applied a 82576 NIC with SCTP CHECKSUM
supported to test this. It seems when sending this super-sized packet,
sctp_datamsg_from_user fragments each packet to 1480 size already using pathmtu,
this pathmtu is equal or less than the interface device mtu, which means
ip_fragment/ip6_fragment didn't fragments any SCTP skb.

Host A(with 82576):
sctp_test -h 128.224.162.161 -p 5001 -H 128.224.162.220 -P 500 -x 1 -c 5 -s -T
                                                                      ^^^
                                                 set each packet to 32768 bytes

I cannot picture any scenario where a SCTP skb with CHECK_PARTIAL set to reach
ip_fragment/ip6_fragment. FWIW, the fix for xfrm_output part is definitely a
valid one.


> -vlad
>
>> Q.E.D.
>>
>>
>>> Looks like SCTP needs to compute the checksum in the case where
>>> skb will be fragmented.
>>>
>>> An alternative, that will also allow us to get rid of patch 1
>>> in the serices is to have a checksum handler offload function
>>> that can be used to compute checksum in this case.
>>>
>>> -vlad
>>>
>>>> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
>>>> index 3bb2cdc..ddef94a 100644
>>>> --- a/net/xfrm/xfrm_output.c
>>>> +++ b/net/xfrm/xfrm_output.c
>>>> @@ -180,6 +180,14 @@ static int xfrm_output_gso(struct sk_buff *skb)
>>>> return 0;
>>>> }
>>>>
>>>> +static int skb_is_sctp(const struct sk_buff *skb)
>>>> +{
>>>> + if (skb->protocol == __constant_htons(ETH_P_IP))
>>>> + return skb_is_sctpv4(skb);
>>>> + else
>>>> + return skb_is_sctpv6(skb);
>>>> +}
>>>> +
>>>> int xfrm_output(struct sk_buff *skb)
>>>> {
>>>> struct net *net = dev_net(skb_dst(skb)->dev);
>>>> @@ -189,11 +197,13 @@ int xfrm_output(struct sk_buff *skb)
>>>> return xfrm_output_gso(skb);
>>>>
>>>> if (skb->ip_summed == CHECKSUM_PARTIAL) {
>>>> - err = skb_checksum_help(skb);
>>>> - if (err) {
>>>> - XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
>>>> - kfree_skb(skb);
>>>> - return err;
>>>> + if (!skb_is_sctp(skb)) {
>>>> + err = skb_checksum_help(skb);
>>>> + if (err) {
>>>> + XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
>>>> + kfree_skb(skb);
>>>> + return err;
>>>> + }
>>>> }
>>>> }
>>>>
>>>
>>>
>

-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

* Re: [RFC] ipv6: always join solicited-node address
From: Bjørn Mork @ 2013-10-14  7:09 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20131014005508.GE14021@order.stressinduktion.org>

Hannes Frederic Sowa <hannes@stressinduktion.org> writes:

> IFF_NOARP seems to be a bit messed up in ipv6. Your patch seems fine to
> me but I would add protection to the ndisc_rcv and sending routines to
> do nothing if IFF_NOARP is set for that interface.

That's what I thought.  So for my problem, this will not really change
much.

> So it would be possible that you could resolve this issue by just issuing
> an "ip link set arp on dev <interface>" and won't have hassle with racing
> interface initialization.

Yes, but that would also make the IP layer try to resolve IP to link
layer addressess both for IPv4 and IPv6, which just won't work. At least
not for IPv4, where there just is no way to transport an ARP to the
modem.  And I assume it may fail for IPv6 too on any sane device.

> Is this a specific bug of the modem you are using or are all devices
> powered by this driver like this?

Unfortunately I have no IPv6 enabled SIM myself, so I have no
information about other devices.  This report was based on user
feedback.

I assume the bug is specific to this firmware implementation, probably
extending to all similar devices from the same vendor.  But it could be
more common than that.  The fact that the bug is there indicates that
this works just fine in Windows.

Yes, I realize that I am in ugly-hack-to-workaround-firmware-issues land
again... But it sure would be nice to have some way for a driver to
indicate that L2 neighbour tables are meaningless, but that any incoming
requests should still be answered.

Bjørn

^ permalink raw reply

* Re: [PATCH net] tcp: fix incorrect ca_state in tail loss probe
From: Neal Cardwell @ 2013-10-14  5:55 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, Nandita Dukkipati, Netdev, michael, jwboyer,
	Steinar H. Gunderson, dormando
In-Reply-To: <1381598187-9681-1-git-send-email-ycheng@google.com>

On Sat, Oct 12, 2013 at 1:16 PM, Yuchung Cheng <ycheng@google.com> wrote:
> On receiving an ACK that covers the loss probe sequence, TLP
> immediately sets the congestion state to Open, even though some packets
> are not recovered and retransmisssion are on the way.  The later ACks
> may trigger a WARN_ON check in step D of tcp_fastretrans_alert(), e.g.,
> https://bugzilla.redhat.com/show_bug.cgi?id=989251
>
> The fix is to follow the similar procedure in recovery by calling
> tcp_try_keep_open(). The sender switches to Open state if no packets
> are retransmissted. Otherwise it goes to Disorder and let subsequent
> ACKs move the state to Recovery or Open.
>
> Reported-By: Michael Sterrett <michael@sterretts.net>
> Tested-By: Dormando <dormando@rydia.net>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

Nice catch!

neal

^ permalink raw reply

* [PATCH v3 net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-14  3:50 UTC (permalink / raw)
  To: David S. Miller
  Cc: Jesse Gross, Pravin B Shelar, Jiri Pirko, Cong Wang, dev, netdev

The combination of two commits:
commit 8e4e1713e4
("openvswitch: Simplify datapath locking.")
commit 2537b4dd0a
("openvswitch:: link upper device for port devices")

introduced a bug where upper_dev wasn't unlinked upon
netdev_unregister notification

The following steps:

  modprobe openvswitch
  ovs-dpctl add-dp test
  ip tuntap add dev tap1 mode tap
  ovs-dpctl add-if test tap1
  ip tuntap del dev tap1 mode tap

are causing multiple warnings:

[   62.747557] gre: GRE over IPv4 demultiplexor driver
[   62.749579] openvswitch: Open vSwitch switching datapath
[   62.755087] device test entered promiscuous mode
[   62.765911] device tap1 entered promiscuous mode
[   62.766033] IPv6: ADDRCONF(NETDEV_UP): tap1: link is not ready
[   62.769017] ------------[ cut here ]------------
[   62.769022] WARNING: CPU: 1 PID: 3267 at net/core/dev.c:5501 rollback_registered_many+0x20f/0x240()
[   62.769023] Modules linked in: openvswitch gre vxlan ip_tunnel libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc vhost_net macvtap macvlan vhost kvm_intel kvm dm_crypt iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi hid_generic mxm_wmi eeepc_wmi asus_wmi sparse_keymap dm_multipath psmouse serio_raw usbhid hid parport_pc ppdev firewire_ohci lpc_ich firewire_core e1000e crc_itu_t binfmt_misc igb dca ptp pps_core mac_hid wmi lp parport i2o_config i2o_block video
[   62.769051] CPU: 1 PID: 3267 Comm: ip Not tainted 3.12.0-rc3+ #60
[   62.769052] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[   62.769053]  0000000000000009 ffff8807f25cbd28 ffffffff8175e575 0000000000000006
[   62.769055]  0000000000000000 ffff8807f25cbd68 ffffffff8105314c ffff8807f25cbd58
[   62.769057]  ffff8807f2634000 ffff8807f25cbdc8 ffff8807f25cbd88 ffff8807f25cbdc8
[   62.769059] Call Trace:
[   62.769062]  [<ffffffff8175e575>] dump_stack+0x55/0x76
[   62.769065]  [<ffffffff8105314c>] warn_slowpath_common+0x8c/0xc0
[   62.769067]  [<ffffffff8105319a>] warn_slowpath_null+0x1a/0x20
[   62.769069]  [<ffffffff8162a04f>] rollback_registered_many+0x20f/0x240
[   62.769071]  [<ffffffff8162a101>] rollback_registered+0x31/0x40
[   62.769073]  [<ffffffff8162a488>] unregister_netdevice_queue+0x58/0x90
[   62.769075]  [<ffffffff8154f900>] __tun_detach+0x140/0x340
[   62.769077]  [<ffffffff8154fb36>] tun_chr_close+0x36/0x60
[   62.769080]  [<ffffffff811bddaf>] __fput+0xff/0x260
[   62.769082]  [<ffffffff811bdf5e>] ____fput+0xe/0x10
[   62.769084]  [<ffffffff8107b515>] task_work_run+0xb5/0xe0
[   62.769087]  [<ffffffff810029b9>] do_notify_resume+0x59/0x80
[   62.769089]  [<ffffffff813a41fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[   62.769091]  [<ffffffff81770f5a>] int_signal+0x12/0x17
[   62.769093] ---[ end trace 838756c62e156ffb ]---
[   62.769481] ------------[ cut here ]------------
[   62.769485] WARNING: CPU: 1 PID: 92 at fs/sysfs/inode.c:325 sysfs_hash_and_remove+0xa9/0xb0()
[   62.769486] sysfs: can not remove 'master', no directory
[   62.769486] Modules linked in: openvswitch gre vxlan ip_tunnel libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc vhost_net macvtap macvlan vhost kvm_intel kvm dm_crypt iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi hid_generic mxm_wmi eeepc_wmi asus_wmi sparse_keymap dm_multipath psmouse serio_raw usbhid hid parport_pc ppdev firewire_ohci lpc_ich firewire_core e1000e crc_itu_t binfmt_misc igb dca ptp pps_core mac_hid wmi lp parport i2o_config i2o_block video
[   62.769514] CPU: 1 PID: 92 Comm: kworker/1:2 Tainted: G        W    3.12.0-rc3+ #60
[   62.769515] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[   62.769518] Workqueue: events ovs_dp_notify_wq [openvswitch]
[   62.769519]  0000000000000009 ffff880807ad3ac8 ffffffff8175e575 0000000000000006
[   62.769521]  ffff880807ad3b18 ffff880807ad3b08 ffffffff8105314c ffff880807ad3b28
[   62.769523]  0000000000000000 ffffffff81a87a1f ffff8807f2634000 ffff880037038500
[   62.769525] Call Trace:
[   62.769528]  [<ffffffff8175e575>] dump_stack+0x55/0x76
[   62.769529]  [<ffffffff8105314c>] warn_slowpath_common+0x8c/0xc0
[   62.769531]  [<ffffffff81053236>] warn_slowpath_fmt+0x46/0x50
[   62.769533]  [<ffffffff8123e7e9>] sysfs_hash_and_remove+0xa9/0xb0
[   62.769535]  [<ffffffff81240e96>] sysfs_remove_link+0x26/0x30
[   62.769538]  [<ffffffff81631ef7>] __netdev_adjacent_dev_remove+0xf7/0x150
[   62.769540]  [<ffffffff81632037>] __netdev_adjacent_dev_unlink_lists+0x27/0x50
[   62.769542]  [<ffffffff8163213a>] __netdev_adjacent_dev_unlink_neighbour+0x3a/0x50
[   62.769544]  [<ffffffff8163218d>] netdev_upper_dev_unlink+0x3d/0x140
[   62.769548]  [<ffffffffa033c2db>] netdev_destroy+0x4b/0x80 [openvswitch]
[   62.769550]  [<ffffffffa033b696>] ovs_vport_del+0x46/0x60 [openvswitch]
[   62.769552]  [<ffffffffa0335314>] ovs_dp_detach_port+0x44/0x60 [openvswitch]
[   62.769555]  [<ffffffffa0336574>] ovs_dp_notify_wq+0xb4/0x150 [openvswitch]
[   62.769557]  [<ffffffff81075c28>] process_one_work+0x1d8/0x6a0
[   62.769559]  [<ffffffff81075bc8>] ? process_one_work+0x178/0x6a0
[   62.769562]  [<ffffffff8107659b>] worker_thread+0x11b/0x370
[   62.769564]  [<ffffffff81076480>] ? rescuer_thread+0x350/0x350
[   62.769566]  [<ffffffff8107f44a>] kthread+0xea/0xf0
[   62.769568]  [<ffffffff8107f360>] ? flush_kthread_worker+0x150/0x150
[   62.769570]  [<ffffffff81770bac>] ret_from_fork+0x7c/0xb0
[   62.769572]  [<ffffffff8107f360>] ? flush_kthread_worker+0x150/0x150
[   62.769573] ---[ end trace 838756c62e156ffc ]---
[   62.769574] ------------[ cut here ]------------
[   62.769576] WARNING: CPU: 1 PID: 92 at fs/sysfs/inode.c:325 sysfs_hash_and_remove+0xa9/0xb0()
[   62.769577] sysfs: can not remove 'upper_test', no directory
[   62.769577] Modules linked in: openvswitch gre vxlan ip_tunnel libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc vhost_net macvtap macvlan vhost kvm_intel kvm dm_crypt iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi hid_generic mxm_wmi eeepc_wmi asus_wmi sparse_keymap dm_multipath psmouse serio_raw usbhid hid parport_pc ppdev firewire_ohci lpc_ich firewire_core e1000e crc_itu_t binfmt_misc igb dca ptp pps_core mac_hid wmi lp parport i2o_config i2o_block video
[   62.769603] CPU: 1 PID: 92 Comm: kworker/1:2 Tainted: G        W    3.12.0-rc3+ #60
[   62.769604] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[   62.769606] Workqueue: events ovs_dp_notify_wq [openvswitch]
[   62.769607]  0000000000000009 ffff880807ad3ac8 ffffffff8175e575 0000000000000006
[   62.769609]  ffff880807ad3b18 ffff880807ad3b08 ffffffff8105314c ffff880807ad3b58
[   62.769611]  0000000000000000 ffff880807ad3bd9 ffff8807f2634000 ffff880037038500
[   62.769613] Call Trace:
[   62.769615]  [<ffffffff8175e575>] dump_stack+0x55/0x76
[   62.769617]  [<ffffffff8105314c>] warn_slowpath_common+0x8c/0xc0
[   62.769619]  [<ffffffff81053236>] warn_slowpath_fmt+0x46/0x50
[   62.769621]  [<ffffffff8123e7e9>] sysfs_hash_and_remove+0xa9/0xb0
[   62.769622]  [<ffffffff81240e96>] sysfs_remove_link+0x26/0x30
[   62.769624]  [<ffffffff81631f22>] __netdev_adjacent_dev_remove+0x122/0x150
[   62.769627]  [<ffffffff81632037>] __netdev_adjacent_dev_unlink_lists+0x27/0x50
[   62.769629]  [<ffffffff8163213a>] __netdev_adjacent_dev_unlink_neighbour+0x3a/0x50
[   62.769631]  [<ffffffff8163218d>] netdev_upper_dev_unlink+0x3d/0x140
[   62.769633]  [<ffffffffa033c2db>] netdev_destroy+0x4b/0x80 [openvswitch]
[   62.769636]  [<ffffffffa033b696>] ovs_vport_del+0x46/0x60 [openvswitch]
[   62.769638]  [<ffffffffa0335314>] ovs_dp_detach_port+0x44/0x60 [openvswitch]
[   62.769640]  [<ffffffffa0336574>] ovs_dp_notify_wq+0xb4/0x150 [openvswitch]
[   62.769642]  [<ffffffff81075c28>] process_one_work+0x1d8/0x6a0
[   62.769644]  [<ffffffff81075bc8>] ? process_one_work+0x178/0x6a0
[   62.769646]  [<ffffffff8107659b>] worker_thread+0x11b/0x370
[   62.769648]  [<ffffffff81076480>] ? rescuer_thread+0x350/0x350
[   62.769650]  [<ffffffff8107f44a>] kthread+0xea/0xf0
[   62.769652]  [<ffffffff8107f360>] ? flush_kthread_worker+0x150/0x150
[   62.769654]  [<ffffffff81770bac>] ret_from_fork+0x7c/0xb0
[   62.769656]  [<ffffffff8107f360>] ? flush_kthread_worker+0x150/0x150
[   62.769657] ---[ end trace 838756c62e156ffd ]---
[   62.769724] device tap1 left promiscuous mode

This patch also affects moving devices between net namespaces.

OVS used to ignore netns move notifications which caused problems.
Like:
  ovs-dpctl add-if test tap1
  ip link set tap1 netns 3512
and then removing tap1 inside the namespace will cause hang on missing dev_put.

With this patch OVS will detach dev upon receiving netns move event.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 net/openvswitch/dp_notify.c    |   12 +++++-------
 net/openvswitch/vport-netdev.c |   16 +++++++++++++---
 net/openvswitch/vport-netdev.h |    1 +
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
index c323567..ffa429a 100644
--- a/net/openvswitch/dp_notify.c
+++ b/net/openvswitch/dp_notify.c
@@ -59,15 +59,9 @@ void ovs_dp_notify_wq(struct work_struct *work)
 			struct hlist_node *n;
 
 			hlist_for_each_entry_safe(vport, n, &dp->ports[i], dp_hash_node) {
-				struct netdev_vport *netdev_vport;
-
 				if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
 					continue;
-
-				netdev_vport = netdev_vport_priv(vport);
-				if (netdev_vport->dev->reg_state == NETREG_UNREGISTERED ||
-				    netdev_vport->dev->reg_state == NETREG_UNREGISTERING)
-					dp_detach_port_notify(vport);
+				dp_detach_port_notify(vport);
 			}
 		}
 	}
@@ -88,6 +82,10 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
 		return NOTIFY_DONE;
 
 	if (event == NETDEV_UNREGISTER) {
+		/* upper_dev_unlink and decrement promisc immediately */
+		ovs_netdev_detach_dev(vport);
+
+		/* schedule vport destroy, dev_put and genl notification */
 		ovs_net = net_generic(dev_net(dev), ovs_net_id);
 		queue_work(system_wq, &ovs_net->dp_notify_work);
 	}
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 09d93c1..d21f77d 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -150,15 +150,25 @@ static void free_port_rcu(struct rcu_head *rcu)
 	ovs_vport_free(vport_from_priv(netdev_vport));
 }
 
-static void netdev_destroy(struct vport *vport)
+void ovs_netdev_detach_dev(struct vport *vport)
 {
 	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
 
-	rtnl_lock();
+	ASSERT_RTNL();
 	netdev_vport->dev->priv_flags &= ~IFF_OVS_DATAPATH;
 	netdev_rx_handler_unregister(netdev_vport->dev);
-	netdev_upper_dev_unlink(netdev_vport->dev, get_dpdev(vport->dp));
+	netdev_upper_dev_unlink(netdev_vport->dev,
+				netdev_master_upper_dev_get(netdev_vport->dev));
 	dev_set_promiscuity(netdev_vport->dev, -1);
+}
+
+static void netdev_destroy(struct vport *vport)
+{
+	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
+
+	rtnl_lock();
+	if (netdev_vport->dev->priv_flags & IFF_OVS_DATAPATH)
+		ovs_netdev_detach_dev(vport);
 	rtnl_unlock();
 
 	call_rcu(&netdev_vport->rcu, free_port_rcu);
diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index dd298b5..8df01c11 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -39,5 +39,6 @@ netdev_vport_priv(const struct vport *vport)
 }
 
 const char *ovs_netdev_get_name(const struct vport *);
+void ovs_netdev_detach_dev(struct vport *);
 
 #endif /* vport_netdev.h */
-- 
1.7.9.5

^ permalink raw reply related

* Re: [ovs-dev] [PATCH v2 net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-14  3:23 UTC (permalink / raw)
  To: Cong Wang; +Cc: dev, netdev
In-Reply-To: <slrnl5m3oc.3lg.xiyou.wangcong@linux-6brj.site>

On Sun, Oct 13, 2013 at 2:22 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Sat, 12 Oct 2013 at 01:12 GMT, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> @@ -87,7 +81,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>       if (!vport)
>>               return NOTIFY_DONE;
>>
>> -     if (event == NETDEV_UNREGISTER) {
>> +     if (event == NETDEV_UNREGISTER && dev->priv_flags & IFF_OVS_DATAPATH) {
>> +             /* upper_dev_unlink and decrement promisc immediately */
>> +             ovs_netdev_detach_dev(vport);
>> +
>> +             /* schedule vport destroy, dev_put and genl notification */
>
> ovs_netdev_get_vport() already checks IFF_OVS_DATAPATH flag before this 'if'.

:) good point. will do v3.

^ permalink raw reply

* Re: [PATCH 2/3] net: bpf jit: x86: optimize choose_load_func error path
From: Vladimir Murzin @ 2013-10-14  2:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, edumazet, av1474
In-Reply-To: <1381682194.3392.42.camel@edumazet-glaptop.roam.corp.google.com>

On Sun, Oct 13, 2013 at 09:36:34AM -0700, Eric Dumazet wrote:
> On Sun, 2013-10-13 at 16:54 +0200, Vladimir Murzin wrote:
> > Macro CHOOSE_LOAD_FUNC returns handler for "any offset" if checks for K
> > were not passed. At the same time handlers for "any offset" cases make
> > the same checks against r_addr at run-time, that will always lead to
> > bpf_error.
> > 
> > Run-time checks are still necessary for indirect load operations, but
> > error path for absolute and mesh loads are worth to optimize during bpf
> > compile time.
> 
> I don't get the point.
> 
> What real world use case or problem are you trying to handle ?
> 
> bpf_error returns 0, so it seems your patch does the same.
> 
> A buggy BPF program should not expect us to 'save' a few cycles.
> 
> 
> 

Hi Eric!

There is no real world use case for me - it was eliminated by plain code
reading. The patch is not supposed to change behavior of BPF program - only
optimization of the error path. 
I agree with, you there is no significant reason for optimizations of rarely
used pice of code. However, it is not only saving pipeline cycles and I-cache
lines for usually "never taken" branch. In case this "never taken" branch is a
buggy part of BPF program, we can avoid extra instructions and save space for
the rest of BPF program - there is no need to care about seen flags of buggy
part anymore.

Anyway, if you still think it's not good enough - just throw it away ;)

Thanks
Vladimir

^ permalink raw reply

* [RFC PATCH v2 1/1] Workqueue based vhost workers
From: Bandan Das @ 2013-10-14  1:55 UTC (permalink / raw)
  To: kvm; +Cc: netdev, Michael Tsirkin, Jason Wang, Bandan Das
In-Reply-To: <1381715743-13672-1-git-send-email-bsd@redhat.com>

Signed-off-by: Bandan Das <bsd@makefile.in>
---
 drivers/vhost/net.c   |  25 +++++++++++
 drivers/vhost/vhost.c | 115 +++++++++++++++++++++++++++++++++++++++++++-------
 drivers/vhost/vhost.h |   6 +++
 3 files changed, 130 insertions(+), 16 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 831eb4f..f5307d7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -34,6 +34,10 @@ module_param(experimental_zcopytx, int, 0444);
 MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 		                       " 1 -Enable; 0 - Disable");
 
+static int cmwq_worker;
+module_param(cmwq_worker, int, 0444);
+MODULE_PARM_DESC(cmwq_worker, "Use cmwq for worker threads - Experimental, 1 - Enable; 0 - Disable");
+
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
@@ -694,6 +698,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	}
 
 	dev = &n->dev;
+	dev->use_wq = 0;
 	vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
 	vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
 	n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
@@ -706,6 +711,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
 	}
+
+	if (cmwq_worker)
+		dev->use_wq = 1;
+
 	r = vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
@@ -1123,14 +1132,30 @@ static struct miscdevice vhost_net_misc = {
 
 static int vhost_net_init(void)
 {
+	int ret = 0;
+
 	if (experimental_zcopytx)
 		vhost_net_enable_zcopy(VHOST_NET_VQ_TX);
+
+	if (cmwq_worker) {
+		ret = vhost_wq_init();
+		if (ret) {
+			pr_info("Enabling wq based vhost workers failed! "
+				 "Switching to device based worker instead\n");
+			cmwq_worker = 0;
+		} else
+		  pr_info("Enabled workqueues based vhost workers\n");
+	}
+
 	return misc_register(&vhost_net_misc);
 }
 module_init(vhost_net_init);
 
 static void vhost_net_exit(void)
 {
+	if (cmwq_worker)
+		vhost_wq_cleanup();
+
 	misc_deregister(&vhost_net_misc);
 }
 module_exit(vhost_net_exit);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 69068e0..ba7ff7a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -37,6 +37,9 @@ enum {
 #define vhost_used_event(vq) ((u16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((u16 __user *)&vq->used->ring[vq->num])
 
+static struct workqueue_struct *qworker;
+static void vhost_submission_workfn(struct work_struct *qwork);
+
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -162,7 +165,10 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		list_add_tail(&work->node, &dev->work_list);
 		work->queue_seq++;
 		spin_unlock_irqrestore(&dev->work_lock, flags);
-		wake_up_process(dev->worker);
+		if (dev->use_wq)
+			queue_work(qworker, &dev->qwork);
+		else
+			wake_up_process(dev->worker);
 	} else {
 		spin_unlock_irqrestore(&dev->work_lock, flags);
 	}
@@ -307,6 +313,9 @@ long vhost_dev_init(struct vhost_dev *dev,
 	INIT_LIST_HEAD(&dev->work_list);
 	dev->worker = NULL;
 
+	if (dev->use_wq)
+		INIT_WORK(&dev->qwork, vhost_submission_workfn);
+
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
 		vq->log = NULL;
@@ -367,7 +376,7 @@ EXPORT_SYMBOL_GPL(vhost_dev_has_owner);
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
+	struct task_struct *worker = NULL;
 	int err;
 
 	/* Is there an owner already? */
@@ -376,28 +385,35 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 		goto err_mm;
 	}
 
+	err = vhost_dev_alloc_iovecs(dev);
+	if (err)
+		goto err_cgroup;
+
 	/* No owner, become one */
 	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
-		goto err_worker;
-	}
 
-	dev->worker = worker;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
+	if (!dev->use_wq) {
+		worker = kthread_create(vhost_worker,
+					dev, "vhost-%d", current->pid);
+		if (IS_ERR(worker)) {
+			err = PTR_ERR(worker);
+			goto err_worker;
+		}
 
-	err = vhost_attach_cgroups(dev);
-	if (err)
-		goto err_cgroup;
+		dev->worker = worker;
+		/* avoid contributing to loadavg */
+		wake_up_process(worker);
 
-	err = vhost_dev_alloc_iovecs(dev);
-	if (err)
-		goto err_cgroup;
+		err = vhost_attach_cgroups(dev);
+		if (err)
+			goto err_cgroup;
+	} /* else don't worry, we are using wqs for vhost work */
 
 	return 0;
+
 err_cgroup:
-	kthread_stop(worker);
+	if (worker)
+		kthread_stop(worker);
 	dev->worker = NULL;
 err_worker:
 	if (dev->mm)
@@ -1539,6 +1555,73 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_disable_notify);
 
+static void vhost_submission_workfn(struct work_struct *qwork)
+{
+	struct vhost_dev *dev =
+		container_of(qwork, struct vhost_dev, qwork);
+	struct vhost_work *work = NULL;
+	unsigned uninitialized_var(seq);
+	struct mm_struct *prev_mm = NULL;
+	mm_segment_t oldfs = get_fs();
+
+	set_fs(USER_DS);
+
+	for (;;) {
+
+		spin_lock_irq(&dev->work_lock);
+
+		if (list_empty(&dev->work_list)) {
+			spin_unlock(&dev->work_lock);
+			break;
+		}
+
+		work = list_first_entry(&dev->work_list,
+					struct vhost_work, node);
+		list_del_init(&work->node);
+		seq = work->queue_seq;
+
+		if (prev_mm != dev->mm) {
+			if (prev_mm)
+				unuse_mm(prev_mm);
+			prev_mm = dev->mm;
+			use_mm(prev_mm);
+		}
+
+		spin_unlock_irq(&dev->work_lock);
+
+		if (work) {
+			work->fn(work);
+
+			spin_lock_irq(&dev->work_lock);
+			work->done_seq = seq;
+			if (work->flushing)
+				wake_up_all(&work->done);
+			spin_unlock_irq(&dev->work_lock);
+
+		}
+	}
+
+	if (prev_mm)
+		unuse_mm(prev_mm);
+	set_fs(oldfs);
+}
+
+int vhost_wq_init(void)
+{
+	qworker = alloc_workqueue("vhost_worker",
+			WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
+	if (!qworker)
+		return -ENOMEM;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vhost_wq_init);
+
+void vhost_wq_cleanup(void)
+{
+	destroy_workqueue(qworker);
+}
+EXPORT_SYMBOL_GPL(vhost_wq_cleanup);
+
 static int __init vhost_init(void)
 {
 	return 0;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4465ed5..3f6c147 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -125,6 +125,8 @@ struct vhost_dev {
 	spinlock_t work_lock;
 	struct list_head work_list;
 	struct task_struct *worker;
+	int use_wq;
+	struct work_struct qwork;
 };
 
 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
@@ -161,6 +163,10 @@ bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
 
+/* Experimental cmwq decls */
+int  vhost_wq_init(void);
+void vhost_wq_cleanup(void);
+
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
 		if ((vq)->error_ctx)                               \
-- 
1.8.3.1

^ permalink raw reply related

* [RFC PATCH v2 0/1] Workqueue based vhost work scheduling
From: Bandan Das @ 2013-10-14  1:55 UTC (permalink / raw)
  To: kvm; +Cc: netdev, Michael Tsirkin, Jason Wang

This is a followup to RFC posted by Shirley Ma on 22 March 2012 :
NUMA aware scheduling per vhost thread patch [1]. This patch is against
3.12-rc4.

This is a step-down from the previous version in the sense that
this patch utilizes the workqueue mechanism instead of creating per-cpu
vhost threads, or in other words, the per-cpu threads are completely invisible
to vhost as they are the responsibility of the cmwq implementation. 

The workqueue implementation [2] maintains a pool of dedicated threads per CPU
that are used when work is queued. The user can control certain aspects of the 
work execution using special flags that can be passed along during the call
to alloc_workqueue. Based on this description, the approach is that instead of 
vhost creating per-cpu thread to address issues pointed out
in RFC v1, we just let the cmwq mechanism do the heavy lifting for us.
The end result is that the changes in v2 are substantially smaller compared
to v1.

The major changes wrt v1 :
 - A module param called cmwq_worker, that when enabled, uses the wq 
   backend
 - vhost doesn't manage any per cpu threads anymore, trust wq backend to 
   do the right thing.
 - A significant part of v1 was to decide where to run the job - this is 
   gone now for reasons discussed above.

Testing :

As of now, some basic netperf testing varying only the message sizes 
keeping all other factors constant (to keep it simple) I however agree that 
this needs more testing for more concrete conclusions. 

The host is Nehalem 4 cores x 4 sockets with 4G memory, cpu 0-7 - numa node0
and cpu 8-16 = numa node1.
The host is running 4 guests with -smp 4 and -m 1G to keep it somewhat 
realistic. netperf in Guest 0 interacts with netserv running in the host
for our test results.

Results :

I noticed a common signature in all the tests except UDP_RR - 
for small message sizes, the workqueue implementation has a slightly better
throughput, however with increase in message size, the throughput
degrades slightly compared to the unpatched version. I suspect that the 
vhost_submission_workfn can be modified to make this better or there could be 
other factors that I still haven't thought about. Ofcourse, we shouldn't 
forget the important condition that we are not running on a vhost 
specific dedicated thread anymore.

UDP_RR however, exhibited better results constantly for the wq version.

I include the figures for just TCP_STREAM and UDP_RR below :

TCP_STREAM

Size      Throughput (Without patch)      Throughput (With patch)
bytes          10^6bytes/sec                 10^6bytes/sec
--------------------------------------------------------------------------
256                2695.22                     2793.14
512                5682.10                     5896.34
1024               7511.18                     7295.96
2048               8197.94                     7564.50
4096               8764.95                     7822.98
8192               8205.89                     8046.49
16384              11495.72                    11101.35

UDP_RR
Size            (Without patch)            (With patch)
bytes            Trans/sec                  Trans/sec
--------------------------------------------------------------------------
256                10966.77	             14842.16
512                9930.06                   14747.76
1024               10587.85	             14544.10
2048               7172.34		     13790.56
4096               7628.35		     13333.39
8192               5663.10                   11916.82
16384              6807.25		     9994.11

I had already discussed my results with Michael privately, so sorry for 
the duplicate information, Michael!

[1] http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
[2] Documentation/workqueue.txt 

Bandan Das (1):
  Workqueue based vhost workers

 drivers/vhost/net.c   |  25 +++++++++++
 drivers/vhost/vhost.c | 115 +++++++++++++++++++++++++++++++++++++++++++-------
 drivers/vhost/vhost.h |   6 +++
 3 files changed, 130 insertions(+), 16 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [RFC] ipv6: always join solicited-node address
From: Hannes Frederic Sowa @ 2013-10-14  0:55 UTC (permalink / raw)
  To: Bjørn Mork; +Cc: netdev
In-Reply-To: <1381685064-24805-1-git-send-email-bjorn@mork.no>

On Sun, Oct 13, 2013 at 07:24:24PM +0200, Bjørn Mork wrote:
> RFC 4861 section 7.2.1 "Interface Initialization" says:
> 
>    When a multicast-capable interface becomes enabled, the node MUST
>    join the all-nodes multicast address on that interface, as well as
>    the solicited-node multicast address corresponding to each of the IP
>    addresses assigned to the interface.
> 
> The current dependency on IFF_NOARP seems unwarranted. We need to
> listen on the solicited-node address whether or not we intend to
> initiate Neigbour Discovery ourselves.
> 
> This fixes a bug where Linux fails to respond to received Neigbour
> Solicitations on multicast capable links when IFF_NOARP is set.
> 
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
> I am not at all sure about this... Comments are appreciated.
> 
> The observed problem is a MBIM mobile broadband modem sending NS
> to the host.  MBIM is a point-to-point USB protocol which does not
> have any L2 headers at all.  It can only transport IPv4 or IPv6
> packets.  So for IPv4 there is no question at all:  ARP just
> cannot be transported. The driver emulates an ethernet interface,
> setting IFF_NOARP to make sure the upper layers doesn't attempt
> to resolve the neighbours non-existing L2 addresses.
> 
> But then there is this modem which sends IPv6 Neigbour
> Solicitations to the host over the MBIM transport. The link
> layer addresses are meaningless, but it seems the modem still
> expects an answer.  Which we will not currently provide, because
> the NS is addressed to a solicited-node address we don't listen
> to.
> 
> So this patch seems like a quick-fix to that problem.  But it does
> change the semantics of IFF_NOARP, making us reply to NS even if
> this flag is set.  Which probably is wrong?

IFF_NOARP seems to be a bit messed up in ipv6. Your patch seems fine to
me but I would add protection to the ndisc_rcv and sending routines to
do nothing if IFF_NOARP is set for that interface.

So it would be possible that you could resolve this issue by just issuing
an "ip link set arp on dev <interface>" and won't have hassle with racing
interface initialization.

Is this a specific bug of the modem you are using or are all devices
powered by this driver like this?

Greetings,

  Hannes

^ permalink raw reply

* Re: [PATCH] staging: octeon-ethernet: trivial: Avoid OOPS if phydev is not set
From: Greg KH @ 2013-10-13 21:28 UTC (permalink / raw)
  To: Sebastian Pöhn; +Cc: support, netdev, driverdev-devel
In-Reply-To: <1381690794.2848.11.camel@alpha.Speedport_W723_V_Typ_A_1_00_098>

On Sun, Oct 13, 2013 at 08:59:54PM +0200, Sebastian Pöhn wrote:
> A zero pointer deref on priv->phydev->link was causing oops on our systems.
> Might be related to improper configuration but we should fail gracefully here ;-)
> 
> Signed-off-by: Sebastian Poehn <sebastian.poehn@googlemail.com>
> 
> ---
> 
> diff --git a/drivers/staging/octeon/ethernet-mdio.c b/drivers/staging/octeon/ethernet-mdio.c
> index 83b1030..bc8c503 100644
> --- a/drivers/staging/octeon/ethernet-mdio.c
> +++ b/drivers/staging/octeon/ethernet-mdio.c
> @@ -121,6 +121,9 @@ static void cvm_oct_adjust_link(struct net_device *dev)
>         struct octeon_ethernet *priv = netdev_priv(dev);
>         cvmx_helper_link_info_t link_info;
>  
> +       if(!priv->phydev)
> +               return ;

Please always run your patches through the scripts/checkpatch.pl tool so
that maintainers don't have to point out the obvious coding syle errors
by hand each time :)

Care to try again?

Also, how was phydev NULL?  What was causing that?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v2 net-next] openvswitch: fix vport-netdev unregister
From: Cong Wang @ 2013-10-13 21:22 UTC (permalink / raw)
  To: netdev; +Cc: dev
In-Reply-To: <1381540347-3679-1-git-send-email-ast@plumgrid.com>

On Sat, 12 Oct 2013 at 01:12 GMT, Alexei Starovoitov <ast@plumgrid.com> wrote:
> @@ -87,7 +81,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>  	if (!vport)
>  		return NOTIFY_DONE;
>  
> -	if (event == NETDEV_UNREGISTER) {
> +	if (event == NETDEV_UNREGISTER && dev->priv_flags & IFF_OVS_DATAPATH) {
> +		/* upper_dev_unlink and decrement promisc immediately */
> +		ovs_netdev_detach_dev(vport);
> +
> +		/* schedule vport destroy, dev_put and genl notification */

ovs_netdev_get_vport() already checks IFF_OVS_DATAPATH flag before this 'if'.

^ permalink raw reply

* Re: [PATCH 2/2] ixgbe: enable l2 forwarding acceleration for macvlans
From: John Fastabend @ 2013-10-13 20:48 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, Andy Gospodarek, David Miller
In-Reply-To: <1381517037-26007-3-git-send-email-nhorman@tuxdriver.com>

On 10/11/2013 11:43 AM, Neil Horman wrote:
> Now that l2 acceleration ops are in place from the prior patch, enable ixgbe to
> take advantage of these operations.  Allow it to allocate queues for a macvlan
> so that when we transmit a frame, we can do the switching in hardware inside the
> ixgbe card, rather than in software.
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: john.fastabend@gmail.com
> CC: Andy Gospodarek <andy@greyhouse.net>
> CC: "David S. Miller" <davem@davemloft.net>
> ---

Neil, I'm fairly sure I can simplify this patch further. I'll be able
to take a shot at it Tuesday if you don't mind waiting.

I can resubmit the series then and preserve your signed-off on at least
the first patch. Seem reasonable?

.John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH 1/2] net: Add layer 2 hardware acceleration operations for macvlan devices
From: John Fastabend @ 2013-10-13 20:46 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, Andy Gospodarek, David Miller
In-Reply-To: <1381517037-26007-2-git-send-email-nhorman@tuxdriver.com>

On 10/11/2013 11:43 AM, Neil Horman wrote:
> Add a operations structure that allows a network interface to export the fact
> that it supports package forwarding in hardware between physical interfaces and
> other mac layer devices assigned to it (such as macvlans).  this operaions
> structure can be used by virtual mac devices to bypass software switching so
> that forwarding can be done in hardware more efficiently.
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: john.fastabend@gmail.com
> CC: Andy Gospodarek <andy@greyhouse.net>
> CC: "David S. Miller" <davem@davemloft.net>
> ---

[...]

>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2ddb48d..1710fdb 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -426,9 +426,9 @@ struct sk_buff {
>   	char			cb[48] __aligned(8);
>
>   	unsigned long		_skb_refdst;
> -#ifdef CONFIG_XFRM
> +

Is this a hold-over from the previous patches? 'sp' isn't touched
anywhere else so put the ifdef/endif back.

>   	struct	sec_path	*sp;
> -#endif
> +
>   	unsigned int		len,
>   				data_len;
>   	__u16			mac_len,
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5c713f2..ecad8c2 100644

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH v2 tip/core/rcu 07/13] ipv6/ip6_tunnel: Apply rcu_access_pointer() to avoid sparse false positive
From: Hannes Frederic Sowa @ 2013-10-13 20:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, Eric Dumazet, Josh Triplett, linux-kernel,
	mingo, laijs, dipankar, akpm, niv, tglx, peterz, rostedt,
	dhowells, edumazet, darren, fweisbec, sbw, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev
In-Reply-To: <20131013111439.GE5790@linux.vnet.ibm.com>

On Sun, Oct 13, 2013 at 04:14:39AM -0700, Paul E. McKenney wrote:
> On Sat, Oct 12, 2013 at 07:42:18PM +0000, Mathieu Desnoyers wrote:
> > ----- Original Message -----
> > > On Sat, Oct 12, 2013 at 06:43:45PM +0200, Hannes Frederic Sowa wrote:
> > > > Regarding the volatile access, I hope that the C11 memory model
> > > > and enhancements to the compiler will some day provide a better
> > > > way to express the semantics of what is tried to express here
> > > > (__atomic_store_n/__atomic_load_n with the accompanied memory model,
> > > > which could be even weaker to what a volatile access would enfore
> > > > now and could guarantee atomic stores/loads).
> > > 
> > > I just played around a bit more. Perhaps we could try to warn of silly
> > > usages of ACCESS_ONCE():
> > > 
> > > --- a/include/linux/compiler.h
> > > +++ b/include/linux/compiler.h
> > > @@ -349,7 +349,11 @@ void ftrace_likely_update(struct ftrace_branch_data *f,
> > > int val, int expect);
> > >   * use is to mediate communication between process-level code and irq/NMI
> > >   * handlers, all running on the same CPU.
> > >   */
> > > -#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> > > +#define ACCESS_ONCE(x) (*({						\
> > > +		compiletime_assert(sizeof(typeof(x)) <= sizeof(typeof(&x)), \
> > > +				   "ACCESS_ONCE likely not atomic");	\
> > 
> > AFAIU, ACCESS_ONCE() is not meant to ensure atomicity of load/store,
> > but rather merely ensures that the compiler will not merge nor refetch
> > accesses. I don't think the assert check you propose is appropriate with
> > respect to the ACCESS_ONCE() semantic.
> 
> I am with Mathieu on this one, at least unless there is some set of actual
> bugs already in the kernel that these length checks would find.

I guess my wording of "ACCESS_ONCE likely not atomic" was misplaced. Something
like volatile access to memory larger than the processor register size is
probably not what you intended. Use atomics or proper locking. ;)
And maybe that is not even correct.

> /me wonders about structs of size 3, 5, 6, and 7...

Checked a x86_64 allyesconfig build with sizes above pointer size and odd
parity and nothing broke.

Greetings,

  Hannes

^ permalink raw reply

* Re: [PATCH] rtlwifi: Add new firmware files for rtl8188eu
From: Ben Hutchings @ 2013-10-13 20:07 UTC (permalink / raw)
  To: Larry Finger; +Cc: dwmw2, netdev, linux-wireless
In-Reply-To: <52549519.1010207@lwfinger.net>

[-- Attachment #1: Type: text/plain, Size: 626 bytes --]

On Tue, 2013-10-08 at 18:28 -0500, Larry Finger wrote:
> The vendor driver RTL8188EUS_linux_v4.1.4_6773.20130222 contains
> firmware in the form of data statements. This info has been extracted
> into a binary file.
> 
> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
> ---
>   WHENCE                  |   9 +++++++++
>   rtlwifi/rtl8188eufw.bin | Bin 0 -> 13904 bytes
>   2 files changed, 9 insertions(+)
>   create mode 100644 rtlwifi/rtl8188eufw.bin
[...]

Applied, thanks.

Ben.

-- 
Ben Hutchings
Beware of bugs in the above code;
I have only proved it correct, not tried it. - Donald Knuth

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* Re: [RFC] net: stmmac: keep RXC running in LPI mode to avoid system overload
From: Romain Baeriswyl @ 2013-10-13 20:02 UTC (permalink / raw)
  To: Giuseppe CAVALLARO
  Cc: David S. Miller, netdev, linux-kernel, Christian Ruppert
In-Reply-To: <52557606.8030201@st.com>

Hello Guiseppe,

Thanks for your answer. Please find below some details and answers.

> > In order to avoid system overload, the clock RXC from the Phy
> > should not be
> > stopped when in LPI mode.
> >
> > With the RTL8211E PHY which support EEE mode and with Apple Airport
> > Extreme that
> > supports it also, the kernel get frozen as soon as some Ethernet
> > transfers are
> 
> 
> hmm, I have a board with this transceiver so I could do some tests
> this could take a while, unfortunately.
> 
> > on-going. System seems to be overloaded with too many interrupts.
> > The 'top'
> > command reports often around ~80% irq.
> 
> do you mean lpi mac interrupts?

By disabling the LPI mode with ethtool --set-eee eth0 eee off, the 
interrupt overload remains. Both counters irq_rx_path_in_lpi_mode_n and 
irq_rx_path_exit_lpi_mode_n continue to run meaning the PHY continue
to put the RX path in LPI mode.

Only way I found to get ride of the overload is to keep the RX_CLK running
in LPI mode. In this configuration, the RX link still continue to go in
LPI mode and the two above counters continue to run.

> >
> > By letting the RXC clock running even in LPI mode as proposed
> > below, the issue
> > seems solved. Is it the right way to proceed?
> 
> For EEE capability, RX_CLK may be halted for this reason i used it as
> default in the stmmac and never seen your issue.
> 
> >
> > What is the power impact to not stop RXC in LPI mode?
> 
> I can point you to "22.2.2.8a Receive direction LPI transition"
> in IEEE802-3az... where is it reported that the PHY  halt the RX_CLK
> in LPI mode.

Actually the PHY "may" stop RX_CLK.

> 
> May I suspect some issues on your HW? or disable it with ethtool
> 

Yes, it could be HW issue. It would be very useful if you could reproduce 
the setup using a SoC with "DesignWare Cores Ethernet MAC" core, 
the RTL8211E PHY and an Apple Airport Extreme. The issue could well be between 
the controller and the PHY.

Which Ethernet switch or router did you use to test the EEE mode? 
I may try these equipments as well.

Romain

^ permalink raw reply

* [PATCH] net/ethernet: cpsw: Bugfix interrupts before enabling napi
From: Markus Pargmann @ 2013-10-13 19:17 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Fainelli, Mugunthan V N, Peter Korsgaard,
	linux-arm-kernel, netdev, kernel, Markus Pargmann

If interrupts happen before napi_enable was called, the driver will not
work as expected. Network transmissions are impossible in this state.
This bug can be reproduced easily by restarting the network interface in
a loop. After some time any network transmissions on the network
interface will fail.

This patch fixes the bug by enabling napi before enabling the network
interface interrupts.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
---
 drivers/net/ethernet/ti/cpsw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 804846e..fccd9d4 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1169,9 +1169,9 @@ static int cpsw_ndo_open(struct net_device *ndev)
 		}
 	}
 
+	napi_enable(&priv->napi);
 	cpdma_ctlr_start(priv->dma);
 	cpsw_intr_enable(priv);
-	napi_enable(&priv->napi);
 	cpdma_ctlr_eoi(priv->dma, CPDMA_EOI_RX);
 	cpdma_ctlr_eoi(priv->dma, CPDMA_EOI_TX);
 
-- 
1.8.4.rc3

^ permalink raw reply related

* RE: INVESTMENT/ RELOCATION ASSISTANCE. 13th/10/2013
From: Mrs. Maryann Jamila Hussein. @ 2013-10-13 19:04 UTC (permalink / raw)
  To: Recipients

Dear Beloved, 

I am Mrs. Maryann Jamila Hussein a Teacher and a Muslim Convert here in Syria,i had sent a previous mail which i am not sure you got. I need your assistance to invest and help me relocate my 3 kids who are 17 years and below, so that they can get a better life there in your country due to the on going crises here in Syria.

I need your trust, before the death of my husband we had a savings with an Indian Bank, so money is not the issue. 

I got your reference in my search for someone who suits my 
purpose.If you can help me reply, let me know.

Regards,

Mrs. Maryann Jamila Hussein.
=====================================

^ permalink raw reply

* [PATCH] staging: octeon-ethernet: trivial: Avoid OOPS if phydev is not set
From: Sebastian Pöhn @ 2013-10-13 18:59 UTC (permalink / raw)
  To: driverdev-devel; +Cc: support, netdev

A zero pointer deref on priv->phydev->link was causing oops on our systems.
Might be related to improper configuration but we should fail gracefully here ;-)

Signed-off-by: Sebastian Poehn <sebastian.poehn@googlemail.com>

---

diff --git a/drivers/staging/octeon/ethernet-mdio.c b/drivers/staging/octeon/ethernet-mdio.c
index 83b1030..bc8c503 100644
--- a/drivers/staging/octeon/ethernet-mdio.c
+++ b/drivers/staging/octeon/ethernet-mdio.c
@@ -121,6 +121,9 @@ static void cvm_oct_adjust_link(struct net_device *dev)
        struct octeon_ethernet *priv = netdev_priv(dev);
        cvmx_helper_link_info_t link_info;
 
+       if(!priv->phydev)
+               return ;
+
        if (priv->last_link != priv->phydev->link) {
                priv->last_link = priv->phydev->link;
                link_info.u64 = 0;

^ permalink raw reply related

* CONTACT MY SECRETARY,
From: Musa Mohamed Ali @ 2013-10-13 18:49 UTC (permalink / raw)


Dear Friend,

I'm sorry but happy to inform you about my success in getting those
funds transferred under the cooperation of a new partner from  London
though i tried my best to involve you in the business but God decided
the whole situations. Presently I’m in London for investment projects
with my own share of the total sum. Meanwhile, I didn't forget your
past efforts and attempts to assist me in transferring those funds
despite that it failed us some how.

Now contact my secretary in  Burkina Faso  her name is Ms. Elizabeth
Ibrahim on her e-mail address below  ( miss.elizabeth0001@gmail.com )
Ask her to send you the total of $1, 000.000.00 which i kept for your
compensation for all the past efforts and attempts to assist me in
this matter. I appreciated all your efforts at that time very much. So
feel free and get in touched with my secretary Ms. Elizabeth Ibrahim.
And instruct her where to send the amount to you. Please do let me
know immediately you receive it so that we can share the joy after all
the suffer ness at that time.

in the moment, I’m very busy here because of the investment projects
which I and the new partner are having at hand, finally, remember that
I had forwarded instruction to the secretary on your behalf to receive
that money, so feel free to get in touch with Ms. Elizabeth Ibrahim.
She will send the amount to you without any delay.

Regards,
Musa.

^ permalink raw reply

* Re: IPv6 path MTU discovery broken
From: Hannes Frederic Sowa @ 2013-10-13 17:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Steinar H. Gunderson, netdev, edumazet, fan.du
In-Reply-To: <1381683063.3392.46.camel@edumazet-glaptop.roam.corp.google.com>

On Sun, Oct 13, 2013 at 09:51:03AM -0700, Eric Dumazet wrote:
> On Sun, 2013-10-13 at 12:40 +0200, Steinar H. Gunderson wrote:
> > [resending with proper quoting and cc]
> > 
> > On Mon, Oct 07, 2013 at 05:09:10AM +0200, Hannes Frederic Sowa wrote:
> > > Could you try to check (with e.g. nstat) if any of the following counters
> > > change if the icmp messages hit the host?
> > > 
> > > TcpExtOutOfWindowIcmps
> > > Icmp6InErrors
> > > TcpExtTCPMinTTLDrop
> > > TcpExtListenDrops
> > 
> > Hi,
> > 
> > I am now seeing the same problem against a 3.9.0 machine (that I've never had
> > problems with before). I looked through the four nstat counters, and none of
> > them are increasing. Here's /proc/net/ipv6_route from the server while it's
> > going on; the client is 2001:67c:a4:1:71f9:c84b:8b2a:c65b (second line) and
> > the server is 2001:67c:29f4::2007.
> 
> I am not sure, but the skb_is_gso() checks seem too lazy.
> 
> Why don't we check that gso_size+headers is below mtu, I dont know.

Was this intended for the UFO gso thread? ;)

^ permalink raw reply

* [RFC] ipv6: always join solicited-node address
From: Bjørn Mork @ 2013-10-13 17:24 UTC (permalink / raw)
  To: netdev; +Cc: Bjørn Mork

RFC 4861 section 7.2.1 "Interface Initialization" says:

   When a multicast-capable interface becomes enabled, the node MUST
   join the all-nodes multicast address on that interface, as well as
   the solicited-node multicast address corresponding to each of the IP
   addresses assigned to the interface.

The current dependency on IFF_NOARP seems unwarranted. We need to
listen on the solicited-node address whether or not we intend to
initiate Neigbour Discovery ourselves.

This fixes a bug where Linux fails to respond to received Neigbour
Solicitations on multicast capable links when IFF_NOARP is set.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
I am not at all sure about this... Comments are appreciated.

The observed problem is a MBIM mobile broadband modem sending NS
to the host.  MBIM is a point-to-point USB protocol which does not
have any L2 headers at all.  It can only transport IPv4 or IPv6
packets.  So for IPv4 there is no question at all:  ARP just
cannot be transported. The driver emulates an ethernet interface,
setting IFF_NOARP to make sure the upper layers doesn't attempt
to resolve the neighbours non-existing L2 addresses.

But then there is this modem which sends IPv6 Neigbour
Solicitations to the host over the MBIM transport. The link
layer addresses are meaningless, but it seems the modem still
expects an answer.  Which we will not currently provide, because
the NS is addressed to a solicited-node address we don't listen
to.

So this patch seems like a quick-fix to that problem.  But it does
change the semantics of IFF_NOARP, making us reply to NS even if
this flag is set.  Which probably is wrong?

Bjørn

 net/ipv6/addrconf.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index cd3fb30..aa2df3b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1658,7 +1658,7 @@ void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr)
 {
 	struct in6_addr maddr;

-	if (dev->flags&(IFF_LOOPBACK|IFF_NOARP))
+	if (!(dev->flags & IFF_MULTICAST))
 		return;

 	addrconf_addr_solict_mult(addr, &maddr);
@@ -1669,7 +1669,7 @@ void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr)
 {
 	struct in6_addr maddr;

-	if (idev->dev->flags&(IFF_LOOPBACK|IFF_NOARP))
+	if (!(idev->dev->flags & IFF_MULTICAST))
 		return;

 	addrconf_addr_solict_mult(addr, &maddr);
-- 
1.7.10.4

^ permalink raw reply related

* Re: IPv6 path MTU discovery broken
From: Eric Dumazet @ 2013-10-13 16:51 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: Hannes Frederic Sowa, netdev, edumazet, fan.du
In-Reply-To: <20131013104021.GB24967@sesse.net>

On Sun, 2013-10-13 at 12:40 +0200, Steinar H. Gunderson wrote:
> [resending with proper quoting and cc]
> 
> On Mon, Oct 07, 2013 at 05:09:10AM +0200, Hannes Frederic Sowa wrote:
> > Could you try to check (with e.g. nstat) if any of the following counters
> > change if the icmp messages hit the host?
> > 
> > TcpExtOutOfWindowIcmps
> > Icmp6InErrors
> > TcpExtTCPMinTTLDrop
> > TcpExtListenDrops
> 
> Hi,
> 
> I am now seeing the same problem against a 3.9.0 machine (that I've never had
> problems with before). I looked through the four nstat counters, and none of
> them are increasing. Here's /proc/net/ipv6_route from the server while it's
> going on; the client is 2001:67c:a4:1:71f9:c84b:8b2a:c65b (second line) and
> the server is 2001:67c:29f4::2007.

I am not sure, but the skb_is_gso() checks seem too lazy.

Why don't we check that gso_size+headers is below mtu, I dont know.

^ permalink raw reply

* Re: [PATCH 2/3] net: bpf jit: x86: optimize choose_load_func error path
From: Eric Dumazet @ 2013-10-13 16:36 UTC (permalink / raw)
  To: Vladimir Murzin; +Cc: netdev, davem, edumazet, av1474
In-Reply-To: <1381676065-2373-2-git-send-email-murzin.v@gmail.com>

On Sun, 2013-10-13 at 16:54 +0200, Vladimir Murzin wrote:
> Macro CHOOSE_LOAD_FUNC returns handler for "any offset" if checks for K
> were not passed. At the same time handlers for "any offset" cases make
> the same checks against r_addr at run-time, that will always lead to
> bpf_error.
> 
> Run-time checks are still necessary for indirect load operations, but
> error path for absolute and mesh loads are worth to optimize during bpf
> compile time.

I don't get the point.

What real world use case or problem are you trying to handle ?

bpf_error returns 0, so it seems your patch does the same.

A buggy BPF program should not expect us to 'save' a few cycles.

^ permalink raw reply

* Re: [PATCH net 0/4] bridge: Fix problems around the PVID
From: Toshiaki Makita @ 2013-10-13 16:11 UTC (permalink / raw)
  To: vyasevic
  Cc: Toshiaki Makita, David Miller, netdev, Fernando Luis Vazquez Cao,
	Patrick McHardy
In-Reply-To: <525807B9.2060201@redhat.com>

On Fri, 2013-10-11 at 10:14 -0400, Vlad Yasevich wrote:
> On 10/11/2013 03:34 AM, Toshiaki Makita wrote:
> > On Wed, 2013-10-09 at 11:01 -0400, Vlad Yasevich wrote:
> >> On 10/01/2013 07:56 AM, Toshiaki Makita wrote:
> >>> On Mon, 2013-09-30 at 12:01 -0400, Vlad Yasevich wrote:
> >>>> On 09/30/2013 07:46 AM, Toshiaki Makita wrote:
> >>>>> On Fri, 2013-09-27 at 14:10 -0400, Vlad Yasevich wrote:
> >>>>>> On 09/27/2013 01:11 PM, Toshiaki Makita wrote:
> >>>>>>> On Thu, 2013-09-26 at 10:22 -0400, Vlad Yasevich wrote:
> >>>>>>>> On 09/26/2013 06:38 AM, Toshiaki Makita wrote:
> >>>>>>>>> On Tue, 2013-09-24 at 13:55 -0400, Vlad Yasevich wrote:
> >>>>>>>>>> On 09/24/2013 01:30 PM, Toshiaki Makita wrote:
> >>>>>>>>>>> On Tue, 2013-09-24 at 09:35 -0400, Vlad Yasevich wrote:
> >>>>>>>>>>>> On 09/24/2013 07:45 AM, Toshiaki Makita wrote:
> >>>>>>>>>>>>> On Mon, 2013-09-23 at 10:41 -0400, Vlad Yasevich
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> On 09/17/2013 04:12 AM, Toshiaki Makita wrote:
> >>>>>>>>>>>>>>> On Mon, 2013-09-16 at 13:49 -0400, Vlad Yasevich
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>> On 09/13/2013 08:06 AM, Toshiaki Makita wrote:
> >>>>>>>>>>>>>>>>> On Thu, 2013-09-12 at 16:00 -0400, David
> >>>>>>>>>>>>>>>>> Miller wrote:
> >>>>>>>>>>>>>>>>>> From: Toshiaki Makita
> >>>>>>>>>>>>>>>>>> <makita.toshiaki@lab.ntt.co.jp> Date: Tue,
> >>>>>>>>>>>>>>>>>> 10 Sep 2013 19:27:54 +0900
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> There seem to be some undesirable
> >>>>>>>>>>>>>>>>>>> behaviors related with PVID. 1. It has no
> >>>>>>>>>>>>>>>>>>> effect assigning PVID to a port. PVID
> >>>>>>>>>>>>>>>>>>> cannot be applied to any frame regardless
> >>>>>>>>>>>>>>>>>>> of whether we set it or not. 2. FDB
> >>>>>>>>>>>>>>>>>>> entries learned via frames applied PVID
> >>>>>>>>>>>>>>>>>>> are registered with VID 0 rather than VID
> >>>>>>>>>>>>>>>>>>> value of PVID. 3. We can set 0 or 4095 as
> >>>>>>>>>>>>>>>>>>> a PVID that are not allowed in IEEE
> >>>>>>>>>>>>>>>>>>> 802.1Q. This leads interoperational
> >>>>>>>>>>>>>>>>>>> problems such as sending frames with VID
> >>>>>>>>>>>>>>>>>>> 4095, which is not allowed in IEEE
> >>>>>>>>>>>>>>>>>>> 802.1Q, and treating frames with VID 0 as
> >>>>>>>>>>>>>>>>>>> they belong to VLAN 0, which is expected
> >>>>>>>>>>>>>>>>>>> to be handled as they have no VID
> >>>>>>>>>>>>>>>>>>> according to IEEE 802.1Q.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Note: 2nd and 3rd problems are potential
> >>>>>>>>>>>>>>>>>>> and not exposed unless 1st problem is
> >>>>>>>>>>>>>>>>>>> fixed, because we cannot activate PVID
> >>>>>>>>>>>>>>>>>>> due to it.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Please work out the issues in patch #2 with
> >>>>>>>>>>>>>>>>>> Vlad and resubmit this series.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thank you.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'm hovering between whether we should fix
> >>>>>>>>>>>>>>>>> the issue by changing vlan 0 interface
> >>>>>>>>>>>>>>>>> behavior in 8021q module or enabling a bridge
> >>>>>>>>>>>>>>>>> port to sending priority-tagged frames, or
> >>>>>>>>>>>>>>>>> another better way.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If you could comment it, I'd appreciate it
> >>>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> BTW, I think what is discussed in patch #2 is
> >>>>>>>>>>>>>>>>> another problem about handling priority-tags,
> >>>>>>>>>>>>>>>>> and it exists without this patch set
> >>>>>>>>>>>>>>>>> applied. It looks like that we should prepare
> >>>>>>>>>>>>>>>>> another patch set than this to fix that
> >>>>>>>>>>>>>>>>> problem.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Should I include patches that fix the
> >>>>>>>>>>>>>>>>> priority-tags problem in this patch set and
> >>>>>>>>>>>>>>>>> resubmit them all together?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I am thinking that we might need to do it in
> >>>>>>>>>>>>>>>> bridge and it looks like the simplest way to do
> >>>>>>>>>>>>>>>> it is to have default priority regeneration
> >>>>>>>>>>>>>>>> table (table 6-5 from 802.1Q doc).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> That way I think we would conform to the spec.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -vlad
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Unfortunately I don't think the default priority
> >>>>>>>>>>>>>>> regeneration table resolves the problem because
> >>>>>>>>>>>>>>> IEEE 802.1Q says that a VLAN-aware bridge can
> >>>>>>>>>>>>>>> transmit untagged or VLAN-tagged frames only (the
> >>>>>>>>>>>>>>> end of section 7.5 and 8.1.7).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> No mechanism to send priority-tagged frames is
> >>>>>>>>>>>>>>> found as far as I can see the standard. I think
> >>>>>>>>>>>>>>> the regenerated priority is used for outgoing
> >>>>>>>>>>>>>>> PCP field only if egress policy is not untagged
> >>>>>>>>>>>>>>> (i.e. transmitting as VLAN-tagged), and unused if
> >>>>>>>>>>>>>>> untagged (Section 6.9.2 3rd/4th Paragraph).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we want to transmit priority-tagged frames
> >>>>>>>>>>>>>>> from a bridge port, I think we need to implement
> >>>>>>>>>>>>>>> a new (optional) feature that is above the
> >>>>>>>>>>>>>>> standard, as I stated previously.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> How do you feel about adding a per-port policy
> >>>>>>>>>>>>>>> that enables a bridge to send priority-tagged
> >>>>>>>>>>>>>>> frames instead of untagged frames when egress
> >>>>>>>>>>>>>>> policy for the port is untagged? With this
> >>>>>>>>>>>>>>> change, we can transmit frames for a given vlan
> >>>>>>>>>>>>>>> as either all untagged, all priority-tagged or
> >>>>>>>>>>>>>>> all VLAN-tagged.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> That would work.  What I am thinking is that we do
> >>>>>>>>>>>>>> it by special casing the vid 0 egress policy
> >>>>>>>>>>>>>> specification.  Let it be untagged by default and
> >>>>>>>>>>>>>> if it is tagged, then we preserve the priority
> >>>>>>>>>>>>>> field and forward it on.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This keeps the API stable and doesn't require
> >>>>>>>>>>>>>> user/admin from knowing exactly what happens.
> >>>>>>>>>>>>>> Default operation conforms to the spec and allows
> >>>>>>>>>>>>>> simple change to make it backward-compatible.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What do you think.  I've done a simple prototype of
> >>>>>>>>>>>>>> this an it seems to work with the VMs I am testing
> >>>>>>>>>>>>>> with.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Are you saying that - by default, set the 0th bit of
> >>>>>>>>>>>>> untagged_bitmap; and - if we unset the 0th bit and
> >>>>>>>>>>>>> set the "vid"th bit, we transmit frames classified as
> >>>>>>>>>>>>> belonging to VLAN "vid" as priority-tagged?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If so, though it's attractive to keep current API,
> >>>>>>>>>>>>> I'm worried about if it could be a bit confusing and
> >>>>>>>>>>>>> not intuitive for kernel/iproute2 developers that VID
> >>>>>>>>>>>>> 0 has a special meaning only in the egress policy.
> >>>>>>>>>>>>> Wouldn't it be better to adding a new member to
> >>>>>>>>>>>>> struct net_port_vlans instead of using VID 0 of
> >>>>>>>>>>>>> untagged_bitmap?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Or are you saying that we use a new flag in struct
> >>>>>>>>>>>>> net_port_vlans but use the BRIDGE_VLAN_INFO_UNTAGGED
> >>>>>>>>>>>>> bit with VID 0 in netlink to set the flag?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Even in that case, I'm afraid that it might be
> >>>>>>>>>>>>> confusing for developers for the same reason. We are
> >>>>>>>>>>>>> going to prohibit to specify VID with 0 (and 4095) in
> >>>>>>>>>>>>> adding/deleting a FDB entry or a vlan filtering
> >>>>>>>>>>>>> entry, but it would allow us to use VID 0 only when a
> >>>>>>>>>>>>> vlan filtering entry is configured. I am thinking a
> >>>>>>>>>>>>> new nlattr is a straightforward approach to
> >>>>>>>>>>>>> configure it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> By making this an explicit attribute it makes vid 0 a
> >>>>>>>>>>>> special case for any automatic tool that would
> >>>>>>>>>>>> provision such filtering.  Seeing vid 0 would mean that
> >>>>>>>>>>>> these tools would have to know that this would have to
> >>>>>>>>>>>> be translated to a different attribute instead of
> >>>>>>>>>>>> setting the policy values.
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, I agree with you that we can do it by the way you
> >>>>>>>>>>> explained. What I don't understand is the advantage of
> >>>>>>>>>>> using vid 0 over another way such as adding a new
> >>>>>>>>>>> nlattr. I think we can indicate transmitting
> >>>>>>>>>>> priority-tags explicitly by such a nlattr. Using vid 0
> >>>>>>>>>>> seems to be easier to implement than a new nlattr, but,
> >>>>>>>>>>> for me, it looks less intuitive and more difficult to
> >>>>>>>>>>> maintain because we have to care about vid 0 instead of
> >>>>>>>>>>> simply ignoring it.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The point I am trying to make is that regardless of the
> >>>>>>>>>> approach someone has to know what to do when enabling
> >>>>>>>>>> priority tagged frames.  You proposal would require the
> >>>>>>>>>> administrator or config tool to have that knowledge.
> >>>>>>>>>> Example is: Admin does: bridge vlan set priority on dev
> >>>>>>>>>> eth0 Automated app: if (vid == 0) /* Turn on priority
> >>>>>>>>>> tagged frame support */
> >>>>>>>>>>
> >>>>>>>>>> My proposal would require the bridge filtering
> >>>>>>>>>> implementation to have it. user tool: bridge vlan add vid 0
> >>>>>>>>>> tagged Automated app:  No special case.
> >>>>>>>>>>
> >>>>>>>>>> IMO its better to have 1 piece code handling the special
> >>>>>>>>>> case then putting it multiple places.
> >>>>>>>>>
> >>>>>>>>> Thank you for the detailed explanation. Now I understand your
> >>>>>>>>> intention.
> >>>>>>>>>
> >>>>>>>>> I have one question about your proposal. I guess the way to
> >>>>>>>>> enable priority-tagged is something like bridge vlan add vid
> >>>>>>>>> 10 dev eth0 bridge vlan add vid 10 dev vnet0 pvid untagged
> >>>>>>>>> bridge vlan add vid 0 dev vnet0 tagged where vnet0 has sub
> >>>>>>>>> interface vnet0.0.
> >>>>>>>>>
> >>>>>>>>> Here the admin have to know the egress policy is applied to a
> >>>>>>>>> frame twice in a certain order when it is transmitted from
> >>>>>>>>> the port vnet0 attached, that is, first, a frame with vid 10
> >>>>>>>>> get untagged, and then, an untagged frame get
> >>>>>>>>> priority-tagged.
> >>>>>>>>>
> >>>>>>>>> This behavior looks difficult to know without previous
> >>>>>>>>> knowledge. Any good idea to avoid such a need for the admin's
> >>>>>>>>> additional knowledge?
> >>>>>>>>
> >>>>>>>> To me, the fact that there is vnet0.0 (or typically, there is
> >>>>>>>> eth0.0 in the guest or on the remote host) already tells the
> >>>>>>>> admin vlan 0 has to be tagged.  The fact that we codify this in
> >>>>>>>> the policy makes it explicit.
> >>>>>>>
> >>>>>>> My worry is that the admin might not be able to guess how to use
> >>>>>>> bridge commands to enable priority-tag without any additional
> >>>>>>> hint in "man bridge", "bridge vlan help", etc. I actually
> >>>>>>> couldn't hit upon such a usage before seeing example commands you
> >>>>>>> gave, because I had never think the egress policy could be
> >>>>>>> applied twice.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> However, I can see strong argument to be made for an addition
> >>>>>>>> egress policy attribute that could be for instance:
> >>>>>>>>
> >>>>>>>> bridge vlan add vid 10 dev eth0 pvid bridge vlan add vid 10 dev
> >>>>>>>> vnet0 pvid untagged prio_tag
> >>>>>>>>
> >>>>>>>> But this has the same connotations as wrt to egress policy.
> >>>>>>>> The 2 policies are applied: (1) untag the frame. (2) add
> >>>>>>>> priority_tag.
> >>>>>>>>
> >>>>>>>> (2) only happens if initial fame received on eth0 was priority
> >>>>>>>> tagged.
> >>>>>>>
> >>>>>>> If we do so, we will not be able to communicate using vlan 0
> >>>>>>> interface under a certain circumstance. Eth0 can be receive mixed
> >>>>>>> untagged and priority-tagged frames according to the network
> >>>>>>> element it is connected to: for example, Open vSwitch can send
> >>>>>>> such two kinds of frames from the same port even if original
> >>>>>>> incoming frames belong to the same vlan.
> >>>>>>
> >>>>>> Which priority would you assign to the frame that was received
> >>>>>> untagged?
> >>>>>
> >>>>> Untagged frame's priority is by default 0, so I think 0 is likely.
> >>>>>
> >>>>> 802.1Q 6.9.1 i) The received priority value and the drop_eligible
> >>>>> parameter value are the values in the M_UNITDATA.indication.
> >>>>>
> >>>>> M_UNITDATA is passed from ISS.
> >>>>>
> >>>>> 802.1Q 6.7.1 The priority parameter provided in a data indication
> >>>>> primitive shall take the value of the Default User Priority parameter
> >>>>> for the Port through which the MAC frame was received. The default
> >>>>> value of this parameter is 0, it may be set by management in which
> >>>>> case the capability to set it to any of the values 0 through 7 shall
> >>>>> be provided.
> >>>>>
> >>>>>>
> >>>>>>> In this situation, we can only receive frames that is
> >>>>>>> priority-tagged when received on eth0.
> >>>>>>
> >>>>>> Not sure I understand.  Let's look at this config: bridge vlan add
> >>>>>> vid 10 dev eth0 pvid bridge vlan add vid 10 dev vnet0 pvid untagged
> >>>>>> prio_tag
> >>>>>>
> >>>>>> Here, eth0 is allowed to receive vid 10 tagged, untagged, and
> >>>>>> prio_tagged (if we look at the patch 2 from this set). Now, frame
> >>>>>> is forwarded to vnet0. 1) if the frame had vid 10 in the tag or was
> >>>>>> untagged, it should probably be sent untagged. 2) if the frame had
> >>>>>> a priority tag, it should probably be sent as such.
> >>>>>>
> >>>>>> Now, I think a case could be made that if the frame had any
> >>>>>> priority markings in the vlan header, we should try to preserve
> >>>>>> those markings if prio_tag is turned on.  We can assume value of 0
> >>>>>> means not set.
> >>>>>
> >>>>> If we don't insert prio_tag when PCP is 0, we might receive mixed
> >>>>> priority-tagged and untagged frames on eth0.
> >>>>
> >>>> Right, and that's what you were trying to handle in your patch:
> >>>>
> >>>>> +		/* PVID is set on this port.  Any untagged or priority-tagged +
> >>>>> * ingress frame is considered to belong to this vlan. */
> >>>>
> >>>> So, in this case we are prepared to handle the "mixed" scenario on ingress.
> >>>>
> >>>>> Even if we are sending frames from eth0.0 with some priority other
> >>>>> than 0, we could receive frames with priority 0 or untagged on the
> >>>>> other side of the bridge.
> >>>>> For example, if we receive untagged arp reply on the bridge port, we
> >>>>> migit not be able to communicate with such an end station, because
> >>>>> untagged reply will not be passed to eth0.0.
> >>>>
> >>>> So the ARP request was sent tagged, but the reply came back untagged?
> >>>
> >>> Yes, it can happen.
> >>> These are problematic cases.
> >>>
> >>> Example 1:
> >>>               prio_tagged         prio_tagged
> >>> +------------+ ---> +------------+ ---> +----------+
> >>> |guest eth0.0|------|host1 Bridge|------|host2 eth0|
> >>> +------------+ <--- +------------+ <--- +----------+
> >>>                untagged            untagged
> >>>
> >>> Note: Host2 eth0, which is an interface on Linux, can receive
> >>> priority-tagged frames if it doesn't have vlan 0 interface (eth0.0).
> >>
> >> Hmm..  Just to see if this works, I ran the this scenario with
> >> a dumb switch in the middle, and it did not work as you noted.
> >> I then realized that one of the kernels was rather old and after
> >> updating it, behaved differently.  The communication still didn't
> >> work, but host2 behaved properly.
> >>
> >>>>
> >>>> How does that work when the end station is attached directly to the
> >>>> HW switch instead of a linux bridge?
> >>>>
> >>>> The station configures eth0.0 and sends priority-tagged traffic to
> >>>> the HW switch.  If the HW switch sends back untagged traffic, then
> >>>> the untagged traffic will never reach eth0.0.
> >>>
> >>> Currently we cannot communicate using eth0.0 via directly connected
> >>> 802.1Q conformed switch, because we never receive priority-tagged frames
> >>> from the switch.
> >>> It is not a problem of Linux bridge and is why I wondered whether it
> >>> should be fixed by bridge or vlan 0 interface.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> IMO, if prio_tag is configured, the bridge should send any
> >>>>>>> untagged frame as priority-tagged regardless of whatever it is on
> >>>>>>> eth0.
> >>>>>>
> >>>>>> Which priority would you use, 0?  You are not guaranteed to
> >>>>>> properly deliver the traffic then for a configuration such as: VM:
> >>>>>> eth0: 10.0.0.1/24 eth0.0: 10.0.1.1/24
> >>>>>
> >>>>> I'd like to use priority 0 for untagged frames.
> >>>>>
> >>>>> I am assuming that one of our goals is at least that eth0.0 comes to
> >>>>> be able to communicate with another end station. It seems to be hard
> >>>>> to use both eth0 and eth0.0 simultaneously.
> >>>>
> >>>> I understand, but I don't agree that we should always tag.
> >>>>
> >>>> Consider config:
> >>>>
> >>>>       hw switch <---> (eth0: Linux Bridge: eth1) <--- (em1.0:end station)
> >>>>
> >>>> If the end station sends priority-tagged traffic it should receive
> >>>> priority tagged traffic back.  Otherwise, untagged traffic may be
> >>>> dropped by the end station.  This is true whether it is connected to
> >>>> the hw switch or Linux bridge.
> >>>
> >>> Though such a behavior is generally not necessary as far as I can read
> >>> 802.1Q spec, it is essential for vlan 0 interface on Linux, I think.
> >>> My proposal aims to resolve it at least when we use Linux bridge.
> >>>
> >>> Example configuration:
> >>> 	bridge vlan add vid 10 dev eth1 pvid untagged
> >>> 	bridge vlan add vid 10 dev eth0
> >>> 	bridge vlan set prio_tag on dev eth1
> >>>
> >>> Intended behavior:
> >>>
> >>>           VID10-tagged                     prio_tagged
> >>> +---------+ <--- +------------------------+ <--- +-----------------+
> >>> |hw switch|------|eth0: Linux Bridge: eth1|------|em1.0:end station|
> >>> +---------+ ---> +------------------------+ ---> +-----------------+
> >>>           VID10-tagged                     prio_tagged
> >>>                                 (always if egress policy untagged)
> >>
> >> Ok, I think you've convinced me that this is the right approach. The
> >> only thing that I am not crazy about is the API.  I'd almost want to
> >> introduce a new flag that can be set in a 'vlan add' or 'vlan set'
> >> command that communicates a new policy.
> >
> > I'm glad that we reached a consensus on the approach :)
> >
> > I agree with you that the API is flag based.
> > I'm guessing your intention is that 'vlan add' means a per vlan per port
> > policy and 'vlan set' means a per port one, that is,
> > 	'vlan add': bridge vlan add vid 10 dev eth1 prio_tag
> > 	'vlan set': bridge vlan set prio_tag on dev eth1
> >
> > I think they can behave differently only when we set untagged to
> > multiple vlans on the same port.
> >
> > 'vlan add' example with vid 10 and 20:
> > 	bridge vlan add vid 10 dev eth1 pvid untagged prio_tag
> > 	bridge vlan add vid 10 dev eth0
> > 	bridge vlan add vid 20 dev eth1 untagged
> > 	bridge vlan add vid 20 dev eth2
> >
> >           VID10-tagged                  prio_tagged (from eth0)
> > +---------+ ---> +------------------------+ ---> +-----------------+
> > |hw switch|------|eth0                eth1|------|em1.0:end station|
> > +---------+      |      Linux Bridge      | ---> +-----------------+
> > +---------+      |                        | *untagged*
> > |hw switch|------|eth2                    | (from eth2)
> > +---------+ ---> +------------------------+
> >           VID20-tagged
> >
> 
> This is what I was thinking of, but I was actually considering that
> untagged and prio_tag can not co-exist for the same vlan as they don't
> really make sence together anymore.

You're right.
In this case 'untagged' for 'vid 10' is no longer necessary.

> 
> So one can do:
> 	bridge vlan add vid 10 dev eth1 pvid prio_tag
> 	bridge vlan add vid 20 dev eth1 untagged
> 
> and recieve VLAN 10 as priority tagged and vlan 20 as untagged.

Can you make a patch set implementing this?

I'd like to re-send this patch set related to PVID with more comments
about the unresolved vlan 0 interface problem and the prospect that it
will be addressed by another patch set of yours.

Is this procedure OK with you?

Thanks,

Toshiaki Makita

> 
> -vlad
> 
> >
> > 'vlan set' example with vid 10 and 20:
> > 	bridge vlan add vid 10 dev eth1 pvid untagged
> > 	bridge vlan add vid 10 dev eth0
> > 	bridge vlan add vid 20 dev eth1 untagged
> > 	bridge vlan add vid 20 dev eth2
> > 	bridge vlan set prio_tag on dev eth1
> >
> >           VID10-tagged                  prio_tagged (from eth0)
> > +---------+ ---> +------------------------+ ---> +-----------------+
> > |hw switch|------|eth0                eth1|------|em1.0:end station|
> > +---------+      |      Linux Bridge      | ---> +-----------------+
> > +---------+      |                        | prio_tagged
> > |hw switch|------|eth2                    | (from eth2)
> > +---------+ ---> +------------------------+
> >           VID20-tagged
> >
> > Em1.0 can always receive traffic from eth1 if we adopt 'vlan set'.
> > However, I cannot imagine when multiple untagged vlans is required, so
> > cannot figure out whether 'vlan add' is useful or harmful.
> > Anyway, both of approaches are OK with me.
> >
> > Thanks,
> > Toshiaki Makita
> >
> >>
> >> Thanks
> >> -vlad
> >>
> >>>
> >>> Thanks,
> >>>
> >>> Toshiaki Makita
> >>>
> >>>>
> >>>> -vlad
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Toshiaki Makita
> >>>>>
> >>>>>>
> >>>>>> -vlad
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I think I am ok with either approach.  Explicit vid 0 policy is
> >>>>>>>> easier for automatic provisioning.   The flag based one is
> >>>>>>>> easier for admin/ manual provisioning.
> >>>>>>>
> >>>>>>> Supposing we have to add something to help or man in any case, I
> >>>>>>> think flag based is better. The format below seems to suitable
> >>>>>>> for a per-port policy. bridge vlan set prio_tag on dev vnet0
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Toshiaki Makita
> >>>>>>>
> >>>>>>>>
> >>>>>>>> -vlad.
> >>>>>>>>
> >>>>>>>> -vlad
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks -vlad
> >>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> How it is implemented internally in the kernel isn't as
> >>>>>>>>>>>> big of an issue. We can do it as a separate flag or as
> >>>>>>>>>>>> part of existing policy.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -vlad
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -vlad
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> -- To unsubscribe from this list: send the
> >>>>>>>>>>>>>>>>>> line "unsubscribe netdev" in the body of a
> >>>>>>>>>>>>>>>>>> message to majordomo@vger.kernel.org More
> >>>>>>>>>>>>>>>>>> majordomo info at
> >>>>>>>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -- To unsubscribe from this list: send the line
> >>>>>>>>>>>>>>> "unsubscribe netdev" in the body of a message to
> >>>>>>>>>>>>>>> majordomo@vger.kernel.org More majordomo info at
> >>>>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox