Netdev List


* Re: Bonding
From: Jay Vosburgh @ 2014-02-11 17:55 UTC
  To: Gustavo Pimentel; +Cc: Veaceslav Falico, netdev@vger.kernel.org
In-Reply-To: <8532C3BD1ECBC64BA47CE43AFED6D38EC37E5B04@S103.efacec.pt>

Gustavo Pimentel <gustavo.pimentel@efacec.com> wrote:

>Hi Veaceslav,
>
>It's quite different from broadcast mode: each frame sent through the slaves has a Redundancy Control Trailer (RCT) attached. This trailer is composed of a LAN identifier, a sequence number, an LSDU size and a PRP suffix.
>Also, equipment with PRP capability has to periodically send a supervision frame to both LANs. Each device on the network has to keep track of the sequence numbers it receives: if it receives a sequence number from a specific device, for instance on LAN A, and that number is not in its internal table, the device should accept the frame and update the table. When it later receives the same sequence number from LAN B, the device should discard the frame, providing zero-downtime redundancy.
>
>I can supply you with information about this redundancy protocol, if you like. This type of network redundancy is now being widely deployed in electrical power stations (thermal and hydro, for example) and power transmission stations, instead of teaming / bonding that depends on RSTP for redundancy.
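
The accept/discard rule described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the bonding patch discussed in this thread: the names prp_table, prp_node and prp_accept are hypothetical, the table size is arbitrary, and only the last sequence number per sender is remembered, whereas a real implementation would probably track a window and age entries out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PRP_MAX_NODES 16	/* hypothetical table size, chosen for the sketch */

/* One entry per remote device: the last sequence number accepted from it. */
struct prp_node {
	uint8_t  mac[6];
	uint16_t last_seq;
	bool     in_use;
};

struct prp_table {
	struct prp_node nodes[PRP_MAX_NODES];
};

/*
 * Return true if the frame should be delivered, false if it is the
 * duplicate copy that arrived over the other LAN.
 */
static bool prp_accept(struct prp_table *t, const uint8_t mac[6], uint16_t seq)
{
	struct prp_node *free_slot = NULL;

	for (size_t i = 0; i < PRP_MAX_NODES; i++) {
		struct prp_node *n = &t->nodes[i];

		if (!n->in_use) {
			if (!free_slot)
				free_slot = n;
			continue;
		}
		if (memcmp(n->mac, mac, 6) == 0) {
			if (n->last_seq == seq)
				return false;	/* already seen: discard */
			n->last_seq = seq;	/* new number: accept and record */
			return true;
		}
	}

	/* Unknown sender: record it if there is room, and accept the frame. */
	if (free_slot) {
		memcpy(free_slot->mac, mac, 6);
		free_slot->last_seq = seq;
		free_slot->in_use = true;
	}
	return true;
}
```

With a zeroed table, the first call for a given sender and sequence number returns true and the second call with the same pair returns false, which matches the "accept from LAN A, discard from LAN B" behaviour in the description.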

	Are you aware that there is already an implementation of HSR
(High-availability Seamless Redundancy) in the linux kernel?  I believe
HSR and PRP are defined by the same standard (IEC 62439-3), and are
similar enough to interoperate to some degree.  Perhaps PRP would be
better implemented as a variant within the existing net/hsr/ framework.

	-J


>> -----Original Message-----
>> From: Veaceslav Falico [mailto:vfalico@redhat.com]
>> Sent: Tuesday, 11 February 2014 14:16
>> To: Gustavo Pimentel
>> Cc: netdev@vger.kernel.org
>> Subject: Re: Bonding
>> 
>> On Tue, Feb 11, 2014 at 01:53:32PM +0000, Gustavo Pimentel wrote:
>> >Hi,
>> 
>> Hi Gustavo,
>> 
>> >
>> >I'm writing to you because I have implemented a new mode (PRP, Parallel
>> Redundancy Protocol) for the bonding kernel driver. This new mode is quite simple. I
>> don't know if you have heard about PRP, but it's a new standard that makes it possible to
>> overcome any single network failure without affecting data transmission. The
>> general idea is to have two separate, very similar LANs (A & B) and to
>> transmit almost the same frame through both LANs; the end device
>> should accept one frame and discard the other according to a known mechanism.
>> 
>> Isn't that the current 'broadcast' mode, where every packet is transmitted over all
>> the slaves? After quick googling/reading I don't see any difference there, though I
>> might have missed something.
>> 
>> >
>> >I have implemented this new mode in the bonding driver, but I have some
>> difficulties:
>> >. Writing Linux drivers is quite new for me. I don't know if guidelines exist for
>> driver coding.
>> 
>> You can find everything under Documentation/, but without the code I can't tell you
>> exact documents. CodingStyle and SubmittingPatches might be the first ones.
>> 
>> Also, try CC-ing relevant people for more feedback, specifically bonding
>> maintainers.
>> 
>> >. I don't know how to submit the code to be included in the kernel repository.
>> >. Maybe another pair of eyes could help improve the code written for this
>> mode.
>> 
>> Try sending an RFC when net-next opens.
>> 
>> >
>> >I think my driver code is 99% complete. I'm currently testing with 3 pieces of equipment: 1
>> PC and 1 embedded device, both running my modified bonding driver, and a third-party
>> device called RedBox.
>> >
>> >Would you be interested in participating / helping this project?
>> >
>> >With my best regards,
>> >
>> >Gustavo Gama da Rocha Pimentel
>> >Power Systems Automation / Innovation & Development Efacec Engenharia e
>> >Sistemas, S.A.
>> >Phone: +351229403391
>> >Disclaimer

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: RFC: bridge get fdb by bridge device
From: Vlad Yasevich @ 2014-02-11 18:21 UTC
  To: Jamal Hadi Salim, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA58E9.906@mojatatu.com>

On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
> On 02/10/14 11:31, Vlad Yasevich wrote:
>> On 02/09/2014 10:06 AM, Jamal Hadi Salim wrote:
> 
> 
>>> +    ndm = nlmsg_data(cb->nlh);
>>> +    if (ndm->ndm_ifindex) {
>>
>> We get really lucky here that ndm_ifindex and ifi_index happen to map to
>> the same location.
>>
> 
> Didnt follow - but I have a feeling you are looking at the reference
> point of a bridge port.
> Note as per my response to John: The target is a bridge device, not
> a bridge port.
> 

No, this was more the point that the current iproute code sends an
ifinfomsg struct down, and you change that to send ndmsg struct.
This is risky, but we luck out since the index is at the same offset
in both structs.
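
The layout coincidence can be checked at compile time. The sketch below uses local mirrors of the two headers rather than the kernel UAPI includes, so that it builds anywhere; the field lists are written from memory of struct ifinfomsg and struct ndmsg and should be treated as an assumption, and the _mirror names and same_index_offset() are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct ifinfomsg (assumed layout, see <linux/rtnetlink.h>). */
struct ifinfomsg_mirror {
	unsigned char  ifi_family;
	unsigned char  ifi_pad;
	unsigned short ifi_type;
	int            ifi_index;	/* interface index */
	unsigned int   ifi_flags;
	unsigned int   ifi_change;
};

/* Local mirror of struct ndmsg (assumed layout, see <linux/neighbour.h>). */
struct ndmsg_mirror {
	uint8_t  ndm_family;
	uint8_t  ndm_pad1;
	uint16_t ndm_pad2;
	int32_t  ndm_ifindex;		/* interface index */
	uint16_t ndm_state;
	uint8_t  ndm_flags;
	uint8_t  ndm_type;
};

/* Fail the build if the two index fields ever stop lining up. */
_Static_assert(offsetof(struct ifinfomsg_mirror, ifi_index) ==
	       offsetof(struct ndmsg_mirror, ndm_ifindex),
	       "interface index must sit at the same offset in both headers");

/* Runtime form of the same check, returning 1 when the offsets match. */
static int same_index_offset(void)
{
	return offsetof(struct ifinfomsg_mirror, ifi_index) ==
	       offsetof(struct ndmsg_mirror, ndm_ifindex);
}
```

A check like this only documents the coincidence; it does not make it safe for one side to send one header while the other side parses a different one.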

> 
> 
>>
>> I agree with both of John's comments for the above code.
>> I think you can use ndo_dflt_fdb_dump() here and remove the first check
>> for IFF_EBRIDGE.
>>
> 
> Same comment i made to John. The goal is to emulate
> brctl showmacs <bridge>
> ndo_dflt_fdb_dump() gives me in theory all the bridge ports
> unicast and multicast MAC addresses. There is a possibility that
> the bridgeport is a bridge - in which case I can find out from
> user space and safely request for it directly instead of via
> its parent.
> 

But that would only happen if the user said:
  # bridge fdb show br eth0

If eth0 in this case is a hw bridge device, getting the device's
version of fdb data is exactly what would be expected, isn't it?

If you mean a 'software bridge' above, then that's not an issue
since that's a disallowed config.  You can't stack software bridges
without something in the middle like bond or vlan.

>> The only odd thing is that it would permit syntax like:
>>   # bridge fdb show br eth0
>> or
>>   # bridge fdb show br macvlan0
>>
>> but I think that's ok.
> 
> Ok, since both you and John point to macvlan - is that
> considered as something with an fdb? It doesnt forward
> packets between two devices.
> 

Yes, macvlan can forward data to other macvlans, but that's
not the interesting thing.
When you configure multiple macvlan devices on top of the
same hw device, one could think of the hw device as a sort
of a bridge.  It's not really, but you could define it in
those terms.  The fdb entries, in this case, contain the mac
addresses of the macvlan devices.

> 
> 
>>> diff --git a/bridge/fdb.c b/bridge/fdb.c
>>> index e2e53f1..f3073d6 100644
>>> --- a/bridge/fdb.c
>>> +++ b/bridge/fdb.c
>>> @@ -33,7 +33,7 @@ static void usage(void)
>>>       fprintf(stderr, "Usage: bridge fdb { add | append | del |
>>> replace }
>> ADDR dev DEV {self|master} [ temp ]\n"
>>>                   "              [router] [ dst IPADDR] [ vlan VID ]\n"
>>>                   "              [ port PORT] [ vni VNI ] [via DEV]\n");
>>> -    fprintf(stderr, "       bridge fdb {show} [ dev DEV ]\n");
>>> +    fprintf(stderr, "       bridge fdb {show} [ br BRDEV ] [ dev DEV
>>> ]\n");
>>
>> 'port' option is now allowed in the show operation
>>
> 
> Thanks - it seems it is already taken by vxlan using the same interface.
> 

Sorry, I wasn't very clear. What I meant was that you now support
  # bridge fdb show port <>

The usage message should reflect it.

-vlad
> 
> cheers,
> jamal
> 


* RE: Bonding
From: Gustavo Pimentel @ 2014-02-11 18:22 UTC
  To: Jay Vosburgh; +Cc: Veaceslav Falico, netdev@vger.kernel.org
In-Reply-To: <27459.1392141335@death.nxdomain>

Hi Jay,

I was not aware of that. You are correct, both HSR and PRP are defined in IEC 62439-3. I will try to contact the person in charge of HSR to get more information about the status of his driver.


> -----Original Message-----
> From: Jay Vosburgh [mailto:fubar@us.ibm.com]
> Sent: Tuesday, 11 February 2014 17:56
> To: Gustavo Pimentel
> Cc: Veaceslav Falico; netdev@vger.kernel.org
> Subject: Re: Bonding
> 
> Gustavo Pimentel <gustavo.pimentel@efacec.com> wrote:
> 
> >Hi Veaceslav,
> >
> >It's quite different from broadcast mode: each frame sent through the slaves has
> a Redundancy Control Trailer (RCT) attached. This trailer is composed of a LAN
> identifier, a sequence number, an LSDU size and a PRP suffix.
> >Also, equipment with PRP capability has to periodically send a supervision
> frame to both LANs. Each device on the network has to keep track of the
> sequence numbers it receives: if it receives a sequence number from a specific
> device, for instance on LAN A, and that number is not in its internal table, the
> device should accept the frame and update the table. When it later receives the
> same sequence number from LAN B, the device should discard the frame, providing
> zero-downtime redundancy.
> >
> >I can supply you with information about this redundancy protocol, if you like. This
> type of network redundancy is now being widely deployed in electrical power
> stations (thermal and hydro, for example) and power transmission stations, instead of
> teaming / bonding that depends on RSTP for redundancy.
> 
> 	Are you aware that there is already an implementation of HSR (High-
> availability Seamless Redundancy) in the linux kernel?  I believe HSR and PRP are
> defined by the same standard (IEC 62439-3), and are similar enough to interoperate
> to some degree.  Perhaps PRP would be better implemented as a variant within the
> existing net/hsr/ framework.
> 
> 	-J
> 
> 
> >> -----Original Message-----
> >> From: Veaceslav Falico [mailto:vfalico@redhat.com]
> >> Sent: Tuesday, 11 February 2014 14:16
> >> To: Gustavo Pimentel
> >> Cc: netdev@vger.kernel.org
> >> Subject: Re: Bonding
> >>
> >> On Tue, Feb 11, 2014 at 01:53:32PM +0000, Gustavo Pimentel wrote:
> >> >Hi,
> >>
> >> Hi Gustavo,
> >>
> >> >
> >> >I'm writing to you because I have implemented a new mode (PRP,
> >> Parallel Redundancy Protocol) for the bonding kernel driver. This new
> >> mode is quite simple. I don't know if you have heard about PRP, but
> >> it's a new standard that makes it possible to overcome any single
> >> network failure without affecting data transmission. The general idea
> >> is to have two separate, very similar LANs (A & B) and to transmit
> >> almost the same frame through both LANs; the end device should accept
> >> one frame and discard the other according to a known mechanism.
> >>
> >> Isn't that the current 'broadcast' mode, where every packet is
> >> transmitted over all the slaves? After quick googling/reading I don't
> >> see any difference there, though I might have missed something.
> >>
> >> >
> >> >I have implemented this new mode in the bonding driver, but I have some
> >> difficulties:
> >> >. Writing Linux drivers is quite new for me. I don't know if
> >> >guidelines exist for
> >> driver coding.
> >>
> >> You can find everything under Documentation/, but without the code I
> >> can't tell you exact documents. CodingStyle and SubmittingPatches might be
> the first ones.
> >>
> >> Also, try CC-ing relevant people for more feedback, specifically
> >> bonding maintainers.
> >>
> >> >. I don't know how to submit the code to be included in the kernel repository.
> >> >. Maybe another pair of eyes could help improve the code
> >> >written for this
> >> mode.
> >>
> >> Try sending an RFC when net-next opens.
> >>
> >> >
> >> >I think my driver code is 99% complete. I'm currently testing with 3
> >> >pieces of equipment: 1
> >> PC and 1 embedded device, both running my modified bonding driver, and a
> >> third-party device called RedBox.
> >> >
> >> >Would you be interested in participating / helping this project?
> >> >
> >> >With my best regards,
> >> >
> >> >Gustavo Gama da Rocha Pimentel
> >> >Power Systems Automation / Innovation & Development Efacec
> >> >Engenharia e Sistemas, S.A.
> >> >Phone: +351229403391
> >> >Disclaimer
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 



* Re: [PATCH 3/3] net: GSO encapsulation for IP packets
From: Alexei Starovoitov @ 2014-02-11 19:12 UTC
  To: Tom Herbert; +Cc: David S. Miller, netdev, ogerlitz
In-Reply-To: <alpine.DEB.2.02.1402110928030.7010@tomh.mtv.corp.google.com>

On Tue, Feb 11, 2014 at 9:43 AM, Tom Herbert <therbert@google.com> wrote:
> The UDP GSO code assumes that encapsulated packets are always Ethernet
> frames. This patch fixes that so that we can support IP protocol
> encapsulation (GUE, GRE/UDP, etc.)
>
> We overload the inner_protocol field in the skb to store either the
> Ethertype or the IP protocol (latter is indicated by ip_encapsulation
> bit). As far as I can tell this should not adversely affect preexisting
> uses for inner_protocol.
>
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/skbuff.h |  8 ++++++--
>  net/core/skbuff.c      |  1 +
>  net/ipv4/udp.c         | 12 +++++++++++-
>  3 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 1f689e6..757ed39 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -512,7 +512,11 @@ struct sk_buff {
>          * headers if needed
>          */
>         __u8                    encapsulation:1;
> -       /* 6/8 bit hole (depending on ndisc_nodetype presence) */
> +       /* skbuf encapsulates an IP packet, inner_protocol should be
> +        * interpreted as an IP protocol, encapsulation bit is also set
> +        */
> +       __u8                    ip_encapsulation:1;
> +       /* 5/7 bit hole (depending on ndisc_nodetype presence) */
>         kmemcheck_bitfield_end(flags2);
>
>  #if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
> @@ -530,7 +534,7 @@ struct sk_buff {
>                 __u32           reserved_tailroom;
>         };
>
> -       __be16                  inner_protocol;
> +       __u16                   inner_protocol;
>         __u16                   inner_transport_header;
>         __u16                   inner_network_header;
>         __u16                   inner_mac_header;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 8f519db..64c6190 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -687,6 +687,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>         new->ooo_okay           = old->ooo_okay;
>         new->no_fcs             = old->no_fcs;
>         new->encapsulation      = old->encapsulation;
> +       new->ip_encapsulation   = old->ip_encapsulation;
>  #ifdef CONFIG_XFRM
>         new->sp                 = secpath_get(old->sp);
>  #endif
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 77bd16f..48d8cb2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2497,7 +2497,17 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
>
>         /* segment inner packet. */
>         enc_features = skb->dev->hw_enc_features & netif_skb_features(skb);
> -       segs = skb_mac_gso_segment(skb, enc_features);
> +
> +       if (skb->ip_encapsulation) {
> +               const struct net_offload *ops;
> +               ops = rcu_dereference(inet_offloads[skb->inner_protocol]);
> +               if (likely(ops && ops->callbacks.gso_segment))
> +                       segs = ops->callbacks.gso_segment(skb, enc_features);
> +       } else {
> +               skb->protocol = htons(ETH_P_TEB);

Duplicate assignment? Do you want to remove line 2496, which did the same,
or does proto=teb apply to the ip_encap case as well?

> +               segs = skb_mac_gso_segment(skb, enc_features);
> +       }
> +
>         if (!segs || IS_ERR(segs)) {
>                 skb_gso_error_unwind(skb, protocol, tnl_hlen, mac_offset,
>                                      mac_len);
> --
> 1.9.0.rc1.175.g0b1dcb5
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
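
The idea in the quoted patch, one 16-bit field read either as an Ethertype or as an IP protocol number depending on a flag bit, can be shown in a small user-space sketch. Everything prefixed demo_ is hypothetical and exists only for illustration; none of it is the kernel's sk_buff or offload API. The bounds check before the table lookup is a choice made for this sketch, not a statement about what the patch does.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_P_TEB   0x6558	/* Transparent Ethernet Bridging Ethertype */
#define DEMO_IPPROTO_GRE 47	/* IP protocol number for GRE */
#define DEMO_MAX_PROTOS  256

struct demo_pkt {
	uint16_t inner_protocol;	/* Ethertype or IP protocol number */
	unsigned int ip_encapsulation:1;/* 1: read inner_protocol as IP protocol */
};

typedef const char *(*demo_segment_fn)(const struct demo_pkt *);

static const char *demo_segment_gre(const struct demo_pkt *p)
{
	(void)p;
	return "ip:gre";
}

static const char *demo_segment_mac(const struct demo_pkt *p)
{
	(void)p;
	return "mac";
}

/* Per-IP-protocol handlers; unset slots stay NULL. */
static const demo_segment_fn demo_ip_offloads[DEMO_MAX_PROTOS] = {
	[DEMO_IPPROTO_GRE] = demo_segment_gre,
};

/* Pick the segmentation path from the flag, then from the protocol value. */
static const char *demo_segment(const struct demo_pkt *p)
{
	if (p->ip_encapsulation) {
		demo_segment_fn fn = NULL;

		if (p->inner_protocol < DEMO_MAX_PROTOS)
			fn = demo_ip_offloads[p->inner_protocol];
		return fn ? fn(p) : NULL;	/* NULL: no handler registered */
	}
	return demo_segment_mac(p);		/* Ethernet-in-UDP path */
}
```

Because the field no longer always carries a network-byte-order Ethertype, the patch changes its type from __be16 to __u16; the sketch sidesteps byte order entirely by using host-order constants.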


* Re: 3.14-mw regression: rtl8169 WARNING: DMA-API: exceeded 7 overlapping mappings of pfn 55ebe
From: Sander Eikelenboom @ 2014-02-11 19:56 UTC
  To: Dan Williams
  Cc: Konrad Rzeszutek Wilk, Wei Liu, Francois Romieu,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CAPcyv4g2EnCLFWfLfSDnjtwrid2tCq2k6wh-8sPYY06eJpM83A@mail.gmail.com>

Hi Dan,

FYI, I just tested with Xen out of the equation (booting bare metal) and the warning still persists.

I tried something else. I don't know if it gives you any more insight, but it's worth a try:

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 2defd13..0fe5b75 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -474,11 +474,11 @@ static int active_pfn_set_overlap(unsigned long pfn, int overlap)
        return overlap;
 }

-static void active_pfn_inc_overlap(unsigned long pfn)
+static void active_pfn_inc_overlap(struct dma_debug_entry *ent)
 {
-       int overlap = active_pfn_read_overlap(pfn);
+       int overlap = active_pfn_read_overlap(ent->pfn);

-       overlap = active_pfn_set_overlap(pfn, ++overlap);
+       overlap = active_pfn_set_overlap(ent->pfn, ++overlap);

        /* If we overflowed the overlap counter then we're potentially
         * leaking dma-mappings.  Otherwise, if maps and unmaps are
@@ -486,15 +486,43 @@ static void active_pfn_inc_overlap(unsigned long pfn)
         * debug_dma_assert_idle() as the pfn may be marked idle
         * prematurely.
         */
+
        WARN_ONCE(overlap > ACTIVE_PFN_MAX_OVERLAP,
                  "DMA-API: exceeded %d overlapping mappings of pfn %lx\n",
-                 ACTIVE_PFN_MAX_OVERLAP, pfn);
+                 ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
+
+       if(overlap > ACTIVE_PFN_MAX_OVERLAP){
+
+               dev_info(ent->dev, "DMA-API: exceeded %d overlapping mappings of pfn %lx .. start dump\n", ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
+               int idx;
+
+               for (idx = 0; idx < HASH_SIZE; idx++) {
+                    struct hash_bucket *bucket = &dma_entry_hash[idx];
+                    struct dma_debug_entry *entry;
+                   unsigned long flags;
+
+                    list_for_each_entry(entry, &bucket->list, list) {
+                                       if (entry->pfn == ent->pfn) {
+                                           dev_info(entry->dev, "%s idx %d P=%Lx N=%lx D=%Lx L=%Lx %s %s\n",
+                                                type2name[entry->type], idx,
+                                                phys_addr(entry), entry->pfn,
+                                                entry->dev_addr, entry->size,
+                                                dir2name[entry->direction],
+                                               maperr2str[entry->map_err_type]);
+                                       }
+                    }
+               }
+               dev_info(ent->dev, "DMA-API: exceeded %d overlapping mappings of pfn %lx .. end of dump\n", ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
+       }
 }


@@ -505,10 +533,10 @@ static int active_pfn_insert(struct dma_debug_entry *entry)

        spin_lock_irqsave(&radix_lock, flags);
        rc = radix_tree_insert(&dma_active_pfn, entry->pfn, entry);
-       if (rc == -EEXIST)
-               active_pfn_inc_overlap(entry->pfn);
+       if (rc == -EEXIST){
+               active_pfn_inc_overlap(entry);
+       }
        spin_unlock_irqrestore(&radix_lock, flags);
-
        return rc;
 }


This results in:
[   27.708678] r8169 0000:0a:00.0 eth1: link down
[   27.712102] r8169 0000:0a:00.0 eth1: link down
[   28.015340] r8169 0000:0b:00.0 eth0: link down
[   28.015368] r8169 0000:0b:00.0 eth0: link down
[   29.654844] r8169 0000:0b:00.0 eth0: link up
[   30.278542] r8169 0000:0a:00.0 eth1: link up
[   60.829503] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   69.708979] EXT4-fs (dm-42): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   76.128678] EXT4-fs (dm-43): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   82.922836] EXT4-fs (dm-44): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   89.232889] EXT4-fs (dm-45): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   95.359859] EXT4-fs (dm-46): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[  101.638559] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[  218.073407] ------------[ cut here ]------------
[  218.080983] WARNING: CPU: 5 PID: 0 at lib/dma-debug.c:492 add_dma_entry+0xf1/0x210()
[  218.088550] DMA-API: exceeded 7 overlapping mappings of pfn 3c421
[  218.095988] Modules linked in:
[  218.103270] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W    3.14.0-rc2-20140211-pcireset-net-btrevert-xenblock-dmadebug5+ #1
[  218.110712] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[  218.118134]  0000000000000009 ffff88003fd437b8 ffffffff81b809c4 ffff88003e308000
[  218.125556]  ffff88003fd43808 ffff88003fd437f8 ffffffff810c985c 0000000000000000
[  218.132917]  00000000ffffffef 0000000000000036 ffff88003d9d3c00 0000000000000282
[  218.140154] Call Trace:
[  218.147193]  <IRQ>  [<ffffffff81b809c4>] dump_stack+0x46/0x58
[  218.154271]  [<ffffffff810c985c>] warn_slowpath_common+0x8c/0xc0
[  218.161293]  [<ffffffff810c9946>] warn_slowpath_fmt+0x46/0x50
[  218.168227]  [<ffffffff814f2cfa>] ? active_pfn_read_overlap+0x3a/0x70
[  218.175116]  [<ffffffff814f41d1>] add_dma_entry+0xf1/0x210
[  218.181865]  [<ffffffff814f4646>] debug_dma_map_page+0x126/0x150
[  218.188484]  [<ffffffff817aabeb>] rtl8169_start_xmit+0x21b/0xa20
[  218.195042]  [<ffffffff81a01877>] ? dev_queue_xmit_nit+0x1d7/0x260
[  218.201553]  [<ffffffff81a0188f>] ? dev_queue_xmit_nit+0x1ef/0x260
[  218.207965]  [<ffffffff81a016a5>] ? dev_queue_xmit_nit+0x5/0x260
[  218.214290]  [<ffffffff81a0661f>] dev_hard_start_xmit+0x37f/0x590
[  218.220481]  [<ffffffff81a26cae>] sch_direct_xmit+0xfe/0x280
[  218.226529]  [<ffffffff81a06a7f>] __dev_queue_xmit+0x24f/0x660
[  218.232521]  [<ffffffff81a06835>] ? __dev_queue_xmit+0x5/0x660
[  218.238439]  [<ffffffff81ab21b9>] ? ip_output+0x59/0xf0
[  218.244272]  [<ffffffff81a06eb0>] dev_queue_xmit+0x10/0x20
[  218.250043]  [<ffffffff81ab076b>] ip_finish_output+0x2cb/0x670
[  218.255682]  [<ffffffff81ab21b9>] ? ip_output+0x59/0xf0
[  218.261168]  [<ffffffff81ab21b9>] ip_output+0x59/0xf0
[  218.266559]  [<ffffffff81aad596>] ip_forward_finish+0x76/0x1a0
[  218.271883]  [<ffffffff81aad86b>] ip_forward+0x1ab/0x440
[  218.277148]  [<ffffffff81aab380>] ip_rcv_finish+0x150/0x660
[  218.282373]  [<ffffffff81aabe3b>] ip_rcv+0x22b/0x370
[  218.287436]  [<ffffffff81b09bc7>] ? packet_rcv_spkt+0x47/0x190
[  218.292372]  [<ffffffff81a03272>] __netif_receive_skb_core+0x722/0x8f0
[  218.297328]  [<ffffffff81a02c75>] ? __netif_receive_skb_core+0x125/0x8f0
[  218.302304]  [<ffffffff8112ce6e>] ? getnstimeofday+0xe/0x30
[  218.307296]  [<ffffffff819f42c5>] ? __netdev_alloc_frag+0x175/0x1b0
[  218.312166]  [<ffffffff81a03461>] __netif_receive_skb+0x21/0x70
[  218.316904]  [<ffffffff81a034d3>] netif_receive_skb_internal+0x23/0xf0
[  218.321596]  [<ffffffff81a04d2d>] napi_gro_receive+0x8d/0x100
[  218.326219]  [<ffffffff817a7bc3>] rtl8169_poll+0x2d3/0x680
[  218.330754]  [<ffffffff8112e366>] ? update_wall_time+0x356/0x690
[  218.335208]  [<ffffffff81a03a0a>] net_rx_action+0x18a/0x2c0
[  218.339595]  [<ffffffff810ce6f1>] ? __do_softirq+0xc1/0x300
[  218.343890]  [<ffffffff810ce767>] __do_softirq+0x137/0x300
[  218.348085]  [<ffffffff810cec9a>] irq_exit+0xaa/0xd0
[  218.352203]  [<ffffffff81b8e5a7>] do_IRQ+0x67/0x110
[  218.356225]  [<ffffffff81b8b772>] common_interrupt+0x72/0x72
[  218.360156]  <EOI>  [<ffffffff810536e6>] ? native_safe_halt+0x6/0x10
[  218.364087]  [<ffffffff81113a7d>] ? trace_hardirqs_on+0xd/0x10
[  218.367935]  [<ffffffff81020632>] default_idle+0x32/0xd0
[  218.371691]  [<ffffffff8102071e>] amd_e400_idle+0x4e/0x140
[  218.375360]  [<ffffffff81020f86>] arch_cpu_idle+0x36/0x40
[  218.378921]  [<ffffffff81120a01>] cpu_startup_entry+0xa1/0x2a0
[  218.382508]  [<ffffffff810473cf>] start_secondary+0x1af/0x210
[  218.386133] ---[ end trace 0e12f271209e2c18 ]---
[  218.389769] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c421 .. start dump
[  218.393566] r8169 0000:0b:00.0: single idx 563 P=3c421100 N=3c421 D=c66100 L=36 DMA_TO_DEVICE dma map error checked
[  218.397379] r8169 0000:0b:00.0: single idx 563 P=3c4212c0 N=3c421 D=c672c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.401094] r8169 0000:0b:00.0: single idx 564 P=3c421480 N=3c421 D=c68480 L=36 DMA_TO_DEVICE dma map error checked
[  218.404730] r8169 0000:0b:00.0: single idx 564 P=3c421640 N=3c421 D=c69640 L=36 DMA_TO_DEVICE dma map error checked
[  218.408310] r8169 0000:0b:00.0: single idx 565 P=3c421800 N=3c421 D=c6a800 L=36 DMA_TO_DEVICE dma map error checked
[  218.411762] r8169 0000:0b:00.0: single idx 565 P=3c4219c0 N=3c421 D=c6b9c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.415075] r8169 0000:0b:00.0: single idx 566 P=3c421b80 N=3c421 D=c6cb80 L=9b DMA_TO_DEVICE dma map error checked
[  218.418305] r8169 0000:0b:00.0: single idx 566 P=3c421dc0 N=3c421 D=c6ddc0 L=36 DMA_TO_DEVICE dma map error checked
[  218.421502] r8169 0000:0b:00.0: single idx 567 P=3c421f80 N=3c421 D=c6ef80 L=36 DMA_TO_DEVICE dma map error not checked
[  218.424677] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c421 .. end of dump
[  218.429050] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c423 .. start dump
[  218.432225] r8169 0000:0b:00.0: single idx 571 P=3c423040 N=3c423 D=c76040 L=36 DMA_TO_DEVICE dma map error checked
[  218.435408] r8169 0000:0b:00.0: single idx 571 P=3c423200 N=3c423 D=c77200 L=36 DMA_TO_DEVICE dma map error checked
[  218.438578] r8169 0000:0b:00.0: single idx 572 P=3c4233c0 N=3c423 D=c783c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.441695] r8169 0000:0b:00.0: single idx 572 P=3c423580 N=3c423 D=c79580 L=7b DMA_TO_DEVICE dma map error checked
[  218.444783] r8169 0000:0b:00.0: single idx 573 P=3c423780 N=3c423 D=c7a780 L=9b DMA_TO_DEVICE dma map error checked
[  218.447825] r8169 0000:0b:00.0: single idx 573 P=3c4239c0 N=3c423 D=c7b9c0 L=6b DMA_TO_DEVICE dma map error checked
[  218.450844] r8169 0000:0b:00.0: single idx 574 P=3c423bc0 N=3c423 D=c7cbc0 L=7b DMA_TO_DEVICE dma map error checked
[  218.453814] r8169 0000:0b:00.0: single idx 574 P=3c423dc0 N=3c423 D=c7ddc0 L=7b DMA_TO_DEVICE dma map error checked
[  218.456793] r8169 0000:0b:00.0: single idx 575 P=3c423fc0 N=3c423 D=c7efc0 L=7b DMA_TO_DEVICE dma map error not checked
[  218.459772] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c423 .. end of dump
[  218.473504] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c716 .. start dump
[  218.475662] r8169 0000:0b:00.0: single idx 586 P=3c7160c0 N=3c716 D=c940c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.477874] r8169 0000:0b:00.0: single idx 586 P=3c716280 N=3c716 D=c95280 L=36 DMA_TO_DEVICE dma map error checked
[  218.480075] r8169 0000:0b:00.0: single idx 587 P=3c716440 N=3c716 D=c96440 L=36 DMA_TO_DEVICE dma map error checked
[  218.482245] r8169 0000:0b:00.0: single idx 587 P=3c716600 N=3c716 D=c97600 L=36 DMA_TO_DEVICE dma map error checked
[  218.484390] r8169 0000:0b:00.0: single idx 588 P=3c7167c0 N=3c716 D=c987c0 L=42 DMA_TO_DEVICE dma map error checked
[  218.486510] r8169 0000:0b:00.0: single idx 588 P=3c7169c0 N=3c716 D=c999c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.488603] r8169 0000:0b:00.0: single idx 589 P=3c716b80 N=3c716 D=c9ab80 L=42 DMA_TO_DEVICE dma map error checked
[  218.490682] r8169 0000:0b:00.0: single idx 589 P=3c716d80 N=3c716 D=c9bd80 L=42 DMA_TO_DEVICE dma map error checked
[  218.492735] r8169 0000:0b:00.0: single idx 590 P=3c716f80 N=3c716 D=c9cf80 L=42 DMA_TO_DEVICE dma map error not checked
[  218.494788] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c716 .. end of dump
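
One way to read the dump above: every P= address in a group falls into the same page frame, and there are nine small mappings per group. The sketch below recomputes that from the first group. It assumes 4 KiB pages (a shift of 12, which matches the P= and N= columns) and counts overlap as "mappings beyond the first", mirroring how the patched code only bumps the counter when the radix tree insert returns -EEXIST. All demo_ names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12	/* assumption: 4 KiB pages */
#define DEMO_MAX_OVERLAP 7	/* limit quoted in the warning text */

/* Page frame number of a physical address. */
static uint64_t demo_pfn(uint64_t phys)
{
	return phys >> DEMO_PAGE_SHIFT;
}

/* Number of mappings beyond the first that land in page frame pfn. */
static int demo_overlap(const uint64_t *phys, size_t n, uint64_t pfn)
{
	int hits = 0;

	for (size_t i = 0; i < n; i++)
		if (demo_pfn(phys[i]) == pfn)
			hits++;
	return hits > 0 ? hits - 1 : 0;
}

/* P= values copied from the first dump (pfn 3c421). */
static const uint64_t demo_dump[] = {
	0x3c421100, 0x3c4212c0, 0x3c421480, 0x3c421640, 0x3c421800,
	0x3c4219c0, 0x3c421b80, 0x3c421dc0, 0x3c421f80,
};
```

For this group demo_overlap() gives 8, one more than the limit of 7, which is consistent with the warning firing on the ninth mapping (the one still marked "dma map error not checked").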

--
Sander





Thursday, February 6, 2014, 3:26:09 PM, you wrote:

> On Thu, Feb 6, 2014 at 5:09 AM, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> Hmm ok that last message was false .. sorry for that .. it did happen again without r8169.use_dac=1, it just doesn't seem to happen all the time...
>>
>> Konrad / Wei, do you happen to know of any xen related change that went into 3.14 merge window that relates to dma / xen networking ?
>>
>> --
>> Sander
>>
>> complete stacktrace:
>>
>> [  342.710738] ------------[ cut here ]------------
>> [  342.726890] WARNING: CPU: 0 PID: 0 at lib/dma-debug.c:491 add_dma_entry+0x105/0x130()
>> [  342.743210] DMA-API: exceeded 7 overlapping mappings of pfn 40b00
>> [  342.759510] Modules linked in:
>> [  342.775557] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc1-20140206-pcireset-net-btrevert+ #1
>> [  342.791706] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>> [  342.807627]  0000000000000009 ffff88005f603828 ffffffff81ad29fc ffffffff822134e0
>> [  342.823430]  ffff88005f603878 ffff88005f603868 ffffffff810bdf62 ffff880000000000
>> [  342.839081]  0000000000040b00 00000000ffffffef ffffffff822102e0 ffff8800592b9098
>> [  342.854572] Call Trace:
>> [  342.869748]  <IRQ>  [<ffffffff81ad29fc>] dump_stack+0x46/0x58
>> [  342.884915]  [<ffffffff810bdf62>] warn_slowpath_common+0x82/0xb0
>> [  342.899710]  [<ffffffff810be031>] warn_slowpath_fmt+0x41/0x50
>> [  342.914395]  [<ffffffff8147853a>] ? active_pfn_read_overlap+0x3a/0x70
>> [  342.929166]  [<ffffffff814792c5>] add_dma_entry+0x105/0x130
>> [  342.943733]  [<ffffffff814796c6>] debug_dma_map_page+0x126/0x150
>> [  342.957988]  [<ffffffff8171c8b6>] rtl8169_start_xmit+0x216/0xa20
>> [  342.972306]  [<ffffffff8195f08f>] ? dev_queue_xmit_nit+0x1ef/0x260
>> [  342.986523]  [<ffffffff8195eea0>] ? dev_loopback_xmit+0x1e0/0x1e0
>> [  343.000689]  [<ffffffff819631e6>] dev_hard_start_xmit+0x2e6/0x4a0
>> [  343.014466]  [<ffffffff81980f3e>] sch_direct_xmit+0xfe/0x280
>> [  343.028052]  [<ffffffff819635dc>] __dev_queue_xmit+0x23c/0x630
>> [  343.041338]  [<ffffffff819633a0>] ? dev_hard_start_xmit+0x4a0/0x4a0
>> [  343.054483]  [<ffffffff81a0a334>] ? ip_output+0x54/0xf0
>> [  343.067659]  [<ffffffff819639eb>] dev_queue_xmit+0xb/0x10
>> [  343.080804]  [<ffffffff81a0890b>] ip_finish_output+0x2cb/0x670
>> [  343.093746]  [<ffffffff81a0a334>] ? ip_output+0x54/0xf0
>> [  343.106391]  [<ffffffff81a0a334>] ip_output+0x54/0xf0
>> [  343.118683]  [<ffffffff81a05791>] ip_forward_finish+0x71/0x1a0
>> [  343.130901]  [<ffffffff81a05a63>] ip_forward+0x1a3/0x440
>> [  343.142829]  [<ffffffff810ffebb>] ? lock_is_held+0x8b/0xb0
>> [  343.154346]  [<ffffffff81a035c0>] ip_rcv_finish+0x150/0x660
>> [  343.165748]  [<ffffffff81a0406b>] ip_rcv+0x22b/0x370
>> [  343.176838]  [<ffffffff81a60972>] ? packet_rcv_spkt+0x42/0x190
>> [  343.187659]  [<ffffffff819609d2>] __netif_receive_skb_core+0x6d2/0x8a0
>> [  343.198209]  [<ffffffff81960414>] ? __netif_receive_skb_core+0x114/0x8a0
>> [  343.208819]  [<ffffffff81009010>] ? xen_clocksource_read+0x20/0x30
>> [  343.219471]  [<ffffffff81116e49>] ? getnstimeofday+0x9/0x30
>> [  343.229862]  [<ffffffff81960bbc>] __netif_receive_skb+0x1c/0x70
>> [  343.239953]  [<ffffffff81960c2e>] netif_receive_skb_internal+0x1e/0xf0
>> [  343.249908]  [<ffffffff81962110>] napi_gro_receive+0x70/0xa0
>> [  343.259509]  [<ffffffff817198a3>] rtl8169_poll+0x2d3/0x680
>> [  343.268982]  [<ffffffff81adcd2b>] ? _raw_spin_unlock_irq+0x2b/0x50
>> [  343.278091]  [<ffffffff819610d1>] net_rx_action+0x161/0x260
>> [  343.287056]  [<ffffffff810c28ec>] __do_softirq+0x12c/0x280
>> [  343.295756]  [<ffffffff810c2da2>] irq_exit+0xa2/0xd0
>> [  343.304235]  [<ffffffff814ffd5f>] xen_evtchn_do_upcall+0x2f/0x40
>> [  343.312387]  [<ffffffff81adf15e>] xen_do_hypervisor_callback+0x1e/0x30
>> [  343.320389]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>> [  343.328171]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>> [  343.335738]  [<ffffffff81008c70>] ? xen_safe_halt+0x10/0x20
>> [  343.343142]  [<ffffffff81018748>] ? default_idle+0x18/0x20
>> [  343.350202]  [<ffffffff81018f5e>] ? arch_cpu_idle+0x2e/0x40
>> [  343.356994]  [<ffffffff8110b551>] ? cpu_startup_entry+0x91/0x1e0
>> [  343.363658]  [<ffffffff81ac7d87>] ? rest_init+0xb7/0xc0
>> [  343.369924]  [<ffffffff81ac7cd0>] ? csum_partial_copy_generic+0x170/0x170
>> [  343.376057]  [<ffffffff8230ff1c>] ? start_kernel+0x409/0x416
>> [  343.381972]  [<ffffffff8230f912>] ? repair_env_string+0x5e/0x5e
>> [  343.387573]  [<ffffffff8230f5f8>] ? x86_64_start_reservations+0x2a/0x2c
>> [  343.393152]  [<ffffffff82312e28>] ? xen_start_kernel+0x586/0x588
>> [  343.398628] ---[ end trace 8379b598fb7ef5ee ]---
>>
>>
>>
>>
>>
>> Thursday, February 6, 2014, 12:36:31 PM, you wrote:
>>
>>> Hi Dan / Francois,
>>
>>> Didn't have time to test it before, but the patch doesn't seem to help.
>>> I'm still getting the "DMA-API: exceeded 7 overlapping mappings of pfn 55ebe",
>>> but i see now i forgot to mention i use r8169.use_dac=1 ...
>>
>>> Not using it seems to prevent the warning, but before 3.14 i have never seen this (with r8169.use_dac=1)

> If you are still hitting this with the patch:

>   59f2e7df574c dma-debug: fix overlap detection

> ...then I'm more inclined to think it is an actual positive report.

> If you don't mind I'll send some debug patches to narrow this down.

^ permalink raw reply related

* Re: [PATCH v2] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Matija Glavinic Pecotic @ 2014-02-11 19:56 UTC (permalink / raw)
  To: ext Vlad Yasevich
  Cc: linux-sctp@vger.kernel.org, netdev@vger.kernel.org,
	Alexander Sverdlin
In-Reply-To: <52FA3914.90002@gmail.com>

Hello Vlad,

On 02/11/2014 03:52 PM, ext Vlad Yasevich wrote:
> Hi Matija
> 
> On 02/09/2014 02:15 AM, Matija Glavinic Pecotic wrote:
>>
>> Proposed solution:
>>
>> Both problems share the same root cause, and that is improper scaling of socket
>> buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
>> calculating rwnd is not possible due to fact that there is no linear
>> relationship between amount of data blamed in increase/decrease with IP packet
>> in which payload arrived. Even in case such solution would be followed,
>> complexity of the code would increase. Due to nature of current rwnd handling,
>> slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
>> entered is rationale, but it gives false representation to the sender of current
>> buffer space. Furthermore, it implements additional congestion control mechanism
>> which is defined on implementation, and not on standard basis.
>>
>> Proposed solution simplifies whole algorithm having on mind definition from rfc:
>>
>> o  Receiver Window (rwnd): This gives the sender an indication of the space
>>    available in the receiver's inbound buffer.
>>
>> Core of the proposed solution is given with these lines:
>>
>> sctp_assoc_rwnd_update:
>> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> 	else
>> 		asoc->rwnd = 0;
>>
>> We advertise to sender (half of) actual space we have. Half is in the braces
>> depending whether you would like to observe size of socket buffer as SO_RECVBUF
>> or twice the amount, i.e. size is the one visible from userspace, that is,
>> from kernelspace.
>> In this way sender is given with good approximation of our buffer space,
>> regardless of the buffer policy - we always advertise what we have. Proposed
>> solution fixes described problems and removes necessity for rwnd restoration
>> algorithm. Finally, as proposed solution is simplification, some lines of code,
>> along with some bytes in struct sctp_association are saved.
>>
>> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
>> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>>
>> --- net-next.orig/net/sctp/associola.c
>> +++ net-next/net/sctp/associola.c
>> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
>>  	return false;
>>  }
>>
>> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
>> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
>> +/* Update asoc's rwnd for the approximated state in the buffer,
>> + * and check whether SACK needs to be sent.
>> + */
>> +void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
>>  {
>> +	int rx_count;
>>  	struct sctp_chunk *sack;
>>  	struct timer_list *timer;
>>
>> -	if (asoc->rwnd_over) {
>> -		if (asoc->rwnd_over >= len) {
>> -			asoc->rwnd_over -= len;
>> -		} else {
>> -			asoc->rwnd += (len - asoc->rwnd_over);
>> -			asoc->rwnd_over = 0;
>> -		}
>> -	} else {
>> -		asoc->rwnd += len;
>> -	}
>> +	if (asoc->ep->rcvbuf_policy)
>> +		rx_count = atomic_read(&asoc->rmem_alloc);
>> +	else
>> +		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>>
>> -	/* If we had window pressure, start recovering it
>> -	 * once our rwnd had reached the accumulated pressure
>> -	 * threshold.  The idea is to recover slowly, but up
>> -	 * to the initial advertised window.
>> -	 */
>> -	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
>> -		int change = min(asoc->pathmtu, asoc->rwnd_press);
>> -		asoc->rwnd += change;
>> -		asoc->rwnd_press -= change;
>> -	}
>> +	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> +		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> +	else
>> +		asoc->rwnd = 0;
>>
>> -	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
>> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
>> -		 asoc->a_rwnd);
>> +	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
>> +		 __func__, asoc, asoc->rwnd, rx_count,
>> +		 asoc->base.sk->sk_rcvbuf);
>>
>>  	/* Send a window update SACK if the rwnd has increased by at least the
>>  	 * minimum of the association's PMTU and half of the receive buffer.
>>  	 * The algorithm used is similar to the one described in
>>  	 * Section 4.2.3.3 of RFC 1122.
>>  	 */
>> -	if (sctp_peer_needs_update(asoc)) {
>> +	if (update_peer && sctp_peer_needs_update(asoc)) {
>>  		asoc->a_rwnd = asoc->rwnd;
>>
>>  		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
>> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
>>  	}
>>  }
>>
>> -/* Decrease asoc's rwnd by len. */
>> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
>> -{
>> -	int rx_count;
>> -	int over = 0;
>> -
>> -	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
>> -		pr_debug("%s: association:%p has asoc->rwnd:%u, "
>> -			 "asoc->rwnd_over:%u!\n", __func__, asoc,
>> -			 asoc->rwnd, asoc->rwnd_over);
>> -
>> -	if (asoc->ep->rcvbuf_policy)
>> -		rx_count = atomic_read(&asoc->rmem_alloc);
>> -	else
>> -		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>> -
>> -	/* If we've reached or overflowed our receive buffer, announce
>> -	 * a 0 rwnd if rwnd would still be positive.  Store the
>> -	 * the potential pressure overflow so that the window can be restored
>> -	 * back to original value.
>> -	 */
>> -	if (rx_count >= asoc->base.sk->sk_rcvbuf)
>> -		over = 1;
>> -
>> -	if (asoc->rwnd >= len) {
>> -		asoc->rwnd -= len;
>> -		if (over) {
>> -			asoc->rwnd_press += asoc->rwnd;
>> -			asoc->rwnd = 0;
>> -		}
>> -	} else {
>> -		asoc->rwnd_over = len - asoc->rwnd;
>> -		asoc->rwnd = 0;
>> -	}
>> -
>> -	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
>> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
>> -		 asoc->rwnd_press);
>> -}
>>
>>  /* Build the bind address list for the association based on info from the
>>   * local endpoint and the remote peer.
>> --- net-next.orig/include/net/sctp/structs.h
>> +++ net-next/include/net/sctp/structs.h
>> @@ -1653,17 +1653,6 @@ struct sctp_association {
>>  	/* This is the last advertised value of rwnd over a SACK chunk. */
>>  	__u32 a_rwnd;
>>
>> -	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
>> -	 * to slop over a maximum of the association's frag_point.
>> -	 */
>> -	__u32 rwnd_over;
>> -
>> -	/* Keeps treack of rwnd pressure.  This happens when we have
>> -	 * a window, but not recevie buffer (i.e small packets).  This one
>> -	 * is releases slowly (1 PMTU at a time ).
>> -	 */
>> -	__u32 rwnd_press;
>> -
>>  	/* This is the sndbuf size in use for the association.
>>  	 * This corresponds to the sndbuf size for the association,
>>  	 * as specified in the sk->sndbuf.
>> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
>>  __u32 sctp_association_get_next_tsn(struct sctp_association *);
>>
>>  void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
>> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
>> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
>> +void sctp_assoc_rwnd_update(struct sctp_association *, bool);
>>  void sctp_assoc_set_primary(struct sctp_association *,
>>  			    struct sctp_transport *);
>>  void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
>> --- net-next.orig/net/sctp/sm_statefuns.c
>> +++ net-next/net/sctp/sm_statefuns.c
>> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
>>  	 * PMTU.  In cases, such as loopback, this might be a rather
>>  	 * large spill over.
>>  	 */
>> -	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
>> +	if ((!chunk->data_accepted) && (!asoc->rwnd ||
>>  	    (datalen > asoc->rwnd + asoc->frag_point))) {
>>
>>  		/* If this is the next TSN, consider reneging to make
>> --- net-next.orig/net/sctp/socket.c
>> +++ net-next/net/sctp/socket.c
>> @@ -2092,12 +2092,6 @@ static int sctp_recvmsg(struct kiocb *io
>>  		sctp_skb_pull(skb, copied);
>>  		skb_queue_head(&sk->sk_receive_queue, skb);
>>
>> -		/* When only partial message is copied to the user, increase
>> -		 * rwnd by that amount. If all the data in the skb is read,
>> -		 * rwnd is updated when the event is freed.
>> -		 */
>> -		if (!sctp_ulpevent_is_notification(event))
>> -			sctp_assoc_rwnd_increase(event->asoc, copied);
>>  		goto out;
>>  	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
>>  		   (event->msg_flags & MSG_EOR))
>> --- net-next.orig/net/sctp/ulpevent.c
>> +++ net-next/net/sctp/ulpevent.c
>> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
>>  	skb = sctp_event2skb(event);
>>  	/* Set the owner and charge rwnd for bytes received.  */
>>  	sctp_ulpevent_set_owner(event, asoc);
>> -	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
>> +	sctp_assoc_rwnd_update(asoc, false);
>>
>>  	if (!skb->data_len)
>>  		return;
>> @@ -1035,8 +1035,9 @@ static void sctp_ulpevent_release_data(s
>>  	}
>>
>>  done:
>> -	sctp_assoc_rwnd_increase(event->asoc, len);
>> -	sctp_ulpevent_release_owner(event);
>> +	atomic_sub(event->rmem_len, &event->asoc->rmem_alloc);
>> +	sctp_assoc_rwnd_update(event->asoc, true);
>> +	sctp_association_put(event->asoc);
> 
> Can't we simply change the order of window update and release instead
> of open coding it like this?

That was the initial idea, but sctp_ulpevent_release_owner puts the association and calls sctp_association_destroy if it's time to do so. IMHO, if we switched the order, we would open a potential race condition.

I agree this doesn't look the best. But since we should call sctp_assoc_rwnd_update after accounting and before the put, the only alternative to open coding is to move sctp_assoc_rwnd_update into sctp_ulpevent_release_owner. On this path we want to update the peer and generate a SACK, but we certainly do not want that on all paths where sctp_ulpevent_release_owner is used, so I see no alternative but to add a parameter to sctp_ulpevent_release_owner, bool update_peer, which would just be passed on to sctp_assoc_rwnd_update. On the other hand, I wonder whether sctp_ulpevent_release_owner would then do more than it should?
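
For readers following the ordering argument, here is a minimal user-space sketch of it. It is not kernel code: struct toy_asoc and the toy_* helpers are hypothetical names invented for illustration, and the reference counting is reduced to a plain integer. It only shows the arithmetic of the proposed rwnd update and why the update has to run between uncharging the buffer and dropping the reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct sctp_association; field names are illustrative. */
struct toy_asoc {
	int refcnt;      /* models the association reference count */
	int rcvbuf;      /* models sk->sk_rcvbuf */
	int rmem_alloc;  /* models the receive memory currently charged */
	uint32_t rwnd;   /* window we would advertise */
	bool destroyed;  /* set when the last reference is dropped */
};

/* Same arithmetic as the proposed sctp_assoc_rwnd_update(): advertise half
 * of the free receive buffer, or zero when the buffer is exhausted.
 */
void toy_rwnd_update(struct toy_asoc *asoc)
{
	int free_space = asoc->rcvbuf - asoc->rmem_alloc;

	assert(!asoc->destroyed);	/* must run before the final put */
	asoc->rwnd = free_space > 0 ? (uint32_t)free_space >> 1 : 0;
}

/* Drop one reference; the toy association "dies" on the last put. */
void toy_asoc_put(struct toy_asoc *asoc)
{
	if (--asoc->refcnt == 0)
		asoc->destroyed = true;
}

/* Order used in the patch: uncharge, recompute rwnd, then drop the ref. */
void toy_release_data(struct toy_asoc *asoc, int rmem_len)
{
	asoc->rmem_alloc -= rmem_len;
	toy_rwnd_update(asoc);
	toy_asoc_put(asoc);
}
```

With rcvbuf = 1000 and 600 bytes charged, releasing 200 bytes leaves 600 bytes free, so the toy model advertises 300. Swapping the last two calls in toy_release_data() would trip the assertion whenever the put happens to be the final one.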

> 
> -vlad
> 
>>  }
>>
>>  static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

^ permalink raw reply

* Re: [PATCH v2 1/2] dp83640: Support a configurable number of periodic outputs
From: Richard Cochran @ 2014-02-11 20:09 UTC (permalink / raw)
  To: Stefan Sørensen
  Cc: grant.likely, robh+dt, mark.rutland, netdev, linux-kernel,
	devicetree
In-Reply-To: <1392132562-23644-2-git-send-email-stefan.sorensen@spectralink.com>

On Tue, Feb 11, 2014 at 04:29:21PM +0100, Stefan Sørensen wrote:

> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
> index 547725f..d4fe95d 100644
> --- a/drivers/net/phy/dp83640.c
> +++ b/drivers/net/phy/dp83640.c
> @@ -38,15 +38,11 @@
>  #define LAYER4		0x02
>  #define LAYER2		0x01
>  #define MAX_RXTS	64
> -#define N_EXT_TS	6
> +#define N_EXT		8
>  #define PSF_PTPVER	2
>  #define PSF_EVNT	0x4000
>  #define PSF_RX		0x2000
>  #define PSF_TX		0x1000
> -#define EXT_EVENT	1

Regarding this EXT_EVENT thing ...

> @@ -430,12 +419,12 @@ static int ptp_dp83640_enable(struct ptp_clock_info *ptp,
>  	switch (rq->type) {
>  	case PTP_CLK_REQ_EXTTS:
>  		index = rq->extts.index;
> -		if (index < 0 || index >= N_EXT_TS)
> +		if (index < 0 || index >= n_ext_ts)
>  			return -EINVAL;
> -		event_num = EXT_EVENT + index;
> +		event_num = index;

there was a mapping between the "event numbers" and the external time
stamp channels. I don't remember off the top of my head why these
two differ by one, but there was a good reason.

Are you sure this is still working with this change?

I am especially wondering about the event decoding here:

> @@ -642,7 +631,7 @@ static void recalibrate(struct dp83640_clock *clock)
>  
>  static inline u16 exts_chan_to_edata(int ch)
>  {
> -	return 1 << ((ch + EXT_EVENT) * 2);
> +	return 1 << ((ch) * 2);
>  }

Maybe I am just paranoid, but can you remind me how these event
numbers are supposed to work, before and after the change?

Thanks,
Richard
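
To make the question concrete, here is a small sketch of the two channel-to-bit mappings taken from the quoted hunk. The function names edata_before and edata_after are made up for this illustration, and nothing here says which mapping the hardware expects; it only shows that the patch moves every channel's bit down by two positions.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_EVENT 1	/* offset removed by the patch */

/* Mapping before the patch: channel ch uses event number ch + EXT_EVENT. */
uint16_t edata_before(int ch)
{
	assert(ch >= 0 && ch + EXT_EVENT < 8);	/* keep the shift inside 16 bits */
	return (uint16_t)(1u << ((ch + EXT_EVENT) * 2));
}

/* Mapping after the patch: channel ch uses event number ch directly. */
uint16_t edata_after(int ch)
{
	assert(ch >= 0 && ch < 8);
	return (uint16_t)(1u << (ch * 2));
}
```

Channel 0 maps to 0x0004 before and 0x0001 after, so edata_after(ch + 1) equals edata_before(ch). As a side observation, with the old offset a channel index of 7 would need bit 16, which does not fit in a u16.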

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 20:15 UTC (permalink / raw)
  To: vyasevic, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA6A24.3030402@redhat.com>

On 02/11/14 13:21, Vlad Yasevich wrote:
> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>> On 02/10/14 11:31, Vlad Yasevich wrote:

> No, this was more the point that the current iproute code sends an
> ifinfomsg struct down, and you change that to send ndmsg struct.
> This is risky, but we luck out since the index is at the same offset
> in both structs.
>

Ah, ok, thanks for catching that. I should have said something - the
original code was wrong and I felt it was safe to make the change
given that the kernel code never even looked at what was being
sent to it. The expected symmetry is violated:
it doesn't make sense to send an ifm and expect back an ndm.
I should send that separately as a bug fix.
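
The "same offset" remark can be checked with a short sketch. The two structs below are local mirrors written for this illustration, on the assumption that they match the leading fields of struct ifinfomsg and struct ndmsg in the kernel UAPI headers; a real check would include those headers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed mirror of struct ifinfomsg (link-level netlink header). */
struct toy_ifinfomsg {
	uint8_t  ifi_family;
	uint8_t  ifi_pad;
	uint16_t ifi_type;
	int32_t  ifi_index;	/* interface index */
	uint32_t ifi_flags;
	uint32_t ifi_change;
};

/* Assumed mirror of struct ndmsg (neighbour/fdb netlink header). */
struct toy_ndmsg {
	uint8_t  ndm_family;
	uint8_t  ndm_pad1;
	uint16_t ndm_pad2;
	int32_t  ndm_ifindex;	/* interface index */
	uint16_t ndm_state;
	uint8_t  ndm_flags;
	uint8_t  ndm_type;
};

/* Returns the shared offset; the assert encodes the "same offset" claim. */
size_t shared_index_offset(void)
{
	size_t ifi = offsetof(struct toy_ifinfomsg, ifi_index);
	size_t ndm = offsetof(struct toy_ndmsg, ndm_ifindex);

	assert(ifi == ndm);
	return ifi;
}
```

The index sits at byte 4 in both layouts, but the structs differ in total size, which is one reason sending one and parsing the other only works by luck.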

> But that would only happen if the user said:
>    # bridge fdb show br eth0
>
> If eth0 in this case is a hw bridge device, getting the device's
> version of fdb data is exactly what would be expected, isn't it?
>

Well, if it is a "bridge device" why would it not be tagged as a bridge
device?

> If you mean a 'software bridge' above, then that's not an issue
> since that's a disallowed config.  You can't stack software bridges
> without something in the middle like bond or vlan.
>

Ok, didn't realize that.
So I can't add a bridge as a bridge port to another bridge?

>
> Yes, macvlan can forward data to other macvlans, but that's
> not the interesting thing.

Sample config?

> When you configure multiple macvlan devices on top of the
> same hw device, one could think of the hw device as a sort
> of a bridge.  It's not really, but you could define it in
> those terms.  The fdb entries, in this case, contain the mac
> addresses of the macvlan devices.
>

It certainly has some equivalent semantics (looks at dst MAC then
picks the port). Possible to add VLANs as well?
Why don't we tag such a thing as a bridge then?

>
> Sorry, I wasn't very clear. What I meant was that you now support
>    # bridge fdb show port <>
>
> The usage message should reflect it.
>

Sorry - I noticed the word "port" at exactly where your quote came.
So i thought you noticed that "port" was already taken - it is used
for VXLAN fdb entries (for udp ports).

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v2 2/2] dp83640: Get pin and master/slave configuration from DT
From: Richard Cochran @ 2014-02-11 20:19 UTC (permalink / raw)
  To: Stefan Sørensen
  Cc: grant.likely, robh+dt, mark.rutland, netdev, linux-kernel,
	devicetree
In-Reply-To: <1392132562-23644-3-git-send-email-stefan.sorensen@spectralink.com>

On Tue, Feb 11, 2014 at 04:29:22PM +0100, Stefan Sørensen wrote:

> diff --git a/Documentation/devicetree/bindings/net/dp83640.txt b/Documentation/devicetree/bindings/net/dp83640.txt
> new file mode 100644
> index 0000000..b9a57c0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/dp83640.txt
> @@ -0,0 +1,29 @@
> +Required properties for the National DP83640 ethernet phy:
> +
> +- compatible : Must contain "national,dp83640"
> +
> +Optional properties:
> +
> +- dp83640,slave: If present, this phy will be slave to another dp83640
> +  on the same mdio bus.

Wouldn't it be more natural to have one "dp83640,master" property
rather than multiple slave properties?

> @@ -949,6 +940,95 @@ static void dp83640_clock_put(struct dp83640_clock *clock)
>  	mutex_unlock(&clock->clock_lock);
>  }
>  
> +#ifdef CONFIG_OF
> +static int dp83640_probe_dt(struct device_node *node,
> +			    struct dp83640_private *dp83640)
> +{
> +	struct dp83640_clock *clock = dp83640->clock;
> +	struct property *prop;
> +	int err, proplen;
> +
> +	dp83640->slave = of_property_read_bool(node, "dp83640,slave");
> +	if (!dp83640->slave && clock->chosen) {
> +		pr_err("dp83640,slave must be set if more than one device on the same bus");

Most of these pr_err lines are a bit _way_ too long for coding style.

> +		return -EINVAL;
> +	}
> +
> +	prop = of_find_property(node, "dp83640,perout-pins", &proplen);
> +	if (prop) {
> +		if (dp83640->slave) {
> +			pr_err("dp83640,perout-pins property can not be set together with dp83640,slave");

(Here especially and in the code that followed.)

Overall the series is looking better. I will try to test the non-DT
case later on this week.

Thanks,
Richard

^ permalink raw reply

* Re: xfrm: is pmtu broken with ESP tunneling?
From: Ortwin Glück @ 2014-02-11 20:20 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: linux-kernel, netdev
In-Reply-To: <20140211023258.GC11150@order.stressinduktion.org>

On 02/11/2014 03:32 AM, Hannes Frederic Sowa wrote:
>> net.ipv4.ip_no_pmtu_disc=1.
>
> This setting will shrink the path mtu to min_pmtu when a frag needed icmp is
> received.

The UDP+ESP encapsulation adds 60 bytes to the original packet size.

ifconfig wla0 shows an mtu of 1500.

The size of the first big packet on the interface:
net.ipv4.ip_no_pmtu_disc=1: packet length is 1300
net.ipv4.ip_no_pmtu_disc=0: packet length is 1500

Length is without the ESP wrapper and UDP encapsulation. The packets are so big 
that they can't even leave the wireless interface and never show up on the 
router. So no ICMP packets are received. PMTU can't work with initial packets of 
that size.

Dumb question: which layer discards these packets? qdisc? Why is there no
notification to the sender?

When I increase the mtu of the interface to 2000 with ifconfig, then I start 
seeing ICMP fragmentation needed from the next hop, indicating 1500 as the mtu 
as response to a 1560 byte UDP[ESP] packet.

The next UDP[ESP] packet is shorter: 1360 bytes. It gets hard to see what's 
going on after that, but the connection is still not working.

So, instead of somehow losing these packets on the way out of the interface,
should the kernel not start with a lower mtu in the first place? Now it seems it
is trying with the maximum of the interface and expecting to scale down with
pmtu - which can never happen.
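
The numbers in this report can be restated as a small sketch. The 60-byte overhead is the figure given above for UDP+ESP on this setup; the helper names are invented for illustration and do not correspond to any kernel function.

```c
#include <assert.h>
#include <stdbool.h>

#define ESP_UDP_OVERHEAD 60	/* bytes added by UDP+ESP, per the report */

/* Largest inner packet that still fits the link after encapsulation. */
int max_inner_len(int link_mtu)
{
	assert(link_mtu > ESP_UDP_OVERHEAD);
	return link_mtu - ESP_UDP_OVERHEAD;
}

/* True when the encapsulated packet fits within link_mtu. */
bool fits_after_encap(int inner_len, int link_mtu)
{
	return inner_len + ESP_UDP_OVERHEAD <= link_mtu;
}
```

On a 1500-byte link the inner packet may be at most 1440 bytes. A 1500-byte inner packet becomes 1560 bytes and cannot leave the interface, while a 1300-byte one becomes 1360 bytes and fits, which matches the observed behaviour of the two sysctl settings.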

> Can you send a ip route get <ip> to the problematic target to see how
> far off the calculated value is?

That command doesn't return anything useful. No hint on the mtu here.

BTW, instead of disabling pmtu, setting mtu explicitly also helps:
ip route add 10.6.6.0/24 via ${localip} mtu 1300

Thanks,

Ortwin

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: John Fastabend @ 2014-02-11 20:21 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: vyasevic, netdev@vger.kernel.org, Stephen Hemminger,
	Scott Feldman
In-Reply-To: <52FA84FA.2030608@mojatatu.com>

On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:
>
>>
>> Yes, macvlan can forward data to other macvlans, but that's
>> not the interesting thing.
>
> Sample config?

ip link add link ethx name mv1 type macvlan mode bridge
ip link add link ethx name mv2 type macvlan mode bridge

Now you have a macvlan on ethx that will forward data between
mv1 and mv2.
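
As a rough illustration of "look at the destination MAC, then pick the port", here is a toy forwarding table in user-space C. It is not the macvlan implementation; struct toy_fdb and the toy_fdb_* helpers are hypothetical, and the port numbers simply stand for mv1 and mv2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FDB_MAX 8
#define TOY_ETH_ALEN 6

struct toy_fdb_entry {
	uint8_t mac[TOY_ETH_ALEN];
	int port;	/* index of the macvlan, e.g. 1 for mv1 */
};

struct toy_fdb {
	struct toy_fdb_entry entries[TOY_FDB_MAX];
	int count;
};

/* Record that mac lives behind port; returns 0, or -1 when the table is full. */
int toy_fdb_add(struct toy_fdb *fdb, const uint8_t *mac, int port)
{
	assert(fdb && mac);
	if (fdb->count >= TOY_FDB_MAX)
		return -1;
	memcpy(fdb->entries[fdb->count].mac, mac, TOY_ETH_ALEN);
	fdb->entries[fdb->count].port = port;
	fdb->count++;
	return 0;
}

/* Return the port for a destination MAC, or -1 when it is unknown. */
int toy_fdb_lookup(const struct toy_fdb *fdb, const uint8_t *dst)
{
	int i;

	assert(fdb && dst);
	for (i = 0; i < fdb->count; i++)
		if (memcmp(fdb->entries[i].mac, dst, TOY_ETH_ALEN) == 0)
			return fdb->entries[i].port;
	return -1;
}
```

A frame whose destination matches the address registered for mv2 resolves to port 2; an unknown destination returns -1, which a real device would hand to the lower interface or drop depending on mode.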

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: John Fastabend @ 2014-02-11 20:30 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: vyasevic, netdev@vger.kernel.org, Stephen Hemminger,
	Scott Feldman
In-Reply-To: <52FA84FA.2030608@mojatatu.com>

On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:
> On 02/11/14 13:21, Vlad Yasevich wrote:
>> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>>> On 02/10/14 11:31, Vlad Yasevich wrote:
>
>> No, this was more the point that the current iproute code sends an
>> ifinfomsg struct down, and you change that to send ndmsg struct.
>> This is risky, but we luck out since the index is at the same offset
>> in both structs.
>>
>
> Ah, ok, thanks for catching that. I should have said something - the
> original code was wrong and I felt it was safe to make the change
> given that the kernel code never even looked at what was being
> sent to it. The expected symmetry is violated:
> it doesn't make sense to send an ifm and expect back an ndm.
> I should send that separately as a bug fix.
>
>
>> But that would only happen if the user said:
>>    # bridge fdb show br eth0
>>
>> If eth0 in this case is a hw bridge device, getting the device's
>> version of fdb data is exactly what would be expected, isn't it?
>>
>
> Well, if it is a "bridge device" why would it not be tagged as a bridge
> device?

What do you mean by "bridge device"? Are you specifically talking about
the IFF_BRIDGE flag? This flag is used only for ./net/bridge devices. For
example macvlan uses its own flag. I think there is a good case to be
made for netdevices which are acting as the management interface for a
hardware bridge to set an identifying flag. Perhaps IFF_HWBRIDGE.

>
>> If you mean a 'software bridge' above, then that's not an issue
>> since that's a disallowed config.  You can't stack software bridges
>> without something in the middle like bond or vlan.
>>
>
> Ok, didnt realize that.
> So i cant add a bridge as a bridge port to another bridge?
>

# ip link set dev bridge0 master bridge1
RTNETLINK answers: Too many levels of symbolic links

in the bridge case this doesn't work. But you can stack a macvlan
on top of the bridge port,

# ip link add link bridge0 type macvlan mode vepa

11: macvlan0@bridge0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop 
state DOWN mode DEFAULT group default

And macvlans on macvlans is OK as well.

# ip link add link macvlan0 type macvlan mode vepa
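
The odd-looking "Too many levels of symbolic links" text is what glibc prints for errno ELOOP. Assuming the bridge code rejects the request with that error, which is what the message suggests, the check can be modelled as below. struct toy_dev and toy_bridge_add_port are hypothetical names for this sketch, not the kernel's own.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_dev {
	bool is_bridge;	/* hypothetical flag: device is itself a software bridge */
};

/* Toy model of enslaving port to a bridge: refuse bridges with -ELOOP. */
int toy_bridge_add_port(const struct toy_dev *port)
{
	assert(port);
	if (port->is_bridge)
		return -ELOOP;	/* userspace renders this via strerror(ELOOP) */
	return 0;
}
```

In this model a macvlan stacked on bridge0 is not flagged as a bridge, so adding it as a port succeeds, while adding bridge0 itself fails.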

[...]

>> When you configure multiple macvlan devices on top of the
>> same hw device, one could think of the hw device as a sort
>> of a bridge.  It's not really, but you could define it in
>> those terms.  The fdb entries, in this case, contain the mac
>> addresses of the macvlan devices.
>>
>
> It certainly has some equivalent semantics (looks at dst MAC then
> picks the port). Possible to add Vlans as well?
> Why dont we tag such a thing as a bridge then?
>

If it's useful then we should. You can track them down in userspace
via /sys/class/net/ or looking for offloaded netdevices that point
to the interface but a flag is definitely more direct.

>>
>> Sorry, I wasn't very clear. What I meant was that you now support
>>    # bridge fdb show port <>
>>
>> The usage message should reflect it.
>>
>
> Sorry - I noticed the word "port" at exactly where your quote came.
> So i thought you noticed that "port" was already taken - it is used
> for VXLAN fdb entries (for udp ports).
>
>
> cheers,
> jamal

^ permalink raw reply

* Re: [PATCH 1/6] staging: r8188eu: Replace wrapper around _rtw_memcmp()
From: Greg KH @ 2014-02-11 20:40 UTC (permalink / raw)
  To: Larry Finger; +Cc: devel, netdev
In-Reply-To: <1391980559-24288-2-git-send-email-Larry.Finger@lwfinger.net>

On Sun, Feb 09, 2014 at 03:15:54PM -0600, Larry Finger wrote:
> This wrapper is replaced with a simple memcmp(). As the wrapper inverts the
> logic of memcmp(), care needed to be taken.

That's just evil, ugh, nice job...
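
A short sketch of why the conversion needs care. It assumes, as the commit message says, that the removed wrapper returned true when the buffers matched; toy_wrapper_equal is a hypothetical stand-in, not the driver's function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the removed wrapper: true when buffers match. */
bool toy_wrapper_equal(const void *a, const void *b, size_t len)
{
	assert(a && b);
	return memcmp(a, b, len) == 0;
}
```

A call site written as "if (wrapper(a, b, n))" therefore has to become "if (!memcmp(a, b, n))", and a negated call site loses its negation. Replacing the name alone would silently flip every comparison.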

^ permalink raw reply

* Re: [PATCH 6/6] staging: r8188eu: Remove _func_enter and _func_exit macros
From: Greg KH @ 2014-02-11 20:41 UTC (permalink / raw)
  To: Larry Finger; +Cc: devel, netdev
In-Reply-To: <1391980559-24288-7-git-send-email-Larry.Finger@lwfinger.net>

On Sun, Feb 09, 2014 at 03:15:59PM -0600, Larry Finger wrote:
> These debugging macros are seldom used for debugging once the driver
> is working. If routine tracing is needed, it can be added on an
> individual basis.

No, you can use the in-kernel tracing functionality :)

nice job.

greg k-h

^ permalink raw reply

* Re: [PATCH 3/3] net: GSO encapsulation for IP packets
From: Tom Herbert @ 2014-02-11 20:45 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David S. Miller, Linux Netdev List, Or Gerlitz
In-Reply-To: <CAADnVQ+PcO9-BGVE+ExFZt_sxhN1gDtiXtxMHPfHEX7wB3cWWg@mail.gmail.com>

On Tue, Feb 11, 2014 at 11:12 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Feb 11, 2014 at 9:43 AM, Tom Herbert <therbert@google.com> wrote:
>> The UDP GSO code assume that only encapsulated packets are Ethernet
>> frames. This patch fixes that so that we can support IP protocol
>> encpasulation (GUE, GRE/UDP, etc.)
>>
>> We overload the inner_protocol field in the skb to store either the
>> Ethertype or the IP protocol (latter is indicated by ip_encapsulation
>> bit). As far as I can tell this should not adversely affect preexiting
>> uses for inner_protocol.
>>
>> Signed-off-by
>> +++ b/net/ipv4/udp.c
>> @@ -2497,7 +2497,17 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
>>
>>         /* segment inner packet. */
>>         enc_features = skb->dev->hw_enc_features & netif_skb_features(skb);
>> -       segs = skb_mac_gso_segment(skb, enc_features);
>> +
>> +       if (skb->ip_encapsulation) {
>> +               const struct net_offload *ops;
>> +               ops = rcu_dereference(inet_offloads[skb->inner_protocol]);
>> +               if (likely(ops && ops->callbacks.gso_segment))
>> +                       segs = ops->callbacks.gso_segment(skb, enc_features);
>> +       } else {
>> +               skb->protocol = htons(ETH_P_TEB);
>
> duplicate assignment ? Do you want to remove line 2496 which did the same
> or proto=teb applies to ip_encap case as well?
>

Thanks for catching that!  I think the assignment at 2496 should be removed.

>> +               segs = skb_mac_gso_segment(skb, enc_features);
>> +       }
>> +
>>         if (!segs || IS_ERR(segs)) {
>>                 skb_gso_error_unwind(skb, protocol, tnl_hlen, mac_offset,
>>                                      mac_len);
>> --
>> 1.9.0.rc1.175.g0b1dcb5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
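
The dispatch in the quoted hunk can be sketched in user space as follows. Everything prefixed toy_ is invented for illustration: the table stands in for inet_offloads, the handlers return an id instead of segmenting anything, and the bounds check is an addition of this sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int (*toy_segment_fn)(void);

struct toy_skb {
	bool ip_encapsulation;		/* inner_protocol holds an IP protocol */
	uint16_t inner_protocol;	/* Ethertype or IP protocol, per the flag */
};

static int toy_mac_segment(void) { return 1; }	/* Ethernet-frame path */
static int toy_gre_segment(void) { return 2; }	/* IP-protocol path */

/* Toy offload table indexed by IP protocol number; 47 is GRE. */
static toy_segment_fn toy_inet_offloads[256] = { [47] = toy_gre_segment };

/* Returns the id of the path taken, or -1 when no handler is registered. */
int toy_tunnel_segment(const struct toy_skb *skb)
{
	assert(skb);
	if (skb->ip_encapsulation) {
		toy_segment_fn fn;

		if (skb->inner_protocol >= 256)
			return -1;
		fn = toy_inet_offloads[skb->inner_protocol];
		return fn ? fn() : -1;
	}
	return toy_mac_segment();
}
```

An skb with the flag clear takes the Ethernet path regardless of inner_protocol; with the flag set, protocol 47 reaches the GRE handler and an unregistered protocol yields -1.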

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: Vlad Yasevich @ 2014-02-11 21:00 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA84FA.2030608@mojatatu.com>

On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:
> On 02/11/14 13:21, Vlad Yasevich wrote:
>> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>>> On 02/10/14 11:31, Vlad Yasevich wrote:
> 
>> No, this was more the point that the current iproute code sends an
>> ifinfomsg struct down, and you change that to send ndmsg struct.
>> This is risky, but we luck out since the index is at the same offset
>> in both structs.
>>
> 
> Ah, ok, thanks for catching that. I should have said something - the
> original code was wrong and I felt it was safe to make the change
> given that the kernel code never even looked at what was being
> sent to it. The expected symmetry is violated:
> it doesn't make sense to send an ifm and expect back an ndm.
> I should send that separately as a bug fix.
> 
> 
>> But that would only happen if the user said:
>>    # bridge fdb show br eth0
>>
>> If eth0 in this case is a hw bridge device, getting the device's
>> version of fdb data is exactly what would be expected, isn't it?
>>
> 
> Well, if it is a "bridge device" why would it not be tagged as a bridge
> device?

Because it is just a multi-function NIC that isn't tagged with any
kind of bridge flag.  As John said, this might be useful, but it is not
done yet.

> 
>> If you mean a 'software bridge' above, then that's not an issue
>> since that's a disallowed config.  You can't stack software bridges
>> without something in the middle like bond or vlan.
>>
> 
> Ok, didnt realize that.
> So i cant add a bridge as a bridge port to another bridge?

Not directly.  However, if you put a layered software device in between
(vlan, bond, macvlan), then you can add that device to another bridge.
In fact, people do that to get GVRP working with VMs.

> 
>>
>> Yes, macvlan can forward data to other macvlans, but that's
>> not the interesting thing.
> 
> Sample config?
> 
>> When you configure multiple macvlan devices on top of the
>> same hw device, one could think of the hw device as a sort
>> of a bridge.  It's not really, but you could define it in
>> those terms.  The fdb entries, in this case, contain the mac
>> addresses of the macvlan devices.
>>
> 
> It certainly has some equivalent semantics (looks at dst MAC then
> picks the port). Possible to add Vlans as well?

I suppose.   You can do things like:
# ip link add link eth0 dev vlan100 type vlan protocol 802.1Q id 100
# ip link add link vlan100 dev mac100 type macvlan

Now, you have a macvlan (mac100) that will only receive vlan100 traffic.
Expressing this in terms of fdb would be a bit difficult since each
interface is separate and eth0 doesn't really know about the stack.
It would require quite a lot of code.

> Why dont we tag such a thing as a bridge then?
> 

Because they are not always a bridge.  It could be just a nic capable of
mac filtering.

>>
>> Sorry, I wasn't very clear. What I meant was that you now support
>>    # bridge fdb show port <>
>>
>> The usage message should reflect it.
>>
> 
> Sorry - I noticed the word "port" at exactly where your quote came.
> So i thought you noticed that "port" was already taken - it is used
> for VXLAN fdb entries (for udp ports).
>

Didn't realize it has a different connotation for vxlan.  Then you probably
don't want to include and support it in the bridge fdb show command.

-vlad

> 
> cheers,
> jamal

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 21:04 UTC (permalink / raw)
  To: John Fastabend
  Cc: vyasevic, netdev@vger.kernel.org, Stephen Hemminger,
	Scott Feldman
In-Reply-To: <52FA8865.1070302@intel.com>

On 02/11/14 15:30, John Fastabend wrote:
> On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:

Thanks for the example on the other email.


> What do you mean by "bridge device" are you specifically talking about
> IFF_BRIDGE flag? This flag is used only for ./net/bridge devices.

Right - the simple definition is this thing has an fdb.
Yes, I know we've added vlan filtering and multicast snooping
but that's all lipstick. If it has an (ethernet) fdb it is a bridge.

>For
> example macvlan uses its own flag. I think there is a good case to be
> made for netdevices which are acting as the management interface for a
> hardware bridge to set an identifying flag. Perhaps IFF_HWBRIDGE.
>

If you introduce IFF_HWBRIDGE - I think that would satisfy the
distinction. The question then is why not just tag it IFF_BRIDGE?

>
> # ip link set dev bridge0 master bridge1
> RTNETLINK answers: Too many levels of symbolic links
>

Why?  If the original rationale was to limit the
broadcast domain scope, it sounds strange that a bridge in
the form of a macvlan is allowed.

> in the bridge case this doesn't work. But you can stack a macvlan
> on top of the bridge port,
>
> # ip link add link bridge0 type macvlan mode vepa
>
> 11: macvlan0@bridge0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop
> state DOWN mode DEFAULT group default
>
> And macvlans on macvlans is OK as well.
>
> # ip link add link macvlan0 type macvlan mode vepa
>
> [...]
>

Ok, I need to let that sink in. Cool actually.


>
> If its useful then we should. You can track them down in userspace
> via /sys/class/net/ or looking for offloaded netdevices that point
> to the interface but a flag is definitely more direct.
>

I prefer a flag. Then i can deal with it via netlink.

cheers,
jamal

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 21:08 UTC (permalink / raw)
  To: vyasevic, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA8F8B.3080500@redhat.com>

On 02/11/14 16:00, Vlad Yasevich wrote:
> On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:

>
> Because it is just a multi-function nic that isn't tagged with any
> kind of bridge flag.  As John said, this might be useful, but it is
> not done yet.
>

Ok, fair enough. Someone should send a patch - John perhaps.

>
> Not directly.  However, if you put a layered software device in between
> (vlan, bond, macvlan), then you can add that device to another bridge.
> In fact, people do that to get GVRP working with VMs.
>

Do you recall the reasoning behind it?


>> It certainly has some equivalent semantics (looks at dst MAC then
>> picks the port). Possible to add Vlans as well?
>
> I suppose.   You can do things like:
> # ip link add link eth0 dev vlan100 type vlan protocol 802.1Q id 100
> # ip link add link vlan100 dev mac100 type macvlan
>
> Now, you have a macvlan (mac100) that will only receive vlan100 traffic.
> Expressing this in terms of fdb would be a bit difficult since each
> interface is separate and eth0 doesn't really know about the stack.
> It would require quite a lot of code.
>

nice.

>> Why dont we tag such a thing as a bridge then?
>>
>
> Because they are not always a bridge.  It could be just a nic capable of
> mac filtering.
>

I think in one of the modes it is merely a filter.
But if you turn on this other feature, it is a bridge.

>
> Didn't realize it has a different connotation for vxlan.  Then you probably
> don't want to include and support it in the bridge fdb show command.

Thats what i thought you said earlier ;->

cheers,
jamal

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 21:12 UTC (permalink / raw)
  To: vyasevic, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA9167.2040305@mojatatu.com>

On 02/11/14 16:08, Jamal Hadi Salim wrote:
> On 02/11/14 16:00, Vlad Yasevich wrote:
>> On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:
>

> I think in one of the modes it is merely a filter.
> But if you turn on this other feature, it is a bridge.

IOW, VEPA should turn off IFF_BRIDGE and VEB should
turn it on. We probably need an event generated.


cheers,
jamal

^ permalink raw reply

* Re: 3.14-mw regression: rtl8169 WARNING: DMA-API: exceeded 7 overlapping mappings of pfn 55ebe
From: Eric Dumazet @ 2014-02-11 21:28 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: Dan Williams, Konrad Rzeszutek Wilk, Wei Liu, Francois Romieu,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1434499348.20140211205617@eikelenboom.it>

On Tue, 2014-02-11 at 20:56 +0100, Sander Eikelenboom wrote:
> Hi Dan,
> 
> FYI just tested and put Xen out of the equation (booting baremetal) and it still persists.
> 
> I tried something else .. don't know if it gives you anymore insights, but it's worth the try:
> 
> diff --git a/lib/dma-debug.c b/lib/dma-debug.c
> index 2defd13..0fe5b75 100644
> --- a/lib/dma-debug.c
> +++ b/lib/dma-debug.c
> @@ -474,11 +474,11 @@ static int active_pfn_set_overlap(unsigned long pfn, int overlap)
>         return overlap;
>  }
> 
> -static void active_pfn_inc_overlap(unsigned long pfn)
> +static void active_pfn_inc_overlap(struct dma_debug_entry *ent)
>  {
> -       int overlap = active_pfn_read_overlap(pfn);
> +       int overlap = active_pfn_read_overlap(ent->pfn);
> 
> -       overlap = active_pfn_set_overlap(pfn, ++overlap);
> +       overlap = active_pfn_set_overlap(ent->pfn, ++overlap);
> 
>         /* If we overflowed the overlap counter then we're potentially
>          * leaking dma-mappings.  Otherwise, if maps and unmaps are
> @@ -486,15 +486,43 @@ static void active_pfn_inc_overlap(unsigned long pfn)
>          * debug_dma_assert_idle() as the pfn may be marked idle
>          * prematurely.
>          */
> +
>         WARN_ONCE(overlap > ACTIVE_PFN_MAX_OVERLAP,
>                   "DMA-API: exceeded %d overlapping mappings of pfn %lx\n",
> -                 ACTIVE_PFN_MAX_OVERLAP, pfn);
> +                 ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
> +
> +       if(overlap > ACTIVE_PFN_MAX_OVERLAP){
> +
> +               dev_info(ent->dev, "DMA-API: exceeded %d overlapping mappings of pfn %lx .. start dump\n", ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
> +               int idx;
> +
> +               for (idx = 0; idx < HASH_SIZE; idx++) {
> +                    struct hash_bucket *bucket = &dma_entry_hash[idx];
> +                    struct dma_debug_entry *entry;
> +                   unsigned long flags;
> +
> +                    list_for_each_entry(entry, &bucket->list, list) {
> +                                       if (entry->pfn == ent->pfn) {
> +                                           dev_info(entry->dev, "%s idx %d P=%Lx N=%lx D=%Lx L=%Lx %s %s\n",
> +                                                type2name[entry->type], idx,
> +                                                phys_addr(entry), entry->pfn,
> +                                                entry->dev_addr, entry->size,
> +                                                dir2name[entry->direction],
> +                                               maperr2str[entry->map_err_type]);
> +                                       }
> +                    }
> +               }
> +               dev_info(ent->dev, "DMA-API: exceeded %d overlapping mappings of pfn %lx .. end of dump\n", ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
> +       }
>  }
> 
> 
> @@ -505,10 +533,10 @@ static int active_pfn_insert(struct dma_debug_entry *entry)
> 
>         spin_lock_irqsave(&radix_lock, flags);
>         rc = radix_tree_insert(&dma_active_pfn, entry->pfn, entry);
> -       if (rc == -EEXIST)
> -               active_pfn_inc_overlap(entry->pfn);
> +       if (rc == -EEXIST){
> +               active_pfn_inc_overlap(entry);
> +       }
>         spin_unlock_irqrestore(&radix_lock, flags);
> -
>         return rc;
>  }
> 
> 
> This results in:
> [   27.708678] r8169 0000:0a:00.0 eth1: link down
> [   27.712102] r8169 0000:0a:00.0 eth1: link down
> [   28.015340] r8169 0000:0b:00.0 eth0: link down
> [   28.015368] r8169 0000:0b:00.0 eth0: link down
> [   29.654844] r8169 0000:0b:00.0 eth0: link up
> [   30.278542] r8169 0000:0a:00.0 eth1: link up
> [   60.829503] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
> [   69.708979] EXT4-fs (dm-42): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
> [   76.128678] EXT4-fs (dm-43): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
> [   82.922836] EXT4-fs (dm-44): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
> [   89.232889] EXT4-fs (dm-45): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
> [   95.359859] EXT4-fs (dm-46): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
> [  101.638559] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
> [  218.073407] ------------[ cut here ]------------
> [  218.080983] WARNING: CPU: 5 PID: 0 at lib/dma-debug.c:492 add_dma_entry+0xf1/0x210()
> [  218.088550] DMA-API: exceeded 7 overlapping mappings of pfn 3c421
> [  218.095988] Modules linked in:
> [  218.103270] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W    3.14.0-rc2-20140211-pcireset-net-btrevert-xenblock-dmadebug5+ #1
> [  218.110712] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
> [  218.118134]  0000000000000009 ffff88003fd437b8 ffffffff81b809c4 ffff88003e308000
> [  218.125556]  ffff88003fd43808 ffff88003fd437f8 ffffffff810c985c 0000000000000000
> [  218.132917]  00000000ffffffef 0000000000000036 ffff88003d9d3c00 0000000000000282
> [  218.140154] Call Trace:
> [  218.147193]  <IRQ>  [<ffffffff81b809c4>] dump_stack+0x46/0x58
> [  218.154271]  [<ffffffff810c985c>] warn_slowpath_common+0x8c/0xc0
> [  218.161293]  [<ffffffff810c9946>] warn_slowpath_fmt+0x46/0x50
> [  218.168227]  [<ffffffff814f2cfa>] ? active_pfn_read_overlap+0x3a/0x70
> [  218.175116]  [<ffffffff814f41d1>] add_dma_entry+0xf1/0x210
> [  218.181865]  [<ffffffff814f4646>] debug_dma_map_page+0x126/0x150
> [  218.188484]  [<ffffffff817aabeb>] rtl8169_start_xmit+0x21b/0xa20
> [  218.195042]  [<ffffffff81a01877>] ? dev_queue_xmit_nit+0x1d7/0x260
> [  218.201553]  [<ffffffff81a0188f>] ? dev_queue_xmit_nit+0x1ef/0x260
> [  218.207965]  [<ffffffff81a016a5>] ? dev_queue_xmit_nit+0x5/0x260
> [  218.214290]  [<ffffffff81a0661f>] dev_hard_start_xmit+0x37f/0x590
> [  218.220481]  [<ffffffff81a26cae>] sch_direct_xmit+0xfe/0x280
> [  218.226529]  [<ffffffff81a06a7f>] __dev_queue_xmit+0x24f/0x660
> [  218.232521]  [<ffffffff81a06835>] ? __dev_queue_xmit+0x5/0x660
> [  218.238439]  [<ffffffff81ab21b9>] ? ip_output+0x59/0xf0
> [  218.244272]  [<ffffffff81a06eb0>] dev_queue_xmit+0x10/0x20
> [  218.250043]  [<ffffffff81ab076b>] ip_finish_output+0x2cb/0x670
> [  218.255682]  [<ffffffff81ab21b9>] ? ip_output+0x59/0xf0
> [  218.261168]  [<ffffffff81ab21b9>] ip_output+0x59/0xf0
> [  218.266559]  [<ffffffff81aad596>] ip_forward_finish+0x76/0x1a0
> [  218.271883]  [<ffffffff81aad86b>] ip_forward+0x1ab/0x440
> [  218.277148]  [<ffffffff81aab380>] ip_rcv_finish+0x150/0x660
> [  218.282373]  [<ffffffff81aabe3b>] ip_rcv+0x22b/0x370
> [  218.287436]  [<ffffffff81b09bc7>] ? packet_rcv_spkt+0x47/0x190
> [  218.292372]  [<ffffffff81a03272>] __netif_receive_skb_core+0x722/0x8f0
> [  218.297328]  [<ffffffff81a02c75>] ? __netif_receive_skb_core+0x125/0x8f0
> [  218.302304]  [<ffffffff8112ce6e>] ? getnstimeofday+0xe/0x30
> [  218.307296]  [<ffffffff819f42c5>] ? __netdev_alloc_frag+0x175/0x1b0
> [  218.312166]  [<ffffffff81a03461>] __netif_receive_skb+0x21/0x70
> [  218.316904]  [<ffffffff81a034d3>] netif_receive_skb_internal+0x23/0xf0
> [  218.321596]  [<ffffffff81a04d2d>] napi_gro_receive+0x8d/0x100
> [  218.326219]  [<ffffffff817a7bc3>] rtl8169_poll+0x2d3/0x680
> [  218.330754]  [<ffffffff8112e366>] ? update_wall_time+0x356/0x690
> [  218.335208]  [<ffffffff81a03a0a>] net_rx_action+0x18a/0x2c0
> [  218.339595]  [<ffffffff810ce6f1>] ? __do_softirq+0xc1/0x300
> [  218.343890]  [<ffffffff810ce767>] __do_softirq+0x137/0x300
> [  218.348085]  [<ffffffff810cec9a>] irq_exit+0xaa/0xd0
> [  218.352203]  [<ffffffff81b8e5a7>] do_IRQ+0x67/0x110
> [  218.356225]  [<ffffffff81b8b772>] common_interrupt+0x72/0x72
> [  218.360156]  <EOI>  [<ffffffff810536e6>] ? native_safe_halt+0x6/0x10
> [  218.364087]  [<ffffffff81113a7d>] ? trace_hardirqs_on+0xd/0x10
> [  218.367935]  [<ffffffff81020632>] default_idle+0x32/0xd0
> [  218.371691]  [<ffffffff8102071e>] amd_e400_idle+0x4e/0x140
> [  218.375360]  [<ffffffff81020f86>] arch_cpu_idle+0x36/0x40
> [  218.378921]  [<ffffffff81120a01>] cpu_startup_entry+0xa1/0x2a0
> [  218.382508]  [<ffffffff810473cf>] start_secondary+0x1af/0x210
> [  218.386133] ---[ end trace 0e12f271209e2c18 ]---
> [  218.389769] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c421 .. start dump
> [  218.393566] r8169 0000:0b:00.0: single idx 563 P=3c421100 N=3c421 D=c66100 L=36 DMA_TO_DEVICE dma map error checked
> [  218.397379] r8169 0000:0b:00.0: single idx 563 P=3c4212c0 N=3c421 D=c672c0 L=36 DMA_TO_DEVICE dma map error checked
> [  218.401094] r8169 0000:0b:00.0: single idx 564 P=3c421480 N=3c421 D=c68480 L=36 DMA_TO_DEVICE dma map error checked
> [  218.404730] r8169 0000:0b:00.0: single idx 564 P=3c421640 N=3c421 D=c69640 L=36 DMA_TO_DEVICE dma map error checked
> [  218.408310] r8169 0000:0b:00.0: single idx 565 P=3c421800 N=3c421 D=c6a800 L=36 DMA_TO_DEVICE dma map error checked
> [  218.411762] r8169 0000:0b:00.0: single idx 565 P=3c4219c0 N=3c421 D=c6b9c0 L=36 DMA_TO_DEVICE dma map error checked
> [  218.415075] r8169 0000:0b:00.0: single idx 566 P=3c421b80 N=3c421 D=c6cb80 L=9b DMA_TO_DEVICE dma map error checked
> [  218.418305] r8169 0000:0b:00.0: single idx 566 P=3c421dc0 N=3c421 D=c6ddc0 L=36 DMA_TO_DEVICE dma map error checked
> [  218.421502] r8169 0000:0b:00.0: single idx 567 P=3c421f80 N=3c421 D=c6ef80 L=36 DMA_TO_DEVICE dma map error not checked
> [  218.424677] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c421 .. end of dump
> [  218.429050] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c423 .. start dump
> [  218.432225] r8169 0000:0b:00.0: single idx 571 P=3c423040 N=3c423 D=c76040 L=36 DMA_TO_DEVICE dma map error checked
> [  218.435408] r8169 0000:0b:00.0: single idx 571 P=3c423200 N=3c423 D=c77200 L=36 DMA_TO_DEVICE dma map error checked
> [  218.438578] r8169 0000:0b:00.0: single idx 572 P=3c4233c0 N=3c423 D=c783c0 L=36 DMA_TO_DEVICE dma map error checked
> [  218.441695] r8169 0000:0b:00.0: single idx 572 P=3c423580 N=3c423 D=c79580 L=7b DMA_TO_DEVICE dma map error checked
> [  218.444783] r8169 0000:0b:00.0: single idx 573 P=3c423780 N=3c423 D=c7a780 L=9b DMA_TO_DEVICE dma map error checked
> [  218.447825] r8169 0000:0b:00.0: single idx 573 P=3c4239c0 N=3c423 D=c7b9c0 L=6b DMA_TO_DEVICE dma map error checked
> [  218.450844] r8169 0000:0b:00.0: single idx 574 P=3c423bc0 N=3c423 D=c7cbc0 L=7b DMA_TO_DEVICE dma map error checked
> [  218.453814] r8169 0000:0b:00.0: single idx 574 P=3c423dc0 N=3c423 D=c7ddc0 L=7b DMA_TO_DEVICE dma map error checked
> [  218.456793] r8169 0000:0b:00.0: single idx 575 P=3c423fc0 N=3c423 D=c7efc0 L=7b DMA_TO_DEVICE dma map error not checked
> [  218.459772] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c423 .. end of dump
> [  218.473504] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c716 .. start dump
> [  218.475662] r8169 0000:0b:00.0: single idx 586 P=3c7160c0 N=3c716 D=c940c0 L=36 DMA_TO_DEVICE dma map error checked
> [  218.477874] r8169 0000:0b:00.0: single idx 586 P=3c716280 N=3c716 D=c95280 L=36 DMA_TO_DEVICE dma map error checked
> [  218.480075] r8169 0000:0b:00.0: single idx 587 P=3c716440 N=3c716 D=c96440 L=36 DMA_TO_DEVICE dma map error checked
> [  218.482245] r8169 0000:0b:00.0: single idx 587 P=3c716600 N=3c716 D=c97600 L=36 DMA_TO_DEVICE dma map error checked
> [  218.484390] r8169 0000:0b:00.0: single idx 588 P=3c7167c0 N=3c716 D=c987c0 L=42 DMA_TO_DEVICE dma map error checked
> [  218.486510] r8169 0000:0b:00.0: single idx 588 P=3c7169c0 N=3c716 D=c999c0 L=36 DMA_TO_DEVICE dma map error checked
> [  218.488603] r8169 0000:0b:00.0: single idx 589 P=3c716b80 N=3c716 D=c9ab80 L=42 DMA_TO_DEVICE dma map error checked
> [  218.490682] r8169 0000:0b:00.0: single idx 589 P=3c716d80 N=3c716 D=c9bd80 L=42 DMA_TO_DEVICE dma map error checked
> [  218.492735] r8169 0000:0b:00.0: single idx 590 P=3c716f80 N=3c716 D=c9cf80 L=42 DMA_TO_DEVICE dma map error not checked
> [  218.494788] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c716 .. end of dump
> 
> --
> Sander
> 


Incoming frames might be taken out of order-3 pages.

With regular Ethernet frames, this is 21 frames per order-3 page.

ACTIVE_PFN_MAX_OVERLAP seems too small.

An alternative would be to use order-0 pages only if CONFIG_DMA_API_DEBUG
is set. Not sure if it works if PAGE_SIZE=65536 ....
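
The arithmetic behind this can be sketched in a few lines of C. This is only an illustration under stated assumptions: the helper names are hypothetical, the 1536-byte per-frame buffer size is an assumption chosen to reproduce the figure of 21 for a 32768-byte order-3 page, and the 448-byte spacing is read off the P= addresses in the dump above (0x1c0 apart for most entries).

```c
#include <stddef.h>

/* Limit quoted in the warning text above. */
#define ACTIVE_PFN_MAX_OVERLAP 7

/* Whole buffers of buf_bytes that fit into region_bytes (0 if buf_bytes is 0). */
static size_t buffers_per_region(size_t region_bytes, size_t buf_bytes)
{
	return buf_bytes ? region_bytes / buf_bytes : 0;
}

/*
 * The first mapping of a pfn is not counted as an overlap; every further
 * mapping of the same pfn bumps the counter, and the warning fires once
 * the counter exceeds ACTIVE_PFN_MAX_OVERLAP.
 */
static int overlap_would_warn(size_t mappings_of_pfn)
{
	return mappings_of_pfn > 0 &&
	       mappings_of_pfn - 1 > ACTIVE_PFN_MAX_OVERLAP;
}
```

Here buffers_per_region(32768, 1536) gives 21 and buffers_per_region(4096, 448) gives 9. Nine simultaneous mappings of one pfn mean eight overlaps, which is above the limit of 7 and matches the nine entries listed per pfn in the dump.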

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f589c9af8cbf..1b9995adfd29 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1924,7 +1924,11 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
+#if defined(CONFIG_DMA_API_DEBUG)
+#define NETDEV_FRAG_PAGE_MAX_ORDER 0
+#else
 #define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
+#endif
 #define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
 #define NETDEV_PAGECNT_MAX_BIAS	   NETDEV_FRAG_PAGE_MAX_SIZE
 

^ permalink raw reply related

* Re: [RFC 2/2] xen-netback: disable multicast and use a random hw MAC address
From: Luis R. Rodriguez @ 2014-02-11 21:53 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev@vger.kernel.org, xen-devel, Paul Durrant, Wei Liu, kvm,
	linux-kernel@vger.kernel.org
In-Reply-To: <1392108205.22033.16.camel@dagon.hellion.org.uk>

Cc'ing kvm folks as they may have a shared interest on the shared
physical case with the bridge (non NAT).

On Tue, Feb 11, 2014 at 12:43 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Mon, 2014-02-10 at 14:29 -0800, Luis R. Rodriguez wrote:
>> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>>
>> Although the xen-netback interfaces do not participate in the
>> link as a typical Ethernet device, interfaces for them are
>> still required under the current architecture. IPv6 addresses
>> do not need to be created or assigned on the xen-netback interfaces
>> however, even if the frontend devices do need them, so clear the
>> multicast flag to ensure the net core does not initiate IPv6
>> Stateless Address Autoconfiguration.
>
> How does disabling SAA flow from the absence of multicast?

See patch 1 in this series [0], but I explain the issue I see with
this on the cover letter [1]. In summary the RFCs on IPv6 make it
clear you need multicast for Stateless address autoconfiguration
(SLAAC is the preferred acronym) and DAD, however the net core has not
made this a requirement, and hence the patch. The caveat which I
address on the cover letter needs to be seriously considered though.

[0] http://marc.info/?l=linux-netdev&m=139207142110535&w=2
[1] http://marc.info/?l=linux-netdev&m=139207142110536&w=2

> Surely these should be controlled logically independently even if there is some
> notional linkage.

When a node hops on a network it will query its network by sending a
router solicitation multicast request for its configuration
parameters, the router can respond with router advertisements to
disable SLAAC.

Apart from that we have no other means to disable SLAAC neatly, and as
I gather that would be counter to the IPv6 RFCs anyway, and that makes
sense.

> Can SAA not be disabled directly?

Nope. The ipv6 core assumes all devices want ipv6 and this is done upon
netdev registration, and as I noted on my patch 1 description --
although ipv6 supports a module parameter to disable autoconfiguration,
RFC 4862 Section 5.4 makes it clear that DAD *MUST* be performed on all
unicast addresses prior to assigning them to an interface, regardless
of whether they are obtained through SLAAC, DHCPv6, or manual
configuration.

Upon NETDEV_REGISTER the ipv6 core has 2 struct ipv6_devconf sets of
configurations which could get slapped onto devices. Neither of these
disables autoconfiguration; it's not a knob designed to let devices
disable it freely -- and technically the RFCs for IPv6 simply
imply that you should not use IPv6 if you cannot support multicast.
Given that the noautoconf module parameter exists, though, I think my
patch can be considered for upstream after addressing the caveat I noted
on not-NBMA links (and I think I now know how to address that, we
can just make the MULTICAST flag more meaningful for the dev->type).

A nasty hack to disable IPv6 for testing purposes is to set the MTU to
something lower than IPV6_MIN_MTU (1280) and in fact IPv4 can also be
disabled by setting it to 67, that will disable both IPv6 and IPv4.
That's obviously just that, a nasty nasty hack, but useful for easy
testing of disabling either ipv6 completely or both.
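
The two thresholds behind this hack can be written down as a small C sketch. It is only an illustration, not kernel code: the helper names are hypothetical, IPV6_MIN_MTU is the value quoted above, and the IPv4 minimum of 68 is an assumption consistent with 67 being the value that disables IPv4.

```c
#include <stdbool.h>

#define IPV4_MIN_MTU 68    /* assumption: smallest MTU IPv4 will run on */
#define IPV6_MIN_MTU 1280  /* value quoted in the text above */

/* True if a device with this MTU could still carry IPv4. */
static bool mtu_allows_ipv4(unsigned int mtu)
{
	return mtu >= IPV4_MIN_MTU;
}

/* True if a device with this MTU could still carry IPv6. */
static bool mtu_allows_ipv6(unsigned int mtu)
{
	return mtu >= IPV6_MIN_MTU;
}
```

With an MTU of 1279 only IPv6 drops out; with an MTU of 67 both checks fail.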

>>  Clearing the multicast
>> flag is required given that the net_device is using the
>> ether_setup() helper.
>>
>> There's also no good reason why the special MAC address of
>> FE:FF:FF:FF:FF:FF is being used other than to avoid issues
>> with STP,
>
> With your change there is a random probability on reboot that the bridge
> will end up with a randomly generated MAC address instead of a static
> MAC address (usually that of the physical NIC on the bridge), since the
> bridge tends to inherit the lowest MAC of any port.

I had not considered the bridge taking the lowest MAC address of any
port added! So that was one of the tricks with the fixed MAC address
of FE:FF:FF:FF:FF:FF, to ensure the bridge would skip using its mac
address when it was later added as a port. Another collateral issue if
this is *not* considered is that if a xen-netback interface has a MAC
address lower than the general interface one and if you shutdown that
guest, and therefore removed it from the bridge, the bridge MAC
address will also change once again back to the general interface one.
This will cause a hiccup on accessing the device, while ARP settles
things. Doing a massive shutdown / reboot of guests that have MAC
addresses lower than the general interface one results in a
series of MAC address changes on the bridge.
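
The "lowest MAC wins" behaviour described above can be modelled with a short C sketch. This is a simplified model under stated assumptions, not the bridge code: lowest_mac() is a hypothetical helper that compares addresses byte by byte, and the port list is supplied by the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/*
 * Return the numerically lowest MAC address among n ports, or NULL if
 * there are no ports.  Addresses are compared byte by byte, most
 * significant byte first.
 */
static const uint8_t *lowest_mac(const uint8_t (*ports)[ETH_ALEN], size_t n)
{
	const uint8_t *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!best || memcmp(ports[i], best, ETH_ALEN) < 0)
			best = ports[i];
	}
	return best;
}
```

In this model FE:FF:FF:FF:FF:FF loses against almost any physical NIC address, so adding or removing such a port never changes the result. A randomly generated backend address can compare lower than the NIC, and the result then flips when that port joins or leaves.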

FWIW kvm seems to completely randomize the MAC address of the backend
TAP interfaces (while the front end virtio driver fully randomizes
it), but note that in the NAT use case where only the TAP interfaces
get added the above is not an issue, although I suspect if the shared
connection is used this could be a problem, it will depend on what
tools create the TAP interface and how.

I suspect we may have a shared concern here and I wonder if kvm hit
the snags described above on the shared physical cases. Curious if kvm
folks have seen these issues?

> Since IP configuration is done on the bridge this will break DHCP,
> whether it is using static or dynamic mappings from MAC to IP address,
> and the host will randomly change IP address on reboot.

Its beyond that, as I noted as well there can be issues upon shutdown.

> So Nack for that reason.

Makes sense. Will think about this a bit more.

>>  since using this can create an issue if a user
>> decides to enable multicast on the backend interfaces
>
> Please explain what this issue is.

I explained this on the cover letter but should have elaborated more
here. The *known* and *reported* issue is that xen-backend interfaces
can end up doing SLAAC and you'd obviously end up in some situations where
the MAC address and IP address clash, despite the architecture of IPv6
to randomize time requests for neighbor solicitations, and DAD.
Ultimately a series of services can end up filling your log messages
with tons of warnings.
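
Why identical MAC addresses lead to identical IPv6 addresses can be shown with a small C sketch. It assumes the interface identifier is derived from the MAC with the modified EUI-64 scheme (flip the universal/local bit, insert ff:fe in the middle); interfaces configured to use other identifier schemes would behave differently. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/*
 * Build the fe80::/64 link-local address that modified EUI-64 derives
 * from a 48-bit MAC address.  addr receives 16 bytes in network order.
 */
static void mac_to_link_local(const uint8_t mac[6], uint8_t addr[16])
{
	memset(addr, 0, 16);
	addr[0] = 0xfe;			/* fe80::/64 prefix */
	addr[1] = 0x80;
	addr[8] = mac[0] ^ 0x02;	/* flip the universal/local bit */
	addr[9] = mac[1];
	addr[10] = mac[2];
	addr[11] = 0xff;		/* ff:fe inserted in the middle */
	addr[12] = 0xfe;
	addr[13] = mac[3];
	addr[14] = mac[4];
	addr[15] = mac[5];
}
```

Under that assumption every backend that shares FE:FF:FF:FF:FF:FF derives the same address, fe80::fcff:ffff:feff:ffff, which is the kind of duplicate that DAD then reports.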

Another issue, not reported but I suspect critical, which can bite
both xen and kvm in the ass, is described in Appendix A of RFC 4862 [2]
which considers the issues of getting duplicates of packets on the
same link with the same link layer address. I think to address that we
can also consider dev->type into all the different cases.

What drove these patches is trying to find a proper upstream approach
to Olaf's old xen ipv6-no-autoconf patch [3]. Although not stated on
the patch I have seen some old year 2006 internal reports even with
the static FE:FF:FF:FF:FF:FF MAC address, whereby the xen-netback
interfaces kicked off IPv6 SLAAC and DAD. IPv6 SLAAC should trigger
once the link goes up.

My preference, rather than trying to simply disable ipv6, is actually
seeing how xen-netback interfaces (and kvm TAP topology) can be
simplified further. As I see it there is tons of code which could
end up being used on these xen-netback interfaces (and TAP for kvm)
which is simply not needed for the use case of just sending data
back and forth between host and guest: ipv6 is not needed at all, and
I tried to test removing ipv4, but ran into issues.

[2] http://tools.ietf.org/html/rfc4862#appendix-A
[3] https://gitorious.org/opensuse/kernel-source/source/8e16582178a29b03e850468004a47e7be5ed3005:patches.xen/ipv6-no-autoconf

> Also how can a user enable multicast on the b/e?

ip link set dev <devname> multicast on
ip link set dev <devname> multicast off

> AFAIK only Solaris ever
> implemented the m/c bits of the Xen PV network protocol (not that I
> wouldn't welcome attempts to add it to other platforms)

Do you mean kernel configuration multicast ? Or networking ?

 Luis

^ permalink raw reply

* fe80::/64 route missing on GRE tunnels
From: Steinar H. Gunderson @ 2014-02-11 21:55 UTC (permalink / raw)
  To: netdev; +Cc: itk-intern

Hi,

It seems that recent kernels no longer automatically add the link-local route
to GRE tunnels. 3.10.27 does it right:

  root@morgental:~# ip tunnel add foo mode gre remote 1.2.3.4 ttl 64
  root@morgental:~# ip link set foo up mtu 1468
  root@morgental:~# ip -6 route show dev foo
  fe80::/64  proto kernel  metric 256 

but on 3.13.1, no such route shows up. I can add it manually, though.

This broke our in-house IPv6 tunnel broker (which essentially uses link-local
addresses over GRE tunnels to bring up BGP sessions). Do you think you could
have a look?

/* Steinar */
-- 
Homepage: http://www.sesse.net/

^ permalink raw reply

* Re: fe80::/64 route missing on GRE tunnels
From: Steinar H. Gunderson @ 2014-02-11 21:59 UTC (permalink / raw)
  To: netdev, itk-intern
In-Reply-To: <20140211215510.GA6994@sesse.net>

On Tue, Feb 11, 2014 at 10:55:11PM +0100, Steinar H. Gunderson wrote:
> but on 3.13.1, no such route shows up. I can add it manually, though.

Correction; I can add it manually, but only to one GRE device at a time:

  root@altersex:~$ ip -6 route add fe80::/64 dev k_molvenfinnoy
  root@altersex:~$ ip -6 route add fe80::/64 dev k_sessesveits
  RTNETLINK answers: File exists
  root@altersex:~$ ip -6 route del fe80::/64 dev k_molvenfinnoy
  root@altersex:~$ ip -6 route add fe80::/64 dev k_sessesveits
  root@altersex:~$

/* Steinar */
-- 
ITK-pang
http://www.sesse.net/

^ permalink raw reply

* Q: why calling kobject_put() twice?
From: Cong Wang @ 2014-02-11 22:24 UTC (permalink / raw)
  To: netdev; +Cc: David Miller

Hi,

There are many drivers either using free_netdev() as ->destructor() or
calling it in their own destructor(), and in free_netdev() we call
put_device(&dev->dev) at last. But in netdev_run_todo(), we call
kobject_put(&dev->dev.kobj) again after calling ->destructor(). So,
what's the point of doing this?

Also, if free_netdev() is supposed to free the netdev as its name
suggests, then why is it still safe to read some fields of it after that?

I must be missing something here.

Thanks.

^ permalink raw reply

* Re: large degradation in ip netns add/exec performance in 3.13?
From: Rick Jones @ 2014-02-11 22:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <52F1461D.8050703@hp.com>

On 02/04/2014 11:57 AM, Rick Jones wrote:
> Hi -
>
> I have a dinky little script which creates what I've been calling "fake
> routers."  It is far from a complete fake router, but it shows what
> appears to be a very large degradation in performance in 3.13 compared
> to 3.12.9 which itself is slow compared to a 3.5.0-44 Canonical
> kernel with some upstream commits included:
>
>
> Start/End    Average Rate of Creation per Second
> "Router" Count  3.5.0-44+  3.12.9  3.13.0
> ------------------------------------------------------
> 0 to 250          7.58      5.56    2.55
> 250 to 500        7.14      5.81    2.55
> 500 to 750        6.41      5.56    2.55
> 750 to 1000       6.10      4.90    2.55
> 1000 to 1250      5.68      4.39    2.50
> 1250 to 1500      5.21      4.24    2.36
> 1500 to 1750      5.00      3.85    2.23
> 1750 to 2000      4.81      3.62    2.21
> 2000 to 2250      4.55      3.47    2.21
> 2250 to 2500      4.31      3.29    2.14
> 2500 to 2750      4.03      3.09    2.05
> 2750 to 3000      3.73      3.09    2.02
> 3000 to 3250      3.62      2.81    2.02
> 3250 to 3500      3.38      2.72    1.97
> 3500 to 3750      3.21      2.55    1.92
> 3750 to 4000      3.01      2.48    1.87

I see that the 3.13.0 kernel will scale rather well - out to 16 
concurrent streams creating these "fake routers" - it drops-off at 32, 
but this system is only 16 cores, 32 threads (two socket E5-2670) so I 
suppose that really shouldn't come as a surprise.

I did have the system lockup once at 16 concurrent streams on the 3.13.0 
kernel.  Didn't happen the next two times I tried.

happy benchmarking,

rick jones

Time,3.5.0-44+ 4 streams,3.13.0 4 streams,3.13.0 8 streams,3.13.0 16 streams,3.13.0 32 streams
0,0,0,0,0,0
30,518,419,840,1357,1327
60,914,826,1501,1988,1905
90,1242,1189,1995,2465,2332
120,1532,1519,2403,2872,2680
150,1792,1811,2751,3228,2985
180,2060,2089,3076,3537,3278
210,,2347,3359,3830,3542
232,,,,4016,
240,2447,2588,3627,,3778
270,,,3867,,
276,,,,,4032
289,,,4008,,
300,2808,2999,,,
330,,,,,
360,3131,3379,,,
390,,,,,
420,3418,3747,,,
450,,,,,
468,,4004,,,
480,3690,,,,
510,,,,,
540,3945,,,,
556,4004,,,,

^ permalink raw reply
