Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next RFC 1/5] net-timestamp: no-payload option
From: Andy Lutomirski @ 2015-01-09 20:55 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David S. Miller, Richard Cochran,
	Eric Dumazet
In-Reply-To: <CA+FuTSdWWv65CNQYv1d9+WG1X54eGyP_+mbh3dMHw_HmH8wzYA@mail.gmail.com>

On Fri, Jan 9, 2015 at 12:33 PM, Willem de Bruijn <willemb@google.com> wrote:
> On Fri, Jan 9, 2015 at 3:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Jan 9, 2015 at 11:47 AM, Willem de Bruijn <willemb@google.com> wrote:
>>> On Fri, Jan 9, 2015 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Fri, Jan 9, 2015 at 9:31 AM, Willem de Bruijn <willemb@google.com> wrote:
>>>>> From: Willem de Bruijn <willemb@google.com>
>>>>>
>>>>> Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
>>>>> timestamps, this loops timestamps on top of empty packets.
>>>>>
>>>>> Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
>>>>> cmsg reception (aside from timestamps) are no longer possible. This
>>>>> works together with a follow on patch that allows administrators to
>>>>> only allow tx timestamping if it does not loop payload or metadata.
>>>>
>>>> If this loses IP_PKTINFO, that will be a bit unfortunate.
>>>>
>>>
>>> If it doesn't, then we might as well loop the entire payload. For applications
>>> that need pktinfo or other cmsg, do not select the option.
>>
>> Right, but it loses the ability to get the ifindex if the sysctl is
>> set to the conservative option, which I don't think is desirable.
>
> Understood. I just find the alternative, where the no-data policy is
> weakened by exceptions, even less desirable. That makes it
> harder to explain what the sysctl/option do and what the security
> implications are.

Agreed.

If there was no-payload but not no-cmsg, then it would be a nice
middle ground, but I guess that's bad for some reason involving
accounting?

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next v2] tcp: avoid reducing cwnd when ACK+DSACK is received
From: Neal Cardwell @ 2015-01-09 20:36 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Eric Dumazet, Sébastien Barré, David Miller, Netdev,
	Gregory Detal, Nandita Dukkipati
In-Reply-To: <CAK6E8=dhW=0hRYHjjgBh0KwKOQfzHhwY2egusQ8S4tsoJb0Nsg@mail.gmail.com>

On Fri, Jan 9, 2015 at 2:43 PM, Yuchung Cheng <ycheng@google.com> wrote:
> Sebastien: I suggest breaking down by ACK types for readability. e.g.,

I like this approach, as well.  It breaks up the logic into smaller
pieces that are easier to comment and understand, without having
helper bool flags to read and mentally track. And this version passes
all our internal TLP tests, FWIW.

neal

^ permalink raw reply

* Re: [PATCH net-next RFC 1/5] net-timestamp: no-payload option
From: Willem de Bruijn @ 2015-01-09 20:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Network Development, David S. Miller, Richard Cochran,
	Eric Dumazet
In-Reply-To: <CALCETrWZy0=OnYhyCB8U7Km7vMcjB+Vc8ahm+ea4=0YiiPBCAQ@mail.gmail.com>

On Fri, Jan 9, 2015 at 3:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Jan 9, 2015 at 11:47 AM, Willem de Bruijn <willemb@google.com> wrote:
>> On Fri, Jan 9, 2015 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Jan 9, 2015 at 9:31 AM, Willem de Bruijn <willemb@google.com> wrote:
>>>> From: Willem de Bruijn <willemb@google.com>
>>>>
>>>> Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
>>>> timestamps, this loops timestamps on top of empty packets.
>>>>
>>>> Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
>>>> cmsg reception (aside from timestamps) are no longer possible. This
>>>> works together with a follow on patch that allows administrators to
>>>> only allow tx timestamping if it does not loop payload or metadata.
>>>
>>> If this loses IP_PKTINFO, that will be a bit unfortunate.
>>>
>>
>> If it doesn't, then we might as well loop the entire payload. For applications
>> that need pktinfo or other cmsg, do not select the option.
>
> Right, but it loses the ability to get the ifindex if the sysctl is
> set to the conservative option, which I don't think is desirable.

Understood. I just find the alternative, where the no-data policy is
weakened by exceptions, even less desirable. That makes it
harder to explain what the sysctl/option do and what the security
implications are.

^ permalink raw reply

* [PATCH] MAINTAINERS: add me as ibmveth maintainer
From: Thomas Falcon @ 2015-01-09 20:29 UTC (permalink / raw)
  To: netdev; +Cc: Santiago Leon, Brian King, Nathan Fontenot

Adding myself as the ibmveth maintainer and replacing
Santiago Leon.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Cc: Santiago Leon <santi_leon@yahoo.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index ddb9ac8..5cb9f6f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4748,7 +4748,7 @@ S:	Supported
 F:	drivers/scsi/ipr.*
 
 IBM Power Virtual Ethernet Device Driver
-M:	Santiago Leon <santil@linux.vnet.ibm.com>
+M:	Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
 L:	netdev@vger.kernel.org
 S:	Supported
 F:	drivers/net/ethernet/ibm/ibmveth.*
-- 
1.9.3

^ permalink raw reply related

* Re: [PATCH] net/fsl: Add mEMAC MDIO support to XGMAC MDIO
From: Emil Medve @ 2015-01-09 20:11 UTC (permalink / raw)
  To: shh.xie, netdev, davem; +Cc: Andy Fleming, Shaohui Xie
In-Reply-To: <1420364162-13109-1-git-send-email-shh.xie@gmail.com>

Hello Shao-Hui,


On 01/04/2015 03:36 AM, shh.xie@gmail.com wrote:
> From: Andy Fleming <afleming@gmail.com>
> 
> The Freescale mEMAC supports operating at 10/100/1000/10G, and
> its associated MDIO controller is likewise capable of operating
> both Clause 22 and Clause 45 MDIO buses. It is nearly identical
> to the MDIO controller on the XGMAC, so we just modify that
> driver.
> 
> Portions of this driver developed by:
> 
> Sandeep Singh <sandeep@freescale.com>
> Roy Zang <tie-fei.zang@freescale.com>
> 
> Signed-off-by: Andy Fleming <afleming@gmail.com>
> Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
> ---
>  drivers/net/ethernet/freescale/Kconfig      |  3 +-
>  drivers/net/ethernet/freescale/xgmac_mdio.c | 64 ++++++++++++++++++++++++-----
>  2 files changed, 55 insertions(+), 12 deletions(-)
> 
> 	...
> 
> diff --git a/drivers/net/ethernet/freescale/xgmac_mdio.c b/drivers/net/ethernet/freescale/xgmac_mdio.c
> index a352445..e0fc3d1 100644
> --- a/drivers/net/ethernet/freescale/xgmac_mdio.c
> +++ b/drivers/net/ethernet/freescale/xgmac_mdio.c
> 
> 	...
> 
> @@ -123,21 +144,39 @@ static int xgmac_mdio_write(struct mii_bus *bus, int phy_id, int regnum, u16 val
>  static int xgmac_mdio_read(struct mii_bus *bus, int phy_id, int regnum)
>  {
>  	struct tgec_mdio_controller __iomem *regs = bus->priv;
> -	uint16_t dev_addr = regnum >> 16;
> +	uint16_t dev_addr;
> +	uint32_t mdio_stat;
>  	uint32_t mdio_ctl;
>  	uint16_t value;
>  	int ret;
>  
> +	mdio_stat = in_be32(&regs->mdio_stat);
> +	if (regnum & MII_ADDR_C45) {
> +		dev_addr = (regnum >> 16) & 0x1f;
> +		mdio_stat |= MDIO_STAT_ENC;
> +	} else {
> +		dev_addr = regnum & 0x1f;
> +		mdio_stat = ~MDIO_STAT_ENC;

Shouldn't this be 'mdio_stat &= ~MDIO_STAT_ENC'?


Cheers,

^ permalink raw reply

* Re: [PATCH net-next RFC 1/5] net-timestamp: no-payload option
From: Andy Lutomirski @ 2015-01-09 20:02 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David S. Miller, Richard Cochran,
	Eric Dumazet
In-Reply-To: <CA+FuTSf4SqTuLmrJ971Wu3eaqc94qd93M_OCa9QCCDHOYwxF8A@mail.gmail.com>

On Fri, Jan 9, 2015 at 11:47 AM, Willem de Bruijn <willemb@google.com> wrote:
> On Fri, Jan 9, 2015 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Jan 9, 2015 at 9:31 AM, Willem de Bruijn <willemb@google.com> wrote:
>>> From: Willem de Bruijn <willemb@google.com>
>>>
>>> Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
>>> timestamps, this loops timestamps on top of empty packets.
>>>
>>> Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
>>> cmsg reception (aside from timestamps) are no longer possible. This
>>> works together with a follow on patch that allows administrators to
>>> only allow tx timestamping if it does not loop payload or metadata.
>>
>> If this loses IP_PKTINFO, that will be a bit unfortunate.
>>
>
> If it doesn't, then we might as well loop the entire payload. For applications
> that need pktinfo or other cmsg, do not select the option.

Right, but it loses the ability to get the ifindex if the sysctl is
set to the conservative option, which I don't think is desirable.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next RFC 1/5] net-timestamp: no-payload option
From: Willem de Bruijn @ 2015-01-09 19:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Network Development, David S. Miller, Richard Cochran,
	Eric Dumazet
In-Reply-To: <CALCETrU2kFYsYnhPnBWd3059k5Z6=8B==kwXU0xidAsR6EqgCg@mail.gmail.com>

On Fri, Jan 9, 2015 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Jan 9, 2015 at 9:31 AM, Willem de Bruijn <willemb@google.com> wrote:
>> From: Willem de Bruijn <willemb@google.com>
>>
>> Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
>> timestamps, this loops timestamps on top of empty packets.
>>
>> Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
>> cmsg reception (aside from timestamps) are no longer possible. This
>> works together with a follow on patch that allows administrators to
>> only allow tx timestamping if it does not loop payload or metadata.
>
> If this loses IP_PKTINFO, that will be a bit unfortunate.
>

If it doesn't, then we might as well loop the entire payload. For applications
that need pktinfo or other cmsg, do not select the option.

^ permalink raw reply

* Re: [PATCH net-next v2] tcp: avoid reducing cwnd when ACK+DSACK is received
From: Yuchung Cheng @ 2015-01-09 19:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, Sébastien Barré, David Miller, Netdev,
	Gregory Detal, Nandita Dukkipati
In-Reply-To: <1420734325.5947.61.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, Jan 8, 2015 at 8:25 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Thu, 2015-01-08 at 10:49 -0500, Neal Cardwell wrote:
> > On Thu, Jan 8, 2015 at 10:39 AM, Sébastien Barré
> > <sebastien.barre@uclouvain.be> wrote:
> > > What do you and Neal think ?
> >
> > My preference is to have the whole expression detecting the case where
> > the receiver got the probe packet encoded in a single expression. I
> > don't have a strong feeling about whether it should be stored in a
> > bool (to reduce the size of the diff) or written directly into the if
> > () expression (to reduce the size of the code). I'll defer to Eric on
> > which he thinks is better. :-)
>
> There is no shame using helpers with nice names to help understand this
> TCP stack. Even if the helper is used exactly once.
>
> In this case, it seems we test 2 different conditions, so this could use
> 2 helpers with self describing names.
>
> When I see :
>
> > +     if (((ack == tp->tlp_high_seq) &&
> > +          !(flag & (FLAG_SND_UNA_ADVANCED |
> > +                    FLAG_NOT_DUP | FLAG_DATA_SACKED))) ||
> > +         (!before(ack, tp->tlp_high_seq) && (flag & FLAG_DSACKING_ACK))) {
> >               tp->tlp_high_seq = 0;
>
> My brain hurts.
Sebastien: I suggest breaking down by ACK types for readability. e.g.,

/* This routine deals with acks during a TLP episode.
 * We mark the end of a TLP episode on receiving TLP dupack or when
 * ack is after tlp_high_seq.
 * Ref: loss detection algorithm in draft-dukkipati-tcpm-tcp-loss-probe.
 */
static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (before(ack, tp->tlp_high_seq))
                return;

        if (flag & FLAG_DSACKING_ACK) {
                /* This DSACK means original and TLP probe arrived; no loss */
                tp->tlp_high_seq = 0;
        } else if (after(ack, tp->tlp_high_seq)) {
                /* ACK advances: there was a loss, so reduce cwnd. Reset
                 * tlp_high_seq in tcp_init_cwnd_reduction()
                 */
                tcp_init_cwnd_reduction(sk);
                tcp_set_ca_state(sk, TCP_CA_CWR);
                tcp_end_cwnd_reduction(sk);
                tcp_try_keep_open(sk);
                NET_INC_STATS_BH(sock_net(sk),
                                 LINUX_MIB_TCPLOSSPROBERECOVERY);
        } else if (!(flag & (FLAG_SND_UNA_ADVANCED |
                             FLAG_NOT_DUP | FLAG_DATA_SACKED))) {
                /* Pure dupack: original and TLP probe arrived; no loss */
                tp->tlp_high_seq = 0;
        }
}

>
>

^ permalink raw reply

* Re: [PATCH net-next RFC 1/5] net-timestamp: no-payload option
From: Andy Lutomirski @ 2015-01-09 19:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David S. Miller, Richard Cochran,
	Eric Dumazet
In-Reply-To: <1420824719-28848-2-git-send-email-willemb@google.com>

On Fri, Jan 9, 2015 at 9:31 AM, Willem de Bruijn <willemb@google.com> wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
> timestamps, this loops timestamps on top of empty packets.
>
> Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
> cmsg reception (aside from timestamps) are no longer possible. This
> works together with a follow on patch that allows administrators to
> only allow tx timestamping if it does not loop payload or metadata.

If this loses IP_PKTINFO, that will be a bit unfortunate.

--Andy

>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/uapi/linux/net_tstamp.h |  3 ++-
>  net/core/skbuff.c               | 19 ++++++++++++++-----
>  net/ipv4/ip_sockglue.c          |  9 +++++----
>  net/ipv6/datagram.c             |  4 ++--
>  net/rxrpc/ar-error.c            |  5 +++++
>  5 files changed, 28 insertions(+), 12 deletions(-)
>
> diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> index edbc888..6d1abea 100644
> --- a/include/uapi/linux/net_tstamp.h
> +++ b/include/uapi/linux/net_tstamp.h
> @@ -24,8 +24,9 @@ enum {
>         SOF_TIMESTAMPING_TX_SCHED = (1<<8),
>         SOF_TIMESTAMPING_TX_ACK = (1<<9),
>         SOF_TIMESTAMPING_OPT_CMSG = (1<<10),
> +       SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
>
> -       SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_CMSG,
> +       SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TSONLY,
>         SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
>                                  SOF_TIMESTAMPING_LAST
>  };
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5a2a2e8..ece2bb8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3710,19 +3710,28 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
>                      struct sock *sk, int tstype)
>  {
>         struct sk_buff *skb;
> +       bool tsonly = sk->sk_tsflags & SOF_TIMESTAMPING_OPT_TSONLY;
>
>         if (!sk)
>                 return;
>
> -       if (hwtstamps)
> -               *skb_hwtstamps(orig_skb) = *hwtstamps;
> +       if (tsonly)
> +               skb = alloc_skb(0, GFP_ATOMIC);
>         else
> -               orig_skb->tstamp = ktime_get_real();
> -
> -       skb = skb_clone(orig_skb, GFP_ATOMIC);
> +               skb = skb_clone(orig_skb, GFP_ATOMIC);
>         if (!skb)
>                 return;
>
> +       if (tsonly) {
> +               skb_shinfo(skb)->tx_flags = skb_shinfo(orig_skb)->tx_flags;
> +               skb_shinfo(skb)->tskey = skb_shinfo(orig_skb)->tskey;
> +       }
> +
> +       if (hwtstamps)
> +               *skb_hwtstamps(skb) = *hwtstamps;
> +       else
> +               skb->tstamp = ktime_get_real();
> +
>         __skb_complete_tx_timestamp(skb, sk, tstype);
>  }
>  EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index a317797..d81ef70 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -440,7 +440,7 @@ static bool ipv4_pktinfo_prepare_errqueue(const struct sock *sk,
>
>         if ((ee_origin != SO_EE_ORIGIN_TIMESTAMPING) ||
>             (!(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_CMSG)) ||
> -           (!skb->dev))
> +           (!skb->dev) || (!skb->len))
>                 return false;
>
>         info->ipi_spec_dst.s_addr = ip_hdr(skb)->saddr;
> @@ -483,7 +483,7 @@ int ip_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
>
>         serr = SKB_EXT_ERR(skb);
>
> -       if (sin) {
> +       if (sin && skb->len) {
>                 sin->sin_family = AF_INET;
>                 sin->sin_addr.s_addr = *(__be32 *)(skb_network_header(skb) +
>                                                    serr->addr_offset);
> @@ -496,8 +496,9 @@ int ip_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
>         sin = &errhdr.offender;
>         sin->sin_family = AF_UNSPEC;
>
> -       if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
> -           ipv4_pktinfo_prepare_errqueue(sk, skb, serr->ee.ee_origin)) {
> +       if (skb->len &&
> +           (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
> +            ipv4_pktinfo_prepare_errqueue(sk, skb, serr->ee.ee_origin))) {
>                 struct inet_sock *inet = inet_sk(sk);
>
>                 sin->sin_family = AF_INET;
> diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> index 100c589..91a31ea 100644
> --- a/net/ipv6/datagram.c
> +++ b/net/ipv6/datagram.c
> @@ -369,7 +369,7 @@ int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
>
>         serr = SKB_EXT_ERR(skb);
>
> -       if (sin) {
> +       if (sin && skb->len) {
>                 const unsigned char *nh = skb_network_header(skb);
>                 sin->sin6_family = AF_INET6;
>                 sin->sin6_flowinfo = 0;
> @@ -394,7 +394,7 @@ int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
>         memcpy(&errhdr.ee, &serr->ee, sizeof(struct sock_extended_err));
>         sin = &errhdr.offender;
>         sin->sin6_family = AF_UNSPEC;
> -       if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
> +       if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL && skb->len) {
>                 sin->sin6_family = AF_INET6;
>                 sin->sin6_flowinfo = 0;
>                 sin->sin6_port = 0;
> diff --git a/net/rxrpc/ar-error.c b/net/rxrpc/ar-error.c
> index 74c0fcd..5394b6b 100644
> --- a/net/rxrpc/ar-error.c
> +++ b/net/rxrpc/ar-error.c
> @@ -42,6 +42,11 @@ void rxrpc_UDP_error_report(struct sock *sk)
>                 _leave("UDP socket errqueue empty");
>                 return;
>         }
> +       if (!skb->len) {
> +               _leave("UDP empty message");
> +               kfree_skb(skb);
> +               return;
> +       }
>
>         rxrpc_new_skb(skb);
>
> --
> 2.2.0.rc0.207.ga3a616c
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: route/max_size sysctl in ipv4
From: Cong Wang @ 2015-01-09 19:37 UTC (permalink / raw)
  To: Ani Sinha; +Cc: Eric Dumazet, David Miller, netdev@vger.kernel.org
In-Reply-To: <CAOxq_8Na14ES1souk-jBa8A=9ArHc-ffRtSd36zbvwHvGM+24Q@mail.gmail.com>

On Fri, Jan 9, 2015 at 11:08 AM, Ani Sinha <ani@arista.com> wrote:
>
> Perhaps. What I am truly confused about is :
>
> - We are keeping a sysctl interface that does absolutely nothing in
> the kernel and is completely useless in case some userland
> scripts/tools are rendered broken from it's removal.

I am all for getting rid of it, sane script should always check
the sysctl existence first. I think we did remove some sysctl's
from kernel before.

>
> - surprisingly, we contradict ourselves when we let scripts break when
> running from a child namespace because the  same sysctl is no longer
> available!
>

I am not sure how exactly your script is broken, but it should
test the existence of any /proc file before reading it, even when
you don't use netns, because this could be a new sysctl introduced
recently (probably not the case for ip_rt_max_size, but you get my point).

^ permalink raw reply

* Re: [PATCH iproute2] ip link: Fix crash on older kernels when show VF dev
From: Vadim Kochan @ 2015-01-09 19:25 UTC (permalink / raw)
  To: William Dauchy; +Cc: Vadim Kochan, netdev
In-Reply-To: <20150109190656.GA25275@angus-think.lan>

On Fri, Jan 09, 2015 at 09:06:56PM +0200, Vadim Kochan wrote:
> On Fri, Jan 09, 2015 at 06:55:57PM +0100, William Dauchy wrote:
> > On Jan09 19:25, Vadim Kochan wrote:
> > > From: Vadim Kochan <vadim4j@gmail.com>
> > > 
> > > The issue was caused that ifla_vf_rate does not exist on
> > > older kernels and should be checked if it exists as nested attr.
> > > 
> > > Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> > > Reported-by: William Dauchy <william@gandi.net>
> > > Tested-by: William Dauchy <william@gandi.com>
> > 
> > gandi.net actually ;)
> > 
> > Thanks,
> > 
> 
> Sorry, I will re-send.
> 

Uff, sent v3.

^ permalink raw reply

* [PATCH iproute2 v3] ip link: Fix crash on older kernels when show VF dev
From: Vadim Kochan @ 2015-01-09 19:24 UTC (permalink / raw)
  To: netdev; +Cc: william, Vadim Kochan

From: Vadim Kochan <vadim4j@gmail.com>

The issue was caused that ifla_vf_rate does not exist on
older kernels and should be checked if it exists as nested attr.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Reported-by: William Dauchy <william@gandi.net>
Tested-by: William Dauchy <william@gandi.net>
---
 ip/ipaddress.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 28dfe8c..830b166 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -259,11 +259,10 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 {
 	struct ifla_vf_mac *vf_mac;
 	struct ifla_vf_vlan *vf_vlan;
-	struct ifla_vf_rate *vf_rate;
 	struct ifla_vf_tx_rate *vf_tx_rate;
 	struct ifla_vf_spoofchk *vf_spoofchk;
 	struct ifla_vf_link_state *vf_linkstate;
-	struct rtattr *vf[IFLA_VF_MAX+1];
+	struct rtattr *vf[IFLA_VF_MAX + 1] = {};
 	struct rtattr *tmp;
 	SPRINT_BUF(b1);
 
@@ -277,7 +276,6 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 	vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
 	vf_vlan = RTA_DATA(vf[IFLA_VF_VLAN]);
 	vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
-	vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
 
 	/* Check if the spoof checking vf info type is supported by
 	 * this kernel.
@@ -313,10 +311,16 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 		fprintf(fp, ", qos %d", vf_vlan->qos);
 	if (vf_tx_rate->rate)
 		fprintf(fp, ", tx rate %d (Mbps)", vf_tx_rate->rate);
-	if (vf_rate->max_tx_rate)
-		fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
-	if (vf_rate->min_tx_rate)
-		fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+
+	if (vf[IFLA_VF_RATE]) {
+		struct ifla_vf_rate *vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
+
+		if (vf_rate->max_tx_rate)
+			fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
+		if (vf_rate->min_tx_rate)
+			fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+	}
+
 	if (vf_spoofchk && vf_spoofchk->setting != -1) {
 		if (vf_spoofchk->setting)
 			fprintf(fp, ", spoof checking on");
-- 
2.1.3

^ permalink raw reply related

* [PATCH iproute2 v2] ip link: Fix crash on older kernels when show VF dev
From: Vadim Kochan @ 2015-01-09 19:22 UTC (permalink / raw)
  To: netdev; +Cc: william, Vadim Kochan, Vadim Kochan

From: Vadim Kochan <vadim4j@gmail.com>

The issue was caused that ifla_vf_rate does not exist on
older kernels and should be checked if it exists as nested attr.

Signed-off-by: Vadim Kochan <vadim4j@gmail.net>
Reported-by: William Dauchy <william@gandi.net>
Tested-by: William Dauchy <william@gandi.net>
---
 ip/ipaddress.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 28dfe8c..830b166 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -259,11 +259,10 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 {
 	struct ifla_vf_mac *vf_mac;
 	struct ifla_vf_vlan *vf_vlan;
-	struct ifla_vf_rate *vf_rate;
 	struct ifla_vf_tx_rate *vf_tx_rate;
 	struct ifla_vf_spoofchk *vf_spoofchk;
 	struct ifla_vf_link_state *vf_linkstate;
-	struct rtattr *vf[IFLA_VF_MAX+1];
+	struct rtattr *vf[IFLA_VF_MAX + 1] = {};
 	struct rtattr *tmp;
 	SPRINT_BUF(b1);
 
@@ -277,7 +276,6 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 	vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
 	vf_vlan = RTA_DATA(vf[IFLA_VF_VLAN]);
 	vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
-	vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
 
 	/* Check if the spoof checking vf info type is supported by
 	 * this kernel.
@@ -313,10 +311,16 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 		fprintf(fp, ", qos %d", vf_vlan->qos);
 	if (vf_tx_rate->rate)
 		fprintf(fp, ", tx rate %d (Mbps)", vf_tx_rate->rate);
-	if (vf_rate->max_tx_rate)
-		fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
-	if (vf_rate->min_tx_rate)
-		fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+
+	if (vf[IFLA_VF_RATE]) {
+		struct ifla_vf_rate *vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
+
+		if (vf_rate->max_tx_rate)
+			fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
+		if (vf_rate->min_tx_rate)
+			fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+	}
+
 	if (vf_spoofchk && vf_spoofchk->setting != -1) {
 		if (vf_spoofchk->setting)
 			fprintf(fp, ", spoof checking on");
-- 
2.1.3

^ permalink raw reply related

* Re: [PATCH iproute2 0/3] ip netns: Run over all netns
From: Vadim Kochan @ 2015-01-09 19:17 UTC (permalink / raw)
  To: Cong Wang; +Cc: Vadim Kochan, netdev
In-Reply-To: <CAHA+R7NPrBw_tN20adU6m0A0Toggo9SO9LpVV7WSaVa51Fyfsg@mail.gmail.com>

On Fri, Jan 09, 2015 at 10:49:50AM -0800, Cong Wang wrote:
> On Wed, Jan 7, 2015 at 4:52 PM, Vadim Kochan <vadim4j@gmail.com> wrote:
> > On Wed, Jan 07, 2015 at 04:04:14PM -0800, Cong Wang wrote:
> >> On Wed, Jan 7, 2015 at 3:04 AM, Vadim Kochan <vadim4j@gmail.com> wrote:
> >> > From: Vadim Kochan <vadim4j@gmail.com>
> >> >
> >> > Allow 'ip netns del' and 'ip netns exec' run over each network namespace names.
> >> >
> >> > 'ip netns exec' executes command forcely on eacn nsname.
> >> >
> >>
> >> Why this has to be done in iproute command?
> >> That is, why not just offloading this to a shell script like below?
> >>
> >> for ns in `ip netns show`;
> >> do
> >>      ip netns exec $ns ip link show.....
> >> done
> >
> > Hm, but would not it better to have it in iproute instead of collect
> > scripts ? Scripts allows to do a lot of things, but in this case it seems like a
> > feature which related to iproute.
> 
> iproute2 should keep a minimum set of features especially when
> a one-liner shell script can do that.

BTW, this script should have additional output of netns name
before the 'ip netns exec $ns ...' invocation.

^ permalink raw reply

* Re: [PATCH iproute2] ip link: Fix crash on older kernels when show VF dev
From: Vadim Kochan @ 2015-01-09 19:06 UTC (permalink / raw)
  To: William Dauchy; +Cc: Vadim Kochan, netdev
In-Reply-To: <20150109175557.GF31810@gandi.net>

On Fri, Jan 09, 2015 at 06:55:57PM +0100, William Dauchy wrote:
> On Jan09 19:25, Vadim Kochan wrote:
> > From: Vadim Kochan <vadim4j@gmail.com>
> > 
> > The issue was caused that ifla_vf_rate does not exist on
> > older kernels and should be checked if it exists as nested attr.
> > 
> > Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> > Reported-by: William Dauchy <william@gandi.net>
> > Tested-by: William Dauchy <william@gandi.com>
> 
> gandi.net actually ;)
> 
> Thanks,
> 

Sorry, I will re-send.

^ permalink raw reply

* Re: [PATCH v2 net-next] bridge: Add ability to enable TSO
From: Pankaj Gupta @ 2015-01-09 19:10 UTC (permalink / raw)
  To: Toshiaki Makita; +Cc: David S . Miller, Stephen Hemminger, netdev, bridge
In-Reply-To: <54AF73AD.4000008@lab.ntt.co.jp>


> > 
> >>
> >> Currently a bridge device turns off TSO feature if no bridge ports
> >> support it. We can always enable it, since packets can be segmented on
> >> ports by software as well as on the bridge device.
> >> This will reduce the number of packets processed in the bridge.
> >>
> >> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> >> ---
> >> v2: Use an existing helper function.
> >>
> >>  net/bridge/br_if.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> >> index ed307db..81e49fb 100644
> >> --- a/net/bridge/br_if.c
> >> +++ b/net/bridge/br_if.c
> >> @@ -424,6 +424,7 @@ netdev_features_t br_features_recompute(struct
> >> net_bridge
> >> *br,
> >>  		features = netdev_increment_features(features,
> >>  						     p->dev->features, mask);
> >>  	}
> >> +	features = netdev_add_tso_features(features, mask);
> > 
> > Just a doubt. Are we inducing latency if source has traffic at very low
> > rate.
> > I mean by default do we need it?
> 
> Is your concern tcp_tso_should_defer() in tcp_write_xmit()?
yes.

> If so, since TSO packet is created by GSO even without this patch, this
> should not increase latency there.
> This patch just delays the point of software segmentation from the
> bridge device to its port.

I think now I got your point. 

Thanks,
Pankaj
> 
> Thanks,
> Toshiaki Makita
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-09 19:08 UTC (permalink / raw)
  To: Cong Wang; +Cc: Eric Dumazet, David Miller, netdev@vger.kernel.org
In-Reply-To: <CAHA+R7Ps2_+5NgjE8HNbrTxbjhtt4nJyK9y5HrbHez2PhZEiPQ@mail.gmail.com>

On Fri, Jan 9, 2015 at 10:47 AM, Cong Wang <cwang@twopensource.com> wrote:
> On Thu, Jan 8, 2015 at 12:13 PM, Ani Sinha <ani@arista.com> wrote:
>> On Thu, Jan 8, 2015 at 11:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>
>>> If you want to use network namespaces, you have to adapt your scripts.
>>>
>>> Nobody claimed network namespaces were totally transparent.
>>>
>>
>> I see. I am going back to an old thread here where Linus says that the
>> #1 rule is:
>>
>> ""We don't regress user space"
>>
>> https://lkml.org/lkml/2013/7/16/565
>>
>> Breaking scripts seems to me to fall into the category of regressing
>> userspace. Or may be we can treat these sysctls more softly since they
>> are not strictly speaking linux ABIs.
>
> As Eric said, it has been like this since day 0,

I beg to differ. It has not been like this for that particular sysctl
from day 0. That sysctl was available from a child namespace and now
it isn't.

why you still think
> we break something? It is you who misunderstands the interface
> not us who break your script.

Perhaps. What I am truly confused about is :

- We are keeping a sysctl interface that does absolutely nothing in
the kernel and is completely useless in case some userland
scripts/tools are rendered broken from it's removal.

- surprisingly, we contradict ourselves when we let scripts break when
running from a child namespace because the  same sysctl is no longer
available!

When the source is available for a script or tool, it's easy to change
the code to conform to the new semantics. However, for old binaries
for which we do not have any source, it's not easy or is impossible to
fix them.

I rest my case. We will of course find a way to fix our code if that
is what netdev thinks is the way to go.

^ permalink raw reply

* Re: [PATCH iproute2 0/3] ip netns: Run over all netns
From: Cong Wang @ 2015-01-09 18:49 UTC (permalink / raw)
  To: Vadim Kochan; +Cc: netdev
In-Reply-To: <20150108005212.GA10487@angus-think.lan>

On Wed, Jan 7, 2015 at 4:52 PM, Vadim Kochan <vadim4j@gmail.com> wrote:
> On Wed, Jan 07, 2015 at 04:04:14PM -0800, Cong Wang wrote:
>> On Wed, Jan 7, 2015 at 3:04 AM, Vadim Kochan <vadim4j@gmail.com> wrote:
>> > From: Vadim Kochan <vadim4j@gmail.com>
>> >
>> > Allow 'ip netns del' and 'ip netns exec' run over each network namespace names.
>> >
>> > 'ip netns exec' executes command forcely on eacn nsname.
>> >
>>
>> Why this has to be done in iproute command?
>> That is, why not just offloading this to a shell script like below?
>>
>> for ns in `ip netns show`;
>> do
>>      ip netns exec $ns ip link show.....
>> done
>
> Hm, but would not it better to have it in iproute instead of collect
> scripts ? Scripts allows to do a lot of things, but in this case it seems like a
> feature which related to iproute.

iproute2 should keep a minimum set of features especially when
a one-liner shell script can do that.

^ permalink raw reply

* Re: [PATCH] bonding: cleanup bond_opts array
From: Andy Gospodarek @ 2015-01-09 18:48 UTC (permalink / raw)
  To: Jonathan Toppins; +Cc: netdev, shm, Nikolay Aleksandrov
In-Reply-To: <1420828268-10360-1-git-send-email-jtoppins@cumulusnetworks.com>

On Fri, Jan 09, 2015 at 01:31:08PM -0500, Jonathan Toppins wrote:
> Remove the empty array element initializer and size the array with
> BOND_OPT_LAST so the compiler will complain if more elements are in
> there than should be.
> 
> An interesting unwanted side effect of this initializer is that if one
> inserts new options into the middle of the array then this initializer
> will zero out the option that equals BOND_OPT_TLB_DYNAMIC_LB+1.
> 
> Example:
> Extend the OPTS enum:
> enum {
>    ...
>    BOND_OPT_TLB_DYNAMIC_LB,
>    BOND_OPT_LACP_NEW1,
>    BOND_OPT_LAST
> };
> 
> Now insert into bond_opts array:
> static const struct bond_option bond_opts[] = {
>       ...
>       [BOND_OPT_LACP_RATE] = { .... unchanged stuff .... },
>       [BOND_OPT_LACP_NEW1] = { ... new stuff ... },
>       ...
>       [BOND_OPT_TLB_DYNAMIC_LB] = { .... unchanged stuff ....},
>       { } // MARK A
> };
> 
> Since BOND_OPT_LACP_NEW1 = BOND_OPT_TLB_DYNAMIC_LB+1, the last
> initializer (MARK A) will overwrite the contents of BOND_OPT_LACP_NEW1
> and can be easily viewed with the crash utility.
> 
> Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
> Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
> Cc: Nikolay Aleksandrov <nikolay@redhat.com>

I do not recall if there was a specific reason that Nik added this, so
presuming there was not....

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>

> ---
>  drivers/net/bonding/bond_options.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
> index 1a61cc9..9bd538d4 100644
> --- a/drivers/net/bonding/bond_options.c
> +++ b/drivers/net/bonding/bond_options.c
> @@ -186,7 +186,7 @@ static const struct bond_opt_value bond_tlb_dynamic_lb_tbl[] = {
>  	{ NULL,  -1, 0}
>  };
>  
> -static const struct bond_option bond_opts[] = {
> +static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>  	[BOND_OPT_MODE] = {
>  		.id = BOND_OPT_MODE,
>  		.name = "mode",
> @@ -379,8 +379,7 @@ static const struct bond_option bond_opts[] = {
>  		.values = bond_tlb_dynamic_lb_tbl,
>  		.flags = BOND_OPTFLAG_IFDOWN,
>  		.set = bond_option_tlb_dynamic_lb_set,
> -	},
> -	{ }
> +	}
>  };
>  
>  /* Searches for an option by name */
> -- 
> 1.7.10.4
> 

^ permalink raw reply

* Re: Clarification regarding IFLA_BRPORT_LEARNING_SYNC and aging of fdb entries learnt via br_fdb_external_learn_add()
From: Scott Feldman @ 2015-01-09 18:47 UTC (permalink / raw)
  To: Arad, Ronen; +Cc: Jiri Pirko, Siva Mannem, Netdev
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DDA643@ORSMSX101.amr.corp.intel.com>

On Fri, Jan 9, 2015 at 8:15 AM, Arad, Ronen <ronen.arad@intel.com> wrote:
>
>
>>-----Original Message-----
>>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>>Behalf Of Scott Feldman
>>Sent: Friday, January 09, 2015 3:47 AM
>>To: Jiri Pirko
>>Cc: Siva Mannem; Netdev
>>Subject: Re: Clarification regarding IFLA_BRPORT_LEARNING_SYNC and aging of
>>fdb entries learnt via br_fdb_external_learn_add()
>>
>>On Wed, Jan 7, 2015 at 4:53 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Tue, Dec 30, 2014 at 07:20:21PM CET, siva.mannem.lnx@gmail.com wrote:
>>>>Hi,
>>>>
>>>>I am trying to understand the ongoing switch device offload effort and
>>>>am following the discussions. I have a question regarding
>>>>IFLA_BRPORT_LEARNING_SYNC flag and how aging happens when this flag is
>>>>enabled on a port that is attached to a bridge that has vlan filtering
>>>>enabled.
>>>>
>>>>If I understand correctly, when  IFLA_BRPORT_LEARNING_SYNC is set on a
>>>>bridge port, fdb entries that are learnt externally(may be learnt by
>>>>hardware and driver is notified) are synced to bridges fdb using
>>>>br_fdb_external_learn_add(). The fdb
>>>>entries(fdb->added_by_external_learn set to true) that are learnt via
>>>>this method are also deleted by the aging logic after the aging time
>>>>even though L2 data forwadring  happens in hardware.
>>
>>This is correct...
>>
>>>> Is there a way
>>>>where aging can be disabled for these entries? and let the entries be
>>>>removed only via br_fdb_external_learn_delete()? or am I missing
>>>>something?
>>>
>>> Currently extenaly learned fdb entries are indeed removed during aging
>>> cleanup. I believe that br_fdb_cleanup should check added_by_external_learn
>>> and not remove that fdbs. What do you think Scott?
>>
>>Something like that would work, if we added another brport flag to
>>control that.  With the current arrangement, using bridge for aging
>>out entries, we want br_fdb_cleanup removing the external_learned
>>fdbs, but if there was another brport flag we could fine tune that.
>>Say new flag is IFLA_BRPORT_AGING_OFFLOAD or something like that.  I'm
>>not sure how aging settings for the bridge are pushed down to offload
>>hw, or if there is a different set for hw.
>>
>>But, isn't it nice to let Linux bridge control aging?  That way,
>>bridge -s fdb dump shows nice statistics on fdb entries.  Hardware
>>isn't involved in the aging processes (other than being told to remove
>>an entry).  Simple hardware equals simple driver.  Linux remains
>>control point.
>>
> It is indeed simpler. However, if the overhead of reading hit bits from the HW
> and updating freshness of entries using br_fdb_external_learn_add() is too
> expensive, it would force such platforms to disable learning sync altogether.
> Therefore, I believe aging offload flag (could be sufficient at bridge level)
> and external aging interval (possibly longer than the software aging interval)
> will encourage drivers to use leaning sync.
>>-scott

I'm not sure I follow that last part.

Can we list out the use-cases to see what's missing?

Case 1: bridge ages out learned_sync'ed entries

bridge port learning: off
offload port learning: on
offload port learning_sync: on

Driver calls br_fdb_external_learn_add() periodically to refresh
bridge fdb entry
to keep it from going stale.
Bridge removes aged out fdb entries (and indirectly tells offload
device to do the same).

Case 2: offload device/bridge age out entries independently

bridge port learning: on
offload port learning: on
offload port learning_sync: off

Bridge ages out its stale learned entries, independent of offload device.
Offload device ages out its stale learned entries, independent of bridge.

Case 3: ?

Please help me finish the use-case list so we can see what's missing.

-scott

^ permalink raw reply

* Re: route/max_size sysctl in ipv4
From: Cong Wang @ 2015-01-09 18:47 UTC (permalink / raw)
  To: Ani Sinha; +Cc: Eric Dumazet, David Miller, netdev@vger.kernel.org
In-Reply-To: <CAOxq_8M=Ros56U_VbUVumjS226jHs1DK0A-K9aRksTMu88H_sg@mail.gmail.com>

On Thu, Jan 8, 2015 at 12:13 PM, Ani Sinha <ani@arista.com> wrote:
> On Thu, Jan 8, 2015 at 11:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> If you want to use network namespaces, you have to adapt your scripts.
>>
>> Nobody claimed network namespaces were totally transparent.
>>
>
> I see. I am going back to an old thread here where Linus says that the
> #1 rule is:
>
> ""We don't regress user space"
>
> https://lkml.org/lkml/2013/7/16/565
>
> Breaking scripts seems to me to fall into the category of regressing
> userspace. Or may be we can treat these sysctls more softly since they
> are not strictly speaking linux ABIs.

As Eric said, it has been like this since day 0, why you still think
we break something? It is you who misunderstands the interface
not us who break your script.

^ permalink raw reply

* [PATCH RFC net-next] ip_tunnel: create percpu gro_cell
From: Martin KaFai Lau @ 2015-01-09 18:42 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

In the ipip tunnel, the skb->queue_mapping is lost in ipip_rcv().
All skb will be queued to the same cell->napi_skbs.  The
gro_cell_poll is pinned to one core under load. In production traffic,
we also see severe rx_dropped in the tunl iface and it is probably due to
this limit: skb_queue_len(&cell->napi_skbs) > netdev_max_backlog

This patch is trying to alloc_percpu(struct gro_cell) and schedule
gro_cell_poll to process it in the same core.

Setup:
VIP_PREFIX=9.9.9.9/32
REMOTE_REAL_IP=10.228.95.75

if [ "$1" = "encap" ]
then
    sudo ip tunnel add mode ipip remote ${REMOTE_REAL_IP}
    sudo ip link set dev ipip0 up
    sudo ip route add dev ipip0 ${VIP_PREFIX}
else
    # Decapsulating host

    sudo ip tunnel add mode ipip
    sudo ip link set dev tunl0 up
    sudo ip addr add dev lo ${VIP_PREFIX}
    sudo sysctl -a  | grep '\.rp_filter' | awk '{print $1;}' | \
        xargs -n1 -I{} sudo sysctl {}=0
fi

Before:
[root@DECAP ~]# netserver -p 8888
[root@ENCAP ~]# super_netperf 200 -t TCP_RR -H 9.9.9.9 -p 8888 \
-l 30 -- -d 0x6 -m 8k,64k -s 1M -S 1M
332215
[root@DECAP ~]# perf probe -a gro_cell_poll
[root@DECAP ~]# perf stat -I 1000 -a -A -e probe:gro_cell_poll
   117.258518273 CPU0                     0      probe:gro_cell_poll
   117.258518273 CPU1                     0      probe:gro_cell_poll
   117.258518273 CPU2                     0      probe:gro_cell_poll
   117.258518273 CPU3                     0      probe:gro_cell_poll
   117.258518273 CPU4                     0      probe:gro_cell_poll
   117.258518273 CPU5                     0      probe:gro_cell_poll
   117.258518273 CPU6                     0      probe:gro_cell_poll
   117.258518273 CPU7                     0      probe:gro_cell_poll
   117.258518273 CPU8                     0      probe:gro_cell_poll
   117.258518273 CPU9                     0      probe:gro_cell_poll
   117.258518273 CPU10                    0      probe:gro_cell_poll
   117.258518273 CPU11                    0      probe:gro_cell_poll
   117.258518273 CPU12                    0      probe:gro_cell_poll
   117.258518273 CPU13                    0      probe:gro_cell_poll
   117.258518273 CPU14                    0      probe:gro_cell_poll
   117.258518273 CPU15                4,882      probe:gro_cell_poll
   117.258518273 CPU16                    0      probe:gro_cell_poll
   117.258518273 CPU17                    0      probe:gro_cell_poll
   117.258518273 CPU18                    0      probe:gro_cell_poll
   117.258518273 CPU19                    0      probe:gro_cell_poll
   117.258518273 CPU20                    0      probe:gro_cell_poll
   117.258518273 CPU21                    0      probe:gro_cell_poll
   117.258518273 CPU22                    0      probe:gro_cell_poll
   117.258518273 CPU23                    0      probe:gro_cell_poll
   117.258518273 CPU24                    0      probe:gro_cell_poll
   117.258518273 CPU25                    0      probe:gro_cell_poll
   117.258518273 CPU26                    0      probe:gro_cell_poll
   117.258518273 CPU27                    0      probe:gro_cell_poll
   117.258518273 CPU28                    0      probe:gro_cell_poll
   117.258518273 CPU29                    0      probe:gro_cell_poll
   117.258518273 CPU30                    0      probe:gro_cell_poll
   117.258518273 CPU31                    0      probe:gro_cell_poll
   117.258518273 CPU32                    0      probe:gro_cell_poll
   117.258518273 CPU33                    0      probe:gro_cell_poll
   117.258518273 CPU34                    0      probe:gro_cell_poll
   117.258518273 CPU35                    0      probe:gro_cell_poll
   117.258518273 CPU36                    0      probe:gro_cell_poll
   117.258518273 CPU37                    0      probe:gro_cell_poll
   117.258518273 CPU38                    0      probe:gro_cell_poll
   117.258518273 CPU39                    0      probe:gro_cell_poll

After:
[root@DECAP ~]# netserver -p 8888
[root@ENCAP ~]# super_netperf 200 -t TCP_RR -H 9.9.9.9 -p 8888 \
-l 30 -- -d 0x6 -m 8k,64k -s 1M -S 1M
877530
[root@DECAP ~]# perf probe -a gro_cell_poll
[root@DECAP ~]# perf stat -I 1000 -a -A -e probe:gro_cell_poll
    40.085714389 CPU0                13,607      probe:gro_cell_poll
    40.085714389 CPU1                13,188      probe:gro_cell_poll
    40.085714389 CPU2                12,913      probe:gro_cell_poll
    40.085714389 CPU3                12,790      probe:gro_cell_poll
    40.085714389 CPU4                13,395      probe:gro_cell_poll
    40.085714389 CPU5                13,121      probe:gro_cell_poll
    40.085714389 CPU6                11,083      probe:gro_cell_poll
    40.085714389 CPU7                12,945      probe:gro_cell_poll
    40.085714389 CPU8                13,704      probe:gro_cell_poll
    40.085714389 CPU9                13,514      probe:gro_cell_poll
    40.085714389 CPU10                    0      probe:gro_cell_poll
    40.085714389 CPU11                    0      probe:gro_cell_poll
    40.085714389 CPU12                    0      probe:gro_cell_poll
    40.085714389 CPU13                    0      probe:gro_cell_poll
    40.085714389 CPU14                    0      probe:gro_cell_poll
    40.085714389 CPU15                    0      probe:gro_cell_poll
    40.085714389 CPU16                    0      probe:gro_cell_poll
    40.085714389 CPU17                    0      probe:gro_cell_poll
    40.085714389 CPU18                    0      probe:gro_cell_poll
    40.085714389 CPU19                    0      probe:gro_cell_poll
    40.085714389 CPU20               10,402      probe:gro_cell_poll
    40.085714389 CPU21               12,312      probe:gro_cell_poll
    40.085714389 CPU22               11,913      probe:gro_cell_poll
    40.085714389 CPU23               12,964      probe:gro_cell_poll
    40.085714389 CPU24               13,727      probe:gro_cell_poll
    40.085714389 CPU25               12,943      probe:gro_cell_poll
    40.085714389 CPU26               13,558      probe:gro_cell_poll
    40.085714389 CPU27               12,676      probe:gro_cell_poll
    40.085714389 CPU28               13,754      probe:gro_cell_poll
    40.085714389 CPU29               13,379      probe:gro_cell_poll
    40.085714389 CPU30                    0      probe:gro_cell_poll
    40.085714389 CPU31                    0      probe:gro_cell_poll
    40.085714389 CPU32                    0      probe:gro_cell_poll
    40.085714389 CPU33                    0      probe:gro_cell_poll
    40.085714389 CPU34                    0      probe:gro_cell_poll
    40.085714389 CPU35                    0      probe:gro_cell_poll
    40.085714389 CPU36                    0      probe:gro_cell_poll
    40.085714389 CPU37                    0      probe:gro_cell_poll
    40.085714389 CPU38                    0      probe:gro_cell_poll
    40.085714389 CPU39                    0      probe:gro_cell_poll

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/gro_cells.h | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
index 734d9b5..cdac448 100644
--- a/include/net/gro_cells.h
+++ b/include/net/gro_cells.h
@@ -11,22 +11,20 @@ struct gro_cell {
 } ____cacheline_aligned_in_smp;
 
 struct gro_cells {
-	unsigned int		gro_cells_mask;
-	struct gro_cell		*cells;
+	struct gro_cell __percpu	*cells;
 };
 
 static inline void gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
 {
-	struct gro_cell *cell = gcells->cells;
+	struct gro_cell *cell;
 	struct net_device *dev = skb->dev;
 
-	if (!cell || skb_cloned(skb) || !(dev->features & NETIF_F_GRO)) {
+	if (!gcells->cells || skb_cloned(skb) || !(dev->features & NETIF_F_GRO)) {
 		netif_rx(skb);
 		return;
 	}
 
-	if (skb_rx_queue_recorded(skb))
-		cell += skb_get_rx_queue(skb) & gcells->gro_cells_mask;
+	cell = this_cpu_ptr(gcells->cells);
 
 	if (skb_queue_len(&cell->napi_skbs) > netdev_max_backlog) {
 		atomic_long_inc(&dev->rx_dropped);
@@ -72,15 +70,12 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
 {
 	int i;
 
-	gcells->gro_cells_mask = roundup_pow_of_two(netif_get_num_default_rss_queues()) - 1;
-	gcells->cells = kcalloc(gcells->gro_cells_mask + 1,
-				sizeof(struct gro_cell),
-				GFP_KERNEL);
+	gcells->cells = alloc_percpu(struct gro_cell);
 	if (!gcells->cells)
 		return -ENOMEM;
 
-	for (i = 0; i <= gcells->gro_cells_mask; i++) {
-		struct gro_cell *cell = gcells->cells + i;
+	for_each_possible_cpu(i) {
+		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
 
 		skb_queue_head_init(&cell->napi_skbs);
 		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
@@ -91,16 +86,16 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
 
 static inline void gro_cells_destroy(struct gro_cells *gcells)
 {
-	struct gro_cell *cell = gcells->cells;
 	int i;
 
-	if (!cell)
+	if (!gcells->cells)
 		return;
-	for (i = 0; i <= gcells->gro_cells_mask; i++,cell++) {
+	for_each_possible_cpu(i) {
+		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
 		netif_napi_del(&cell->napi);
 		skb_queue_purge(&cell->napi_skbs);
 	}
-	kfree(gcells->cells);
+	free_percpu(gcells->cells);
 	gcells->cells = NULL;
 }
 
-- 
1.8.1

^ permalink raw reply related

* Re: [PATCH RFC] pci: Control whether VFs are probed on pci_enable_sriov
From: Don Dutile @ 2015-01-09 18:38 UTC (permalink / raw)
  To: Bjorn Helgaas, Eli Cohen
  Cc: davem, linux-pci, netdev, ogerlitz, yevgenyp, Eli Cohen
In-Reply-To: <20150109182546.GG6575@google.com>

On 01/09/2015 01:25 PM, Bjorn Helgaas wrote:
> On Sun, Dec 07, 2014 at 03:08:13PM +0200, Eli Cohen wrote:
>> Sometimes it is not desirable to probe the virtual fuctions right away,
>> but rather leave the decision to the host's administrator.
>>
>> This can save host side resource usage by VF instances which would be
>> eventually probed to VMs.
>>
>> Use a parameter to pci_enable_sriov to control that policy, and modify
>> all current callers such that they retain the same functionality.
>>
>> Use a one shot flag on struct pci_device which is cleared after the
>> first probe is ignored so subsequent attempts go through.
>>
>> Cc: Donald Dutile <ddutile@redhat.com>
>> Signed-off-by: Eli Cohen <eli@mellanox.com>
>
> Seems like we never really reached a consensus here.  Please repost if you
> want to continue down this path.
>
> Bjorn
>
hmmm, seems like I have to increase the power on my nack-phaser.
-dd

>> ---
>> This approach is used by the mlx5 driver SRIOV implementation, so
>> sending this to get feedback from the PCI and networking folks.
>>
>>   drivers/misc/genwqe/card_base.c                      |  2 +-
>>   drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c    |  2 +-
>>   drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c      |  2 +-
>>   drivers/net/ethernet/cisco/enic/enic_main.c          |  2 +-
>>   drivers/net/ethernet/emulex/benet/be_main.c          |  2 +-
>>   drivers/net/ethernet/intel/fm10k/fm10k_iov.c         |  2 +-
>>   drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c   |  2 +-
>>   drivers/net/ethernet/intel/igb/igb_main.c            |  2 +-
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c       |  4 ++--
>>   drivers/net/ethernet/mellanox/mlx4/main.c            |  2 +-
>>   drivers/net/ethernet/neterion/vxge/vxge-main.c       |  2 +-
>>   drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c |  2 +-
>>   drivers/net/ethernet/sfc/siena_sriov.c               |  2 +-
>>   drivers/pci/iov.c                                    | 12 +++++++-----
>>   drivers/pci/pci-driver.c                             | 11 ++++++++---
>>   drivers/scsi/lpfc/lpfc_init.c                        |  2 +-
>>   include/linux/pci.h                                  |  5 +++--
>>   17 files changed, 33 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/misc/genwqe/card_base.c b/drivers/misc/genwqe/card_base.c
>> index 4cf8f82cfca2..69253ca17506 100644
>> --- a/drivers/misc/genwqe/card_base.c
>> +++ b/drivers/misc/genwqe/card_base.c
>> @@ -1325,7 +1325,7 @@ static int genwqe_sriov_configure(struct pci_dev *dev, int numvfs)
>>
>>   	if (numvfs > 0) {
>>   		genwqe_setup_vf_jtimer(cd);
>> -		rc = pci_enable_sriov(dev, numvfs);
>> +		rc = pci_enable_sriov(dev, numvfs, 1);
>>   		if (rc < 0)
>>   			return rc;
>>   		return numvfs;
>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
>> index c88b20af87df..773b20224a47 100644
>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
>> @@ -2570,7 +2570,7 @@ int bnx2x_enable_sriov(struct bnx2x *bp)
>>   	if (rc)
>>   		return rc;
>>
>> -	rc = pci_enable_sriov(bp->pdev, req_vfs);
>> +	rc = pci_enable_sriov(bp->pdev, req_vfs, 1);
>>   	if (rc) {
>>   		BNX2X_ERR("pci_enable_sriov failed with %d\n", rc);
>>   		return rc;
>> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> index 3aea82bb9039..6e8afbfd3eba 100644
>> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> @@ -6597,7 +6597,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>   sriov:
>>   #ifdef CONFIG_PCI_IOV
>>   	if (func < ARRAY_SIZE(num_vf) && num_vf[func] > 0)
>> -		if (pci_enable_sriov(pdev, num_vf[func]) == 0)
>> +		if (pci_enable_sriov(pdev, num_vf[func], 1) == 0)
>>   			dev_info(&pdev->dev,
>>   				 "instantiated %u virtual functions\n",
>>   				 num_vf[func]);
>> diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
>> index 86ee350e57f0..8a8b1d86f18a 100644
>> --- a/drivers/net/ethernet/cisco/enic/enic_main.c
>> +++ b/drivers/net/ethernet/cisco/enic/enic_main.c
>> @@ -2421,7 +2421,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>   		pci_read_config_word(pdev, pos + PCI_SRIOV_TOTAL_VF,
>>   			&enic->num_vfs);
>>   		if (enic->num_vfs) {
>> -			err = pci_enable_sriov(pdev, enic->num_vfs);
>> +			err = pci_enable_sriov(pdev, enic->num_vfs, 1);
>>   			if (err) {
>>   				dev_err(dev, "SRIOV enable failed, aborting."
>>   					" pci_enable_sriov() returned %d\n",
>> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
>> index dc77ec2bdafd..a96491777ac4 100644
>> --- a/drivers/net/ethernet/emulex/benet/be_main.c
>> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
>> @@ -3274,7 +3274,7 @@ static int be_vf_setup(struct be_adapter *adapter)
>>   	}
>>
>>   	if (!old_vfs) {
>> -		status = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
>> +		status = pci_enable_sriov(adapter->pdev, adapter->num_vfs, 1);
>>   		if (status) {
>>   			dev_err(dev, "SRIOV enable failed\n");
>>   			adapter->num_vfs = 0;
>> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
>> index 060190864238..04a3dc5acc28 100644
>> --- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
>> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
>> @@ -408,7 +408,7 @@ int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs)
>>   		 */
>>   		fm10k_disable_aer_comp_abort(pdev);
>>
>> -		err = pci_enable_sriov(pdev, num_vfs);
>> +		err = pci_enable_sriov(pdev, num_vfs, 1);
>>   		if (err) {
>>   			dev_err(&pdev->dev,
>>   				"Enable PCI SR-IOV failed: %d\n", err);
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> index 668d860275d6..fe56e09725f2 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> @@ -852,7 +852,7 @@ int i40e_alloc_vfs(struct i40e_pf *pf, u16 num_alloc_vfs)
>>
>>   	/* Check to see if we're just allocating resources for extant VFs */
>>   	if (pci_num_vf(pf->pdev) != num_alloc_vfs) {
>> -		ret = pci_enable_sriov(pf->pdev, num_alloc_vfs);
>> +		ret = pci_enable_sriov(pf->pdev, num_alloc_vfs, 1);
>>   		if (ret) {
>>   			dev_err(&pf->pdev->dev,
>>   				"Failed to enable SR-IOV, error %d.\n", ret);
>> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
>> index 3c0221620c9d..da01326ef550 100644
>> --- a/drivers/net/ethernet/intel/igb/igb_main.c
>> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
>> @@ -2742,7 +2742,7 @@ static int igb_enable_sriov(struct pci_dev *pdev, int num_vfs)
>>
>>   	/* only call pci_enable_sriov() if no VFs are allocated already */
>>   	if (!old_vfs) {
>> -		err = pci_enable_sriov(pdev, adapter->vfs_allocated_count);
>> +		err = pci_enable_sriov(pdev, adapter->vfs_allocated_count, 1);
>>   		if (err)
>>   			goto err_out;
>>   	}
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
>> index 04eee7c7b653..74b33483a0d1 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
>> @@ -149,7 +149,7 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
>>   		 */
>>   		adapter->num_vfs = min_t(unsigned int, adapter->num_vfs, IXGBE_MAX_VFS_DRV_LIMIT);
>>
>> -		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
>> +		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs, 1);
>>   		if (err) {
>>   			e_err(probe, "Failed to enable PCI sriov: %d\n", err);
>>   			adapter->num_vfs = 0;
>> @@ -270,7 +270,7 @@ static int ixgbe_pci_sriov_enable(struct pci_dev *dev, int num_vfs)
>>   	for (i = 0; i < adapter->num_vfs; i++)
>>   		ixgbe_vf_configuration(dev, (i | 0x10000000));
>>
>> -	err = pci_enable_sriov(dev, num_vfs);
>> +	err = pci_enable_sriov(dev, num_vfs, 1);
>>   	if (err) {
>>   		e_dev_warn("Failed to enable PCI sriov: %d\n", err);
>>   		return err;
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>> index 3044f9e623cb..ae38b556ec13 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>> @@ -2350,7 +2350,7 @@ static u64 mlx4_enable_sriov(struct mlx4_dev *dev, struct pci_dev *pdev,
>>   					 existing_vfs, total_vfs);
>>   		} else {
>>   			mlx4_warn(dev, "Enabling SR-IOV with %d VFs\n", total_vfs);
>> -			err = pci_enable_sriov(pdev, total_vfs);
>> +			err = pci_enable_sriov(pdev, total_vfs, 1);
>>   		}
>>   		if (err) {
>>   			mlx4_err(dev, "Failed to enable SR-IOV, continuing without SR-IOV (err = %d)\n",
>> diff --git a/drivers/net/ethernet/neterion/vxge/vxge-main.c b/drivers/net/ethernet/neterion/vxge/vxge-main.c
>> index cc0485e3c621..c341e73fc68c 100644
>> --- a/drivers/net/ethernet/neterion/vxge/vxge-main.c
>> +++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c
>> @@ -4495,7 +4495,7 @@ vxge_probe(struct pci_dev *pdev, const struct pci_device_id *pre)
>>   	/* Enable SRIOV mode, if firmware has SRIOV support and if it is a PF */
>>   	if (is_sriov(function_mode) && !is_sriov_initialized(pdev) &&
>>   	   (ll_config->intr_type != INTA)) {
>> -		ret = pci_enable_sriov(pdev, num_vfs);
>> +		ret = pci_enable_sriov(pdev, num_vfs, 1);
>>   		if (ret)
>>   			vxge_debug_ll_config(VXGE_ERR,
>>   				"Failed in enabling SRIOV mode: %d\n", ret);
>> diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
>> index a29538b86edf..b483705a1ef1 100644
>> --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
>> +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
>> @@ -570,7 +570,7 @@ static int qlcnic_sriov_pf_enable(struct qlcnic_adapter *adapter, int num_vfs)
>>   	if (!qlcnic_sriov_enable_check(adapter))
>>   		return 0;
>>
>> -	err = pci_enable_sriov(adapter->pdev, num_vfs);
>> +	err = pci_enable_sriov(adapter->pdev, num_vfs, 1);
>>   	if (err)
>>   		qlcnic_sriov_pf_cleanup(adapter);
>>
>> diff --git a/drivers/net/ethernet/sfc/siena_sriov.c b/drivers/net/ethernet/sfc/siena_sriov.c
>> index a8bbbad68a88..6804ed04cfcd 100644
>> --- a/drivers/net/ethernet/sfc/siena_sriov.c
>> +++ b/drivers/net/ethernet/sfc/siena_sriov.c
>> @@ -1332,7 +1332,7 @@ int efx_siena_sriov_init(struct efx_nic *efx)
>>
>>   	/* At this point we must be ready to accept VFDI requests */
>>
>> -	rc = pci_enable_sriov(efx->pci_dev, efx->vf_count);
>> +	rc = pci_enable_sriov(efx->pci_dev, efx->vf_count, 1);
>>   	if (rc)
>>   		goto fail_pci;
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index 4d109c07294a..f6aba5beea78 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -57,7 +57,7 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
>>   		pci_remove_bus(virtbus);
>>   }
>>
>> -static int virtfn_add(struct pci_dev *dev, int id, int reset)
>> +static int virtfn_add(struct pci_dev *dev, int id, int reset, int probe)
>>   {
>>   	int i;
>>   	int rc = -ENOMEM;
>> @@ -85,6 +85,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>>   	virtfn->physfn = pci_dev_get(dev);
>>   	virtfn->is_virtfn = 1;
>>   	virtfn->multifunction = 0;
>> +	virtfn->probe_vf = probe;
>>
>>   	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>>   		res = dev->resource + PCI_IOV_RESOURCES + i;
>> @@ -170,7 +171,7 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>>   	pci_dev_put(dev);
>>   }
>>
>> -static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>> +static int sriov_enable(struct pci_dev *dev, int nr_virtfn, int probe_vfs)
>>   {
>>   	int rc;
>>   	int i, j;
>> @@ -255,7 +256,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>   		initial = nr_virtfn;
>>
>>   	for (i = 0; i < initial; i++) {
>> -		rc = virtfn_add(dev, i, 0);
>> +		rc = virtfn_add(dev, i, 0, probe_vfs);
>>   		if (rc)
>>   			goto failed;
>>   	}
>> @@ -558,17 +559,18 @@ int pci_iov_bus_range(struct pci_bus *bus)
>>    * pci_enable_sriov - enable the SR-IOV capability
>>    * @dev: the PCI device
>>    * @nr_virtfn: number of virtual functions to enable
>> + * @probe_vfs: in zero, don't probe new VFs, otherwise probe if suitable driver available
>>    *
>>    * Returns 0 on success, or negative on failure.
>>    */
>> -int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>> +int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int probe_vfs)
>>   {
>>   	might_sleep();
>>
>>   	if (!dev->is_physfn)
>>   		return -ENOSYS;
>>
>> -	return sriov_enable(dev, nr_virtfn);
>> +	return sriov_enable(dev, nr_virtfn, probe_vfs);
>>   }
>>   EXPORT_SYMBOL_GPL(pci_enable_sriov);
>>
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index 2b3c89425bb5..d5b93339b8a4 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -397,9 +397,14 @@ static int pci_device_probe(struct device *dev)
>>   	drv = to_pci_driver(dev->driver);
>>   	pci_dev = to_pci_dev(dev);
>>   	pci_dev_get(pci_dev);
>> -	error = __pci_device_probe(drv, pci_dev);
>> -	if (error)
>> -		pci_dev_put(pci_dev);
>> +	if (!pci_dev->is_virtfn || pci_dev->probe_vf) {
>> +		error = __pci_device_probe(drv, pci_dev);
>> +		if (error)
>> +			pci_dev_put(pci_dev);
>> +	}
>> +	/* one shot blocking of probe */
>> +	if (pci_dev->is_virtfn && !pci_dev->probe_vf)
>> +		pci_dev->probe_vf = 1;
>>
>>   	return error;
>>   }
>> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
>> index 0b2c53af85c7..2f81f471b8f3 100644
>> --- a/drivers/scsi/lpfc/lpfc_init.c
>> +++ b/drivers/scsi/lpfc/lpfc_init.c
>> @@ -4797,7 +4797,7 @@ lpfc_sli_probe_sriov_nr_virtfn(struct lpfc_hba *phba, int nr_vfn)
>>   		return -EINVAL;
>>   	}
>>
>> -	rc = pci_enable_sriov(pdev, nr_vfn);
>> +	rc = pci_enable_sriov(pdev, nr_vfn, 1);
>>   	if (rc) {
>>   		lpfc_printf_log(phba, KERN_WARNING, LOG_INIT,
>>   				"2806 Failed to enable sriov on this device "
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 4c8ac5fcc224..beb2640ba18d 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -373,6 +373,7 @@ struct pci_dev {
>>   	phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */
>>   	size_t romlen; /* Length of ROM if it's not from the BAR */
>>   	char *driver_override; /* Driver name to force a match */
>> +	int probe_vf; /* probe this device */
>>   };
>>
>>   static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
>> @@ -1655,14 +1656,14 @@ int pci_ext_cfg_avail(void);
>>   void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
>>
>>   #ifdef CONFIG_PCI_IOV
>> -int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>> +int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int probe_vfs);
>>   void pci_disable_sriov(struct pci_dev *dev);
>>   int pci_num_vf(struct pci_dev *dev);
>>   int pci_vfs_assigned(struct pci_dev *dev);
>>   int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>>   int pci_sriov_get_totalvfs(struct pci_dev *dev);
>>   #else
>> -static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>> +static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int nr_virt_probe)
>>   { return -ENODEV; }
>>   static inline void pci_disable_sriov(struct pci_dev *dev) { }
>>   static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
>> --
>> 2.1.3
>>

^ permalink raw reply

* [PATCH] bonding: cleanup bond_opts array
From: Jonathan Toppins @ 2015-01-09 18:31 UTC (permalink / raw)
  To: netdev; +Cc: shm, Andy Gospodarek, Nikolay Aleksandrov

Remove the empty array element initializer and size the array with
BOND_OPT_LAST so the compiler will complain if more elements are in
there than should be.

An interesting unwanted side effect of this initializer is that if one
inserts new options into the middle of the array then this initializer
will zero out the option that equals BOND_OPT_TLB_DYNAMIC_LB+1.

Example:
Extend the OPTS enum:
enum {
   ...
   BOND_OPT_TLB_DYNAMIC_LB,
   BOND_OPT_LACP_NEW1,
   BOND_OPT_LAST
};

Now insert into bond_opts array:
static const struct bond_option bond_opts[] = {
      ...
      [BOND_OPT_LACP_RATE] = { .... unchanged stuff .... },
      [BOND_OPT_LACP_NEW1] = { ... new stuff ... },
      ...
      [BOND_OPT_TLB_DYNAMIC_LB] = { .... unchanged stuff ....},
      { } // MARK A
};

Since BOND_OPT_LACP_NEW1 = BOND_OPT_TLB_DYNAMIC_LB+1, the last
initializer (MARK A) will overwrite the contents of BOND_OPT_LACP_NEW1
and can be easily viewed with the crash utility.

Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
---
 drivers/net/bonding/bond_options.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 1a61cc9..9bd538d4 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -186,7 +186,7 @@ static const struct bond_opt_value bond_tlb_dynamic_lb_tbl[] = {
 	{ NULL,  -1, 0}
 };

-static const struct bond_option bond_opts[] = {
+static const struct bond_option bond_opts[BOND_OPT_LAST] = {
 	[BOND_OPT_MODE] = {
 		.id = BOND_OPT_MODE,
 		.name = "mode",
@@ -379,8 +379,7 @@ static const struct bond_option bond_opts[] = {
 		.values = bond_tlb_dynamic_lb_tbl,
 		.flags = BOND_OPTFLAG_IFDOWN,
 		.set = bond_option_tlb_dynamic_lb_set,
-	},
-	{ }
+	}
 };

 /* Searches for an option by name */
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 1/1] update ip-sysctl.txt documentation (v2)
From: Ani Sinha @ 2015-01-09 18:34 UTC (permalink / raw)
  To: Jonathan Corbet, David Miller, Eric Dumazet, linux-doc,
	linux-kernel, Ani Sinha, Pádraig Brady,
	netdev@vger.kernel.org, fruggeri
In-Reply-To: <1420674356-32210-1-git-send-email-ani@arista.com>

On Wed, Jan 7, 2015 at 3:45 PM, Ani Sinha <ani@arista.com> wrote:
> Update documentation to reflect the fact that
> /proc/sys/net/ipv4/route/max_size is no longer used for ipv4.

Any more feedback on this?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox