Netdev List

* Re: [PATCH 1/1] Revert "rds: ib: add error handle"
From: Santosh Shilimkar @ 2018-04-24 16:58 UTC (permalink / raw)
  To: Dag Moxnes, Håkon Bugge
  Cc: Zhu Yanjun, OFED mailing list, rds-devel, davem, netdev
In-Reply-To: <373de57c-4cab-0d04-0021-57b566cefe0d@oracle.com>

On 4/24/2018 4:25 AM, Dag Moxnes wrote:
> I was going to suggest the following correction:
> 
> 
> If all agree that this is the correct way of doing it, I can go ahead 
> and post it.
> 
Yes please. Go ahead and post your fix.

Regards,
Santosh
P.S: Avoid top posting please.

* Re: [PATCH bpf-next 13/15] xsk: support for Tx
From: Willem de Bruijn @ 2018-04-24 16:57 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180423135619.7179-14-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Here, Tx support is added. The user fills the Tx queue with frames to
> be sent by the kernel, and lets the kernel know using the sendmsg
> syscall.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

> +static int xsk_xmit_skb(struct sk_buff *skb)

This is basically packet_direct_xmit. Might be better to just move that
to net/core/dev.c and use it in both AF_PACKET and AF_XDP.

Also, (eventually) AF_XDP may also want to support the regular path
through dev_queue_xmit to go through traffic shaping.

> +{
> +       struct net_device *dev = skb->dev;
> +       struct sk_buff *orig_skb = skb;
> +       struct netdev_queue *txq;
> +       int ret = NETDEV_TX_BUSY;
> +       bool again = false;
> +
> +       if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
> +               goto drop;
> +
> +       skb = validate_xmit_skb_list(skb, dev, &again);
> +       if (skb != orig_skb)
> +               return NET_XMIT_DROP;

Need to free generated segment list on error, see packet_direct_xmit.

> +
> +       txq = skb_get_tx_queue(dev, skb);
> +
> +       local_bh_disable();
> +
> +       HARD_TX_LOCK(dev, txq, smp_processor_id());
> +       if (!netif_xmit_frozen_or_drv_stopped(txq))
> +               ret = netdev_start_xmit(skb, dev, txq, false);
> +       HARD_TX_UNLOCK(dev, txq);
> +
> +       local_bh_enable();
> +
> +       if (!dev_xmit_complete(ret))
> +               goto out_err;
> +
> +       return ret;
> +drop:
> +       atomic_long_inc(&dev->tx_dropped);
> +out_err:
> +       return NET_XMIT_DROP;
> +}

> +static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
> +                           size_t total_len)
> +{
> +       bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
> +       u32 max_batch = TX_BATCH_SIZE;
> +       struct xdp_sock *xs = xdp_sk(sk);
> +       bool sent_frame = false;
> +       struct xdp_desc desc;
> +       struct sk_buff *skb;
> +       int err = 0;
> +
> +       if (unlikely(!xs->tx))
> +               return -ENOBUFS;
> +       if (need_wait)
> +               return -EOPNOTSUPP;
> +
> +       mutex_lock(&xs->mutex);
> +
> +       while (xskq_peek_desc(xs->tx, &desc)) {

It is possible to pass a chain of skbs to validate_xmit_skb_list and
eventually pass this chain to xsk_xmit_skb, amortizing the cost of
taking the txq lock. Fine to ignore for this patch set.

> +               char *buffer;
> +               u32 id, len;
> +
> +               if (max_batch-- == 0) {
> +                       err = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               if (xskq_reserve_id(xs->umem->cq)) {
> +                       err = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               len = desc.len;
> +               if (unlikely(len > xs->dev->mtu)) {
> +                       err = -EMSGSIZE;
> +                       goto out;
> +               }
> +
> +               skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
> +               if (unlikely(!skb)) {
> +                       err = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               skb_put(skb, len);
> +               id = desc.idx;
> +               buffer = xdp_umem_get_data(xs->umem, id) + desc.offset;
> +               err = skb_store_bits(skb, 0, buffer, len);
> +               if (unlikely(err))
> +                       goto out_store;

As xsk_destruct_skb delays notification until consume_skb is called, this
copy can be avoided by linking the xdp buffer into the skb frags array,
analogous to tpacket_snd.

You probably don't care much about the copy slow path, and this can be
implemented later, so also no need to do in this patchset.

> +static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
> +                                             struct xdp_desc *desc)
> +{
> +       struct xdp_rxtx_ring *ring;
> +
> +       if (q->cons_tail == q->cons_head) {
> +               WRITE_ONCE(q->ring->consumer, q->cons_tail);
> +               q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
> +
> +               /* Order consumer and data */
> +               smp_rmb();
> +
> +               return xskq_validate_desc(q, desc);
> +       }
> +
> +       ring = (struct xdp_rxtx_ring *)q->ring;
> +       *desc = ring->desc[q->cons_tail & q->ring_mask];
> +       return desc;

This only validates descriptors when the branch is taken; the fall-through
path returns the descriptor unchecked.

* Re: [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support
From: Willem de Bruijn @ 2018-04-24 16:56 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <20180423135619.7179-8-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> Here the actual receive functions of AF_XDP are implemented, that in a
> later commit, will be called from the XDP layers.
>
> There's one set of functions for the XDP_DRV side and another for
> XDP_SKB (generic).
>
> Support for the poll syscall is also implemented.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---

> +/* Common functions operating for both RXTX and umem queues */
> +
> +static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
> +{
> +       u32 entries = q->prod_tail - q->cons_tail;
> +
> +       if (entries == 0) {
> +               /* Refresh the local pointer */
> +               q->prod_tail = READ_ONCE(q->ring->producer);
> +       }
> +
> +       entries = q->prod_tail - q->cons_tail;

Probably meant to be inside the branch? Though I see the same
pattern in the userspace example program.
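
A minimal userspace sketch of the restructuring hinted at above. It assumes
a hypothetical struct ring_view in place of struct xsk_queue and a plain
load where the kernel uses READ_ONCE(); the clamp to dcnt is also an
assumption, since the quoted hunk ends before the return:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xsk_queue: cached indices plus the
 * producer index that the other side of the ring updates.
 */
struct ring_view {
	uint32_t prod_tail;        /* cached producer index */
	uint32_t cons_tail;        /* local consumer index */
	uint32_t shared_producer;  /* stands in for q->ring->producer */
};

static uint32_t ring_nb_avail(struct ring_view *q, uint32_t dcnt)
{
	uint32_t entries = q->prod_tail - q->cons_tail;

	if (entries == 0) {
		/* Refresh the cached pointer and recompute only here. */
		q->prod_tail = q->shared_producer;
		entries = q->prod_tail - q->cons_tail;
	}

	/* Assumed: never report more than the caller asked for. */
	return entries > dcnt ? dcnt : entries;
}
```

The unsigned subtraction keeps working when the 32-bit indices wrap, so
the recomputation does not need a special case for wraparound.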

> +static inline u32 *xskq_validate_id(struct xsk_queue *q)
> +{
> +       while (q->cons_tail != q->cons_head) {
> +               struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
> +               unsigned int idx = q->cons_tail & q->ring_mask;
> +
> +               if (xskq_is_valid_id(q, ring->desc[idx]))
> +                       return &ring->desc[idx];

Missing a q->cons_tail increment in this loop?
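
To illustrate the concern, here is a small userspace sketch of such a loop
with the increment in place. struct id_queue, RING_SIZE and the
"id < max_id" validity rule are hypothetical stand-ins, not the patch's
definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u  /* hypothetical, power of two */

struct id_queue {
	uint32_t cons_tail;
	uint32_t cons_head;
	uint32_t ring_mask;
	uint32_t max_id;          /* assumed rule: ids >= max_id are invalid */
	uint32_t desc[RING_SIZE];
};

/* Return the first valid id between cons_tail and cons_head, or NULL. */
static uint32_t *id_queue_validate(struct id_queue *q)
{
	while (q->cons_tail != q->cons_head) {
		uint32_t idx = q->cons_tail & q->ring_mask;

		if (q->desc[idx] < q->max_id)
			return &q->desc[idx];

		/* Skip the invalid entry; without this the loop would
		 * spin on the same index forever.
		 */
		q->cons_tail++;
	}

	return NULL;
}
```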

* Re: [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
From: Willem de Bruijn @ 2018-04-24 16:56 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <20180423135619.7179-9-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The xskmap is yet another BPF map, very much inspired by
> dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
> adds AF_XDP sockets into the map, and by using the bpf_redirect_map
> helper, an XDP program can redirect XDP frames to an AF_XDP socket.
>
> Note that a socket that is bound to certain ifindex/queue index will
> *only* accept XDP frames from that netdev/queue index. If an XDP
> program tries to redirect from a netdev/queue index other than what
> the socket is bound to, the frame will not be received on the socket.
>
> A socket can reside in multiple maps.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

> +struct xsk_map_entry {
> +       struct xdp_sock *xs;
> +       struct rcu_head rcu;
> +};

> +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
> +{
> +       struct xsk_map *m = container_of(map, struct xsk_map, map);
> +       struct xsk_map_entry *entry;
> +
> +       if (key >= map->max_entries)
> +               return NULL;
> +
> +       entry = READ_ONCE(m->xsk_map[key]);
> +       return entry ? entry->xs : NULL;
> +}

This dynamically allocated structure adds an extra cacheline lookup. If
xdp_sock gets an rcu_head, it can be linked into the map directly.

* Re: [PATCH bpf-next 05/15] xsk: add support for bind for Rx
From: Willem de Bruijn @ 2018-04-24 16:55 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180423135619.7179-6-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Here, the bind syscall is added. Binding an AF_XDP socket, means
> associating the socket to an umem, a netdev and a queue index. This
> can be done in two ways.
>
> The first way, creating a "socket from scratch". Create the umem using
> the XDP_UMEM_REG setsockopt and an associated fill queue with
> XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
> setsockopt. Call bind passing ifindex and queue index ("channel" in
> ethtool speak).
>
> The second way to bind a socket, is simply skipping the
> umem/netdev/queue index, and passing another already setup AF_XDP
> socket. The new socket will then have the same umem/netdev/queue index
> as the parent so it will share the same umem. You must also set the
> flags field in the socket address to XDP_SHARED_UMEM.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---

> +static struct socket *xsk_lookup_xsk_from_fd(int fd, int *err)
> +{
> +       struct socket *sock;
> +
> +       *err = -ENOTSOCK;
> +       sock = sockfd_lookup(fd, err);
> +       if (!sock)
> +               return NULL;
> +
> +       if (sock->sk->sk_family != PF_XDP) {
> +               *err = -ENOPROTOOPT;
> +               sockfd_put(sock);
> +               return NULL;
> +       }
> +
> +       *err = 0;
> +       return sock;
> +}

In this and similar cases, can use ERR_PTR to avoid the extra argument.
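
For readers unfamiliar with the idiom, a userspace sketch of the
pointer-encoded error pattern follows. The lowercase helpers mimic the
kernel's ERR_PTR()/IS_ERR()/PTR_ERR() but are re-implemented here;
struct fake_sock, FAKE_PF_XDP and the two-entry table are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel helpers. */
static inline void *err_ptr(intptr_t error) { return (void *)error; }
static inline intptr_t ptr_err(const void *ptr) { return (intptr_t)ptr; }
static inline int is_err(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for error codes. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

#define FAKE_PF_XDP 44  /* hypothetical family value for this sketch */

struct fake_sock { int family; };

static struct fake_sock fake_table[2] = { { FAKE_PF_XDP }, { 2 } };

/* One return value carries either the socket or the error code. */
static struct fake_sock *lookup_xsk(int fd)
{
	if (fd < 0 || fd >= 2)
		return err_ptr(-ENOTSOCK);
	if (fake_table[fd].family != FAKE_PF_XDP)
		return err_ptr(-ENOPROTOOPT);
	return &fake_table[fd];
}
```

A caller would then test is_err() on the result and recover the code with
ptr_err(), so the int *err out-parameter is no longer needed.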

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
From: Willem de Bruijn @ 2018-04-24 16:55 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Michael S. Tsirkin, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAJ8uoz0CSmQeO6E4fNrvivPau5zDOJhTtbXBxX_Z3HY5c3gvAQ@mail.gmail.com>

>>>> +/* Pgoff for mmaping the rings */
>>>> +#define XDP_UMEM_PGOFF_FILL_RING     0x100000000
>>>> +
>>>> +struct xdp_ring {
>>>> +     __u32 producer __attribute__((aligned(64)));
>>>> +     __u32 consumer __attribute__((aligned(64)));
>>>> +};
>>>
>>> Why 64? And do you still need these guys in uapi?
>>
>> I was just about to ask the same. You mean cacheline_aligned?
>
> Yes, I would like to have these cache aligned. How can I accomplish
> this in a uapi?

Good point. This seems fine to me.

> I put a note around this in the cover letter:
>
> * How to deal with cache alignment for uapi when different
>   architectures can have different cache line sizes? We have just
>   aligned it to 64 bytes for now, which works for many popular
>   architectures, but not all. Please advise.
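
A small sketch of what the fixed 64-byte choice implies for the layout,
assuming a GCC/Clang-style aligned attribute and userspace uint32_t in
place of __u32. RING_ALIGN and struct ring_hdr are illustrative names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed value as in the patch, not a detected cache line size. */
#define RING_ALIGN 64

struct ring_hdr {
	uint32_t producer __attribute__((aligned(RING_ALIGN)));
	uint32_t consumer __attribute__((aligned(RING_ALIGN)));
};

/* With a hard-coded alignment, both sides of the ABI see the same
 * offsets regardless of the real cache line size of the machine.
 */
_Static_assert(offsetof(struct ring_hdr, consumer) == RING_ALIGN,
	       "consumer starts its own 64-byte slot");
_Static_assert(sizeof(struct ring_hdr) == 2 * RING_ALIGN,
	       "two padded slots");
```

On a machine with larger cache lines the two indices could still share a
line, which is the trade-off the cover letter asks about.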
>
>>
>>>> +static int xsk_mmap(struct file *file, struct socket *sock,
>>>> +                 struct vm_area_struct *vma)
>>>> +{
>>>> +     unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>>>> +     unsigned long size = vma->vm_end - vma->vm_start;
>>>> +     struct xdp_sock *xs = xdp_sk(sock->sk);
>>>> +     struct xsk_queue *q;
>>>> +     unsigned long pfn;
>>>> +     struct page *qpg;
>>>> +
>>>> +     if (!xs->umem)
>>>> +             return -EINVAL;
>>>> +
>>>> +     if (offset == XDP_UMEM_PGOFF_FILL_RING)
>>>> +             q = xs->umem->fq;
>>>> +     else
>>>> +             return -EINVAL;
>>>> +
>>>> +     qpg = virt_to_head_page(q->ring);
>>
>> Is it assured that q is initialized with a call to setsockopt
>> XDP_UMEM_FILL_RING before the call to mmap?
>
> Unfortunately not, so this is a bug. Case in point for running
> syzkaller below, definitely.
>
>> In general, with such an extensive new API, it might be worthwhile to
>> run syzkaller locally on a kernel with these patches. It is pretty
>> easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
>> though it also needs to be taught about any new APIs.
>
> Good idea. Will set this up and have it torture the API.
>
> Thanks: Magnus

Great, thanks. I forgot to mention how to encode the new APIs for syzkaller:

https://github.com/google/syzkaller/blob/master/docs/syscall_descriptions.md

* Re: [PATCH] net: phy: TLK10X initial driver submission
From: Florian Fainelli @ 2018-04-24 16:52 UTC (permalink / raw)
  To: Måns Andersson, Rob Herring, Mark Rutland, Andrew Lunn,
	netdev, devicetree, linux-kernel
In-Reply-To: <20180419082816.109338-1-mans.andersson@nibe.se>



On 04/19/2018 01:28 AM, Måns Andersson wrote:
> From: Mans Andersson <mans.andersson@nibe.se>
> 
> Add support for the TI TLK105 and TLK106 10/100Mbit ethernet phys.
> 
> In addition the TLK10X needs to be removed from DP83848 driver as the
> power back off support is added here for this device.

I do not think this is a compelling enough reason; you could very well
adjust the dp83848.c driver to account for the properties that you are
introducing. More comments below.

[snip]

> +#define TLK10X_INT_EN_MASK		\
> +	(TLK10X_MISR_ANC_INT_EN |	\
> +	 TLK10X_MISR_DUP_INT_EN |	\
> +	 TLK10X_MISR_SPD_INT_EN |	\
> +	 TLK10X_MISR_LINK_INT_EN)
> +
> +struct tlk10x_private {
> +	int pwrbo_level;

unsigned int

> +};
> +
> +static int tlk10x_read(struct phy_device *phydev, int reg)
> +{
> +	if (reg & ~0x1f) {
> +		/* Extended register */
> +		phy_write(phydev, TLK10X_REGCR, 0x001F);
> +		phy_write(phydev, TLK10X_ADDAR, reg);
> +		phy_write(phydev, TLK10X_REGCR, 0x401F);
> +		reg = TLK10X_ADDAR;
> +	}

Humm, this looks a bit fragile, you would likely want to create separate
helper functions for these extended registers and make sure you handle
write failures as well. Also consider making use of the page helpers
from include/linux/phy.h.
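
As a rough illustration of the split into helpers with error propagation,
here is a userspace sketch against a mock bus. The 0x001F/0x401F sequence
is taken from the patch; the register numbers, struct mock_bus and all
function names are assumptions made for the sketch, and it does not use
the page helpers mentioned above:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Placeholder register numbers, not checked against the datasheet. */
#define REG_REGCR 0x0d
#define REG_ADDAR 0x0e

/* Mock MDIO bus that can be told to fail the Nth write. */
struct mock_bus {
	int fail_on_write;   /* 1-based index of the write to fail, 0 = never */
	int writes;
	uint16_t regcr, addr;
	uint16_t ext[256];
};

static int bus_write(struct mock_bus *b, int reg, int val)
{
	if (++b->writes == b->fail_on_write)
		return -EIO;
	if (reg == REG_REGCR)
		b->regcr = (uint16_t)val;
	else if (reg == REG_ADDAR && b->regcr == 0x001F)
		b->addr = (uint16_t)val;
	else if (reg == REG_ADDAR && b->regcr == 0x401F)
		b->ext[b->addr & 0xff] = (uint16_t)val;
	return 0;
}

static int bus_read(struct mock_bus *b, int reg)
{
	if (reg == REG_ADDAR && b->regcr == 0x401F)
		return b->ext[b->addr & 0xff];
	return 0;
}

/* Select an extended register; stop at the first failed write. */
static int ext_select(struct mock_bus *b, int reg)
{
	int ret;

	ret = bus_write(b, REG_REGCR, 0x001F);
	if (ret < 0)
		return ret;
	ret = bus_write(b, REG_ADDAR, reg);
	if (ret < 0)
		return ret;
	return bus_write(b, REG_REGCR, 0x401F);
}

static int ext_read(struct mock_bus *b, int reg)
{
	int ret = ext_select(b, reg);

	return ret < 0 ? ret : bus_read(b, REG_ADDAR);
}

static int ext_write(struct mock_bus *b, int reg, int val)
{
	int ret = ext_select(b, reg);

	return ret < 0 ? ret : bus_write(b, REG_ADDAR, val);
}
```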

> +
> +	return phy_read(phydev, reg);
> +}
> +
> +static int tlk10x_write(struct phy_device *phydev, int reg, int val)
> +{
> +	if (reg & ~0x1f) {
> +		/* Extended register */
> +		phy_write(phydev, TLK10X_REGCR, 0x001F);
> +		phy_write(phydev, TLK10X_ADDAR, reg);
> +		phy_write(phydev, TLK10X_REGCR, 0x401F);
> +		reg = TLK10X_ADDAR;
> +	}

Same here.

> +
> +	return phy_write(phydev, reg, val);
> +}
> +
> +#ifdef CONFIG_OF_MDIO
> +static int tlk10x_of_init(struct phy_device *phydev)
> +{
> +	struct tlk10x_private *tlk10x = phydev->priv;
> +	struct device *dev = &phydev->mdio.dev;
> +	struct device_node *of_node = dev->of_node;
> +	int ret;
> +
> +	if (!of_node)
> +		return 0;
> +
> +	ret = of_property_read_u32(of_node, "ti,power-back-off",
> +				   &tlk10x->pwrbo_level);
> +	if (ret) {
> +		dev_err(dev, "missing ti,power-back-off property");
> +		tlk10x->pwrbo_level = 0;

This should not be necessary, that should be the default with a zero
initialized private data structure.

> +	}
> +
> +	return 0;
> +}
> +#else
> +static int tlk10x_of_init(struct phy_device *phydev)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_OF_MDIO */
> +
> +static int tlk10x_config_init(struct phy_device *phydev)
> +{
> +	int ret, reg;
> +	struct tlk10x_private *tlk10x;
> +
> +	ret = genphy_config_init(phydev);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (!phydev->priv) {
> +		tlk10x = devm_kzalloc(&phydev->mdio.dev, sizeof(*tlk10x),
> +				      GFP_KERNEL);
> +		if (!tlk10x)
> +			return -ENOMEM;
> +
> +		phydev->priv = tlk10x;
> +		ret = tlk10x_of_init(phydev);
> +		if (ret)
> +			return ret;
> +	} else {
> +		tlk10x = (struct tlk10x_private *)phydev->priv;
> +	}

You need to implement a probe() function that is responsible for
allocating private memory instead of doing this check.

> +
> +	// Power back off
> +	if (tlk10x->pwrbo_level < 0 || tlk10x->pwrbo_level > 3)
> +		tlk10x->pwrbo_level = 0;

How can you have pwrbo_level < 0 when you use of_property_read_u32()?

> +	reg = tlk10x_read(phydev, TLK10X_PWRBOCR);
> +	reg = ((reg & ~TLK10X_PWRBOCR_MASK)
> +		| (tlk10x->pwrbo_level << 6));

One too many levels of parentheses; the outer ones should not be necessary.
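
Taken together with the unsigned int comment, the range check and the
register update might look like the sketch below. The mask value 0xC0 is
an assumption inferred from the shift by 6 and the 0..3 range, and the
helper names are made up for illustration:

```c
#include <assert.h>

/* Assumed: two-bit field at bits 7:6, matching "level << 6". */
#define PWRBOCR_MASK  0x00c0u
#define PWRBOCR_SHIFT 6

/* With an unsigned level only the upper bound needs checking. */
static unsigned int pwrbo_sanitize(unsigned int level)
{
	return level > 3 ? 0 : level;
}

/* Replace the back-off field and leave the other bits untouched. */
static unsigned int pwrbo_apply(unsigned int reg, unsigned int level)
{
	return (reg & ~PWRBOCR_MASK) | (pwrbo_sanitize(level) << PWRBOCR_SHIFT);
}
```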

> +	ret = tlk10x_write(phydev, TLK10X_PWRBOCR, reg);
> +	if (ret < 0) {
> +		dev_err(&phydev->mdio.dev,
> +			"unable to set power back-off (err=%d)\n", ret);
> +		return ret;
> +	}
> +	dev_info(&phydev->mdio.dev, "power back-off set to level %d\n",
> +		 tlk10x->pwrbo_level);

config_init() is called often, consider making this a debugging statement.

-- 
Florian

* Re: VRF: Ingress IPv6 Linklocal/Multicast destined pkt from slave VRF device does not map to Master device socket
From: David Ahern @ 2018-04-24 16:51 UTC (permalink / raw)
  To: Sukumar Gopalakrishnan, netdev
In-Reply-To: <CADiZnkRRJCrHu_QkwNb3G49gdyicJkbeB8YctrMb6jZc9uq6rg@mail.gmail.com>

On 4/23/18 11:57 PM, Sukumar Gopalakrishnan wrote:
> Get master device address from (skb->dev) and pass the master to the socket
> lookup function for IPv6 Linklocal/Multicast addresses.
> 
> ipv6_raw_deliver()
> {
> int mdif;
> ..
> ..
>         mdif = (((nexthdr == IPPROTO_PIM || nexthdr == 89 /* IPPROTO_OSPF */ ||
>                 nexthdr == IPPROTO_ICMPV6 || nexthdr == 112 /*IPPROTO_VRRP*/) &&
>                 (ipv6_addr_type(daddr) &
>                 (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL))) ?
>                 l3mdev_master_ifindex_rcu(skb->dev) : inet6_iif(skb));
> 
> 
>         sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, mdif,
> inet6_sdif(skb));
> 

Packets destined to a linklocal and mcast address stay bound to the
actual ingress device as that is their scope.

* Re: [PATCH] net: phy: allow scanning busses with missing phys
From: Florian Fainelli @ 2018-04-24 16:37 UTC (permalink / raw)
  To: Alexandre Belloni, Andrew Lunn
  Cc: David S . Miller, Allan Nielsen, Thomas Petazzoni, netdev,
	linux-kernel
In-Reply-To: <20180424160904.32457-1-alexandre.belloni@bootlin.com>



On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
> Some MDIO busses will error out when trying to read a phy address with no
> phy present at that address. In that case, probing the bus will fail
> because __mdiobus_register() is scanning the bus for all possible phy
> addresses.
> 
> In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
> this address and set the phy ID to 0xffffffff which is then properly
> handled in get_phy_device().

Humm, why not have your MDIO bus implementation do the scanning itself
in a reset() callback, which happens before probing the bus, and based
on the results, set phy_mask accordingly such that only PHYs present are
populated?

My only concern with your change is that we are having a special
treatment for EIO and ENODEV, so we must make sure MDIO bus drivers are
all conforming to that.
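
A userspace sketch of the alternative suggested above: probe every address
once and build a mask of the addresses to skip. struct mock_mdio and both
functions are hypothetical; the only kernel convention relied on is that a
set bit in phy_mask means "do not probe this address":

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PHY_MAX_ADDR 32

/* Mock bus: bit n set means a PHY answers at address n. */
struct mock_mdio {
	uint32_t present;
};

/* Reading the ID of an absent PHY errors out, as on the bus described. */
static int mock_read_id(const struct mock_mdio *bus, int addr)
{
	return ((bus->present >> addr) & 1) ? 0x2000 : -EIO;
}

/* What a reset() callback could compute before the bus is registered. */
static uint32_t build_phy_mask(const struct mock_mdio *bus)
{
	uint32_t mask = 0;
	int addr;

	for (addr = 0; addr < PHY_MAX_ADDR; addr++)
		if (mock_read_id(bus, addr) < 0)
			mask |= UINT32_C(1) << addr;

	return mask;
}
```

With this approach the special handling of the error codes stays inside
the one bus driver that needs it.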

> 
> Suggested-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> ---
>  drivers/net/phy/phy_device.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index ac23322a32e1..9e4ba8e80a18 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -535,8 +535,17 @@ static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id,
>  
>  	/* Grab the bits from PHYIR1, and put them in the upper half */
>  	phy_reg = mdiobus_read(bus, addr, MII_PHYSID1);
> -	if (phy_reg < 0)
> +	if (phy_reg < 0) {
> +		/* if there is no device, return without an error so scanning
> +		 * the bus works properly
> +		 */
> +		if (phy_reg == -EIO || phy_reg == -ENODEV) {
> +			*phy_id = 0xffffffff;
> +			return 0;
> +		}
> +
>  		return -EIO;
> +	}
>  
>  	*phy_id = (phy_reg & 0xffff) << 16;
>  
> 

-- 
Florian

* Re: [PATCH] net: sh-eth: fix sh_eth_start_xmit()'s return type
From: Sergei Shtylyov @ 2018-04-24 16:36 UTC (permalink / raw)
  To: Luc Van Oostenryck, linux-kernel
  Cc: David S. Miller, Geert Uytterhoeven, Thomas Petazzoni,
	Laurent Pinchart, Simon Horman, Niklas Söderlund, netdev,
	linux-renesas-soc
In-Reply-To: <20180424131720.4357-1-luc.vanoostenryck@gmail.com>

Hello!

On 04/24/2018 04:17 PM, Luc Van Oostenryck wrote:

> The method ndo_start_xmit() is defined as returning a 'netdev_tx_t',
> which is a typedef for an enum type, but the implementation in this
> driver returns an 'int'.
> 
> Fix this by returning 'netdev_tx_t' in this driver too.
> 
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>

Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

[...]
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index b6b90a631..0875a169f 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -2454,7 +2454,7 @@ static void sh_eth_tx_timeout(struct net_device *ndev)
>  }
>  
>  /* Packet transmit function */
> -static int sh_eth_start_xmit(struct sk_buff *skb, struct net_device *ndev)
> +static netdev_tx_t sh_eth_start_xmit(struct sk_buff *skb, struct net_device *ndev)

   But aren't you violating the 80-column limit?

[...]

MBR, Sergei

* Re: [PATCH] net: phy: TLK10X initial driver submission
From: Rob Herring @ 2018-04-24 16:34 UTC (permalink / raw)
  To: Måns Andersson
  Cc: Mark Rutland, Andrew Lunn, Florian Fainelli, netdev, devicetree,
	linux-kernel
In-Reply-To: <20180419082816.109338-1-mans.andersson@nibe.se>

On Thu, Apr 19, 2018 at 10:28:16AM +0200, Måns Andersson wrote:
> From: Mans Andersson <mans.andersson@nibe.se>
> 
> Add support for the TI TLK105 and TLK106 10/100Mbit ethernet phys.
> 
> In addition the TLK10X needs to be removed from DP83848 driver as the
> power back off support is added here for this device.
> 
> Datasheet:
> http://www.ti.com/lit/gpn/tlk106
> ---
>  .../devicetree/bindings/net/ti,tlk10x.txt          |  27 +++

Please split bindings to a separate patch.

>  drivers/net/phy/Kconfig                            |   5 +
>  drivers/net/phy/Makefile                           |   1 +
>  drivers/net/phy/dp83848.c                          |   3 -
>  drivers/net/phy/tlk10x.c                           | 209 +++++++++++++++++++++
>  5 files changed, 242 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/net/ti,tlk10x.txt
>  create mode 100644 drivers/net/phy/tlk10x.c
> 
> diff --git a/Documentation/devicetree/bindings/net/ti,tlk10x.txt b/Documentation/devicetree/bindings/net/ti,tlk10x.txt
> new file mode 100644
> index 0000000..371d0d7
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/ti,tlk10x.txt
> @@ -0,0 +1,27 @@
> +* Texas Instruments - TLK105 / TLK106 ethernet PHYs
> +
> +Required properties:
> +	- reg - The ID number for the phy, usually a small integer

Isn't this the MDIO bus address?

This should have a compatible string too.

> +
> +Optional properties:
> +	- ti,power-back-off - Power Back Off Level
> +		Please refer to data sheet chapter 8.6 and TI Application
> +		Note SLLA328
> +		0 - Normal Operation
> +		1 - Level 1 (up to 140m cable between TLK link partners)
> +		2 - Level 2 (up to 100m cable between TLK link partners)
> +		3 - Level 3 (up to 80m cable between TLK link partners)
> +
> +Default child nodes are standard Ethernet PHY device
> +nodes as described in Documentation/devicetree/bindings/net/phy.txt
> +
> +Example:
> +
> +	ethernet-phy@0 {
> +		reg = <0>;
> +		ti,power-back-off = <2>;
> +	};
> +
> +Datasheets and documentation can be found at:
> +http://www.ti.com/lit/gpn/tlk106
> +http://www.ti.com/lit/an/slla328/slla328.pdf

Move this to before the properties.

Rob

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-24 16:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424161242.GK17484@dhcp22.suse.cz>



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > 
> > > > Fixing __vmalloc code 
> > > > is easy and it doesn't require cooperation with maintainers.
> > > 
> > > But it is a hack against the intention of the scope api.
> > 
> > It is not!
> 
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.

That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
developer) from converting the code to the scope API. You make nonsensical 
excuses.

Mikulas

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-24 16:29 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424161242.GK17484@dhcp22.suse.cz>

On Tue 24-04-18 10:12:42, Michal Hocko wrote:
> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > 
> > > > Fixing __vmalloc code 
> > > > is easy and it doesn't require cooperation with maintainers.
> > > 
> > > But it is a hack against the intention of the scope api.
> > 
> > It is not!
> 
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.

http://lkml.kernel.org/r/20180424162712.GL17484@dhcp22.suse.cz

let's see how it rolls this time.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 16:29 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241142340.15660@file01.intranet.prod.int.rdu2.redhat.com>

On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > [...]
> > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > >  	 */
> > >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > >  
> > > +#ifdef CONFIG_DEBUG_SG
> > > +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > +	if (!(prandom_u32_max(2) & 1))
> > > +		goto do_vmalloc;
> > > +#endif
> > 
> > I really do not think there is anything DEBUG_SG specific here. Why do you
> > not simply follow the should_failslab path or even reuse the function?
> 
> CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if 
> you don't like CONFIG_DEBUG_SG, pick any other option that is enabled 
> there).

Are you telling me that you are shaping a debugging functionality based
on what RHEL has enabled? And you call me evil. This is just ridiculous.

> Fail-injection framework is off by default and it must be explicitly
> enabled and configured by the user - and most users won't enable it.

It can be enabled easily. And if you care enough for your debugging
kernel then just make it enabled unconditionally.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH net-next] net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt()
From: David Ahern @ 2018-04-24 16:25 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <20180424162249.41820-1-edumazet@google.com>

On 4/24/18 10:22 AM, Eric Dumazet wrote:
> rt6_remove_exception_rt() is called under rcu_read_lock() only.
> 
> We lock rt6_exception_lock a bit later, so we do not hold
> rt6_exception_lock yet.
> 
> Fixes: 8a14e46f1402 ("net/ipv6: Fix missing rcu dereferences on from")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Cc: David Ahern <dsahern@gmail.com>
> ---
>  net/ipv6/route.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
>

Acked-by: David Ahern <dsahern@gmail.com>

Thanks, Eric.

* [PATCH net-next] net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt()
From: Eric Dumazet @ 2018-04-24 16:22 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, David Ahern

rt6_remove_exception_rt() is called under rcu_read_lock() only.

We lock rt6_exception_lock a bit later, so we do not hold
rt6_exception_lock yet.

Fixes: 8a14e46f1402 ("net/ipv6: Fix missing rcu dereferences on from")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ac3e51631c659b5c5c8a93c17011cb7f3ad266e2..432c4bcc1111085671f32987e4673e47898085a3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1546,8 +1546,7 @@ static int rt6_remove_exception_rt(struct rt6_info *rt)
 	struct fib6_info *from;
 	int err;
 
-	from = rcu_dereference_protected(rt->from,
-					 lockdep_is_held(&rt6_exception_lock));
+	from = rcu_dereference(rt->from);
 	if (!from ||
 	    !(rt->rt6i_flags & RTF_CACHE))
 		return -EINVAL;
-- 
2.17.0.484.g0c8726318c-goog

* Re: [net-next regression] kselftest failure in fib_nl_newrule()
From: Roopa Prabhu @ 2018-04-24 16:17 UTC (permalink / raw)
  To: Anders Roxell
  Cc: David Miller, David Ahern, netdev, Linux Kernel Mailing List
In-Reply-To: <CADYN=9J5yfmZeQoEmUfF_zwGkru=L1V6G7Tbz8yxt12Z6=qPgg@mail.gmail.com>

On Tue, Apr 24, 2018 at 2:46 AM, Anders Roxell <anders.roxell@linaro.org> wrote:
> Hi,
>
> fib-onlink-tests.sh (from kselftest) found a regression between
> next-20180424 [1] (worked with tag next-20180423 [2])
>
> here are three commits that look suspicious, especially this patch (sha:
> f9d4b0c1e969), which rewrites fib_nl_newrule().
>
> b16fb418b1bf ("net: fib_rules: add extack support")
> f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs
> into fib_nl2rule")
> 8a14e46f1402 ("net/ipv6: Fix missing rcu dereferences on from")
>
> Cheers,
> Anders
> [1] https://lkft.validation.linaro.org/scheduler/job/195181#L3447
> [2] https://lkft.validation.linaro.org/scheduler/job/193410#L3438


Thanks for the report.

It should be fixed by my last commit:
9c20b9372fba "net: fib_rules: fix l3mdev netlink attr processing"

Just ran them again and they pass.

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-24 16:12 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241107010.31601@file01.intranet.prod.int.rdu2.redhat.com>

On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > 
> > > Fixing __vmalloc code 
> > > is easy and it doesn't require cooperation with maintainers.
> > 
> > But it is a hack against the intention of the scope api.
> 
> It is not!

This discussion simply doesn't make much sense it seems. The scope API
is to document the scope of the reclaim recursion critical section. That
certainly is not a utility function like vmalloc.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH net-next] Revert "net: init sk_cookie for inet socket"
From: Eric Dumazet @ 2018-04-24 16:10 UTC (permalink / raw)
  To: Yafang Shao, Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <CALOAHbAEO=XmxPoV0gyeqYAAzEE5WtaT9uh5TUSAJ0Q+XuHZ8g@mail.gmail.com>



On 04/24/2018 08:59 AM, Yafang Shao wrote:
> On Tue, Apr 24, 2018 at 11:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>> On 04/24/2018 08:12 AM, Yafang Shao wrote:
>>> On Tue, Apr 24, 2018 at 8:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>
>>>>
>>>> On 04/24/2018 05:05 AM, Yafang Shao wrote:
>>>>> This reverts commit <c6849a3ac17e> ("net: init sk_cookie for inet socket")
>>>>>
>>>>> Per discussion with Eric.
>>>>>
>>>>
>>>> I suggest you include a bit more details, about cache line false sharing.
>>>>
>>>
>>> Could we adjust the struct common to avoid this kind of cache line
>>> false sharing?
>>> I mean removing "atomic64_t  skc_cookie;" from struct sock_common and
>>> placing it in struct inet_sock?
>>
>> The false sharing is not there, it is on net->cookie_gen
>>
> 
> Yes.
> This is the current issue.
> Maybe we should adjust struct net as well.

This field will still need to be modified by many cpus.

Its exact placement in memory won't avoid false sharing and stalls.

> 
> Regarding sk_cookie, as it is only used by inet_sock now, maybe it is
> better placed in struct inet_sock?
>

You are mistaken.

It is used on all sockets really (including request_sock and timewait)

ss -temoia   will give you socket ids for all sockets types.

^ permalink raw reply

* [PATCH v3 net] sfc: ARFS filter IDs
From: Edward Cree @ 2018-04-24 16:09 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev

Associate an arbitrary ID with each ARFS filter, so that expiry can be
 queried properly.  The association is maintained in a hash table, which is
 protected by a spinlock.

v3: fix build warnings when CONFIG_RFS_ACCEL is disabled (thanks lkp-robot).
v2: fixed uninitialised variable (thanks davem and lkp-robot).

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c       |  80 +++++++++++--------
 drivers/net/ethernet/sfc/efx.c        | 143 ++++++++++++++++++++++++++++++++++
 drivers/net/ethernet/sfc/efx.h        |  21 +++++
 drivers/net/ethernet/sfc/farch.c      |  41 ++++++++--
 drivers/net/ethernet/sfc/net_driver.h |  36 +++++++++
 drivers/net/ethernet/sfc/rx.c         |  62 +++++++++++++--
 6 files changed, 337 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 83ce229f4eb7..63036d9bf3e6 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -3999,29 +3999,6 @@ static void efx_ef10_prepare_flr(struct efx_nic *efx)
 	atomic_set(&efx->active_queues, 0);
 }
 
-static bool efx_ef10_filter_equal(const struct efx_filter_spec *left,
-				  const struct efx_filter_spec *right)
-{
-	if ((left->match_flags ^ right->match_flags) |
-	    ((left->flags ^ right->flags) &
-	     (EFX_FILTER_FLAG_RX | EFX_FILTER_FLAG_TX)))
-		return false;
-
-	return memcmp(&left->outer_vid, &right->outer_vid,
-		      sizeof(struct efx_filter_spec) -
-		      offsetof(struct efx_filter_spec, outer_vid)) == 0;
-}
-
-static unsigned int efx_ef10_filter_hash(const struct efx_filter_spec *spec)
-{
-	BUILD_BUG_ON(offsetof(struct efx_filter_spec, outer_vid) & 3);
-	return jhash2((const u32 *)&spec->outer_vid,
-		      (sizeof(struct efx_filter_spec) -
-		       offsetof(struct efx_filter_spec, outer_vid)) / 4,
-		      0);
-	/* XXX should we randomise the initval? */
-}
-
 /* Decide whether a filter should be exclusive or else should allow
  * delivery to additional recipients.  Currently we decide that
  * filters for specific local unicast MAC and IP addresses are
@@ -4346,7 +4323,7 @@ static s32 efx_ef10_filter_insert(struct efx_nic *efx,
 		goto out_unlock;
 	match_pri = rc;
 
-	hash = efx_ef10_filter_hash(spec);
+	hash = efx_filter_spec_hash(spec);
 	is_mc_recip = efx_filter_is_mc_recipient(spec);
 	if (is_mc_recip)
 		bitmap_zero(mc_rem_map, EFX_EF10_FILTER_SEARCH_LIMIT);
@@ -4378,7 +4355,7 @@ static s32 efx_ef10_filter_insert(struct efx_nic *efx,
 		if (!saved_spec) {
 			if (ins_index < 0)
 				ins_index = i;
-		} else if (efx_ef10_filter_equal(spec, saved_spec)) {
+		} else if (efx_filter_spec_equal(spec, saved_spec)) {
 			if (spec->priority < saved_spec->priority &&
 			    spec->priority != EFX_FILTER_PRI_AUTO) {
 				rc = -EPERM;
@@ -4762,27 +4739,62 @@ static s32 efx_ef10_filter_get_rx_ids(struct efx_nic *efx,
 static bool efx_ef10_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 					   unsigned int filter_idx)
 {
+	struct efx_filter_spec *spec, saved_spec;
 	struct efx_ef10_filter_table *table;
-	struct efx_filter_spec *spec;
-	bool ret;
+	struct efx_arfs_rule *rule = NULL;
+	bool ret = true, force = false;
+	u16 arfs_id;
 
 	down_read(&efx->filter_sem);
 	table = efx->filter_state;
 	down_write(&table->lock);
 	spec = efx_ef10_filter_entry_spec(table, filter_idx);
 
-	if (!spec || spec->priority != EFX_FILTER_PRI_HINT) {
-		ret = true;
+	if (!spec || spec->priority != EFX_FILTER_PRI_HINT)
 		goto out_unlock;
-	}
 
-	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, 0)) {
-		ret = false;
-		goto out_unlock;
+	spin_lock_bh(&efx->rps_hash_lock);
+	if (!efx->rps_hash_table) {
+		/* In the absence of the table, we always return 0 to ARFS. */
+		arfs_id = 0;
+	} else {
+		rule = efx_rps_hash_find(efx, spec);
+		if (!rule)
+			/* ARFS table doesn't know of this filter, so remove it */
+			goto expire;
+		arfs_id = rule->arfs_id;
+		ret = efx_rps_check_rule(rule, filter_idx, &force);
+		if (force)
+			goto expire;
+		if (!ret) {
+			spin_unlock_bh(&efx->rps_hash_lock);
+			goto out_unlock;
+		}
 	}
-
+	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, arfs_id))
+		ret = false;
+	else if (rule)
+		rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+expire:
+	saved_spec = *spec; /* remove operation will kfree spec */
+	spin_unlock_bh(&efx->rps_hash_lock);
+	/* At this point (since we dropped the lock), another thread might queue
+	 * up a fresh insertion request (but the actual insertion will be held
+	 * up by our possession of the filter table lock).  In that case, it
+	 * will set rule->filter_id to EFX_ARFS_FILTER_ID_PENDING, meaning that
+	 * the rule is not removed by efx_rps_hash_del() below.
+	 */
 	ret = efx_ef10_filter_remove_internal(efx, 1U << spec->priority,
 					      filter_idx, true) == 0;
+	/* While we can't safely dereference rule (we dropped the lock), we can
+	 * still test it for NULL.
+	 */
+	if (ret && rule) {
+		/* Expiring, so remove entry from ARFS table */
+		spin_lock_bh(&efx->rps_hash_lock);
+		efx_rps_hash_del(efx, &saved_spec);
+		spin_unlock_bh(&efx->rps_hash_lock);
+	}
 out_unlock:
 	up_write(&table->lock);
 	up_read(&efx->filter_sem);
diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 692dd729ee2a..a4ebd8715494 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -3027,6 +3027,10 @@ static int efx_init_struct(struct efx_nic *efx,
 	mutex_init(&efx->mac_lock);
 #ifdef CONFIG_RFS_ACCEL
 	mutex_init(&efx->rps_mutex);
+	spin_lock_init(&efx->rps_hash_lock);
+	/* Failure to allocate is not fatal, but may degrade ARFS performance */
+	efx->rps_hash_table = kcalloc(EFX_ARFS_HASH_TABLE_SIZE,
+				      sizeof(*efx->rps_hash_table), GFP_KERNEL);
 #endif
 	efx->phy_op = &efx_dummy_phy_operations;
 	efx->mdio.dev = net_dev;
@@ -3070,6 +3074,10 @@ static void efx_fini_struct(struct efx_nic *efx)
 {
 	int i;
 
+#ifdef CONFIG_RFS_ACCEL
+	kfree(efx->rps_hash_table);
+#endif
+
 	for (i = 0; i < EFX_MAX_CHANNELS; i++)
 		kfree(efx->channel[i]);
 
@@ -3092,6 +3100,141 @@ void efx_update_sw_stats(struct efx_nic *efx, u64 *stats)
 	stats[GENERIC_STAT_rx_noskb_drops] = atomic_read(&efx->n_rx_noskb_drops);
 }
 
+bool efx_filter_spec_equal(const struct efx_filter_spec *left,
+			   const struct efx_filter_spec *right)
+{
+	if ((left->match_flags ^ right->match_flags) |
+	    ((left->flags ^ right->flags) &
+	     (EFX_FILTER_FLAG_RX | EFX_FILTER_FLAG_TX)))
+		return false;
+
+	return memcmp(&left->outer_vid, &right->outer_vid,
+		      sizeof(struct efx_filter_spec) -
+		      offsetof(struct efx_filter_spec, outer_vid)) == 0;
+}
+
+u32 efx_filter_spec_hash(const struct efx_filter_spec *spec)
+{
+	BUILD_BUG_ON(offsetof(struct efx_filter_spec, outer_vid) & 3);
+	return jhash2((const u32 *)&spec->outer_vid,
+		      (sizeof(struct efx_filter_spec) -
+		       offsetof(struct efx_filter_spec, outer_vid)) / 4,
+		      0);
+}
+
+#ifdef CONFIG_RFS_ACCEL
+bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
+			bool *force)
+{
+	if (rule->filter_id == EFX_ARFS_FILTER_ID_PENDING) {
+		/* ARFS is currently updating this entry, leave it */
+		return false;
+	}
+	if (rule->filter_id == EFX_ARFS_FILTER_ID_ERROR) {
+		/* ARFS tried and failed to update this, so it's probably out
+		 * of date.  Remove the filter and the ARFS rule entry.
+		 */
+		rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+		*force = true;
+		return true;
+	} else if (WARN_ON(rule->filter_id != filter_idx)) { /* can't happen */
+		/* ARFS has moved on, so old filter is not needed.  Since we did
+		 * not mark the rule with EFX_ARFS_FILTER_ID_REMOVING, it will
+		 * not be removed by efx_rps_hash_del() subsequently.
+		 */
+		*force = true;
+		return true;
+	}
+	/* Remove it iff ARFS wants to. */
+	return true;
+}
+
+struct hlist_head *efx_rps_hash_bucket(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec)
+{
+	u32 hash = efx_filter_spec_hash(spec);
+
+	WARN_ON(!spin_is_locked(&efx->rps_hash_lock));
+	if (!efx->rps_hash_table)
+		return NULL;
+	return &efx->rps_hash_table[hash % EFX_ARFS_HASH_TABLE_SIZE];
+}
+
+struct efx_arfs_rule *efx_rps_hash_find(struct efx_nic *efx,
+					const struct efx_filter_spec *spec)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (!head)
+		return NULL;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec))
+			return rule;
+	}
+	return NULL;
+}
+
+struct efx_arfs_rule *efx_rps_hash_add(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec,
+				       bool *new)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (!head)
+		return NULL;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec)) {
+			*new = false;
+			return rule;
+		}
+	}
+	rule = kmalloc(sizeof(*rule), GFP_ATOMIC);
+	*new = true;
+	if (rule) {
+		memcpy(&rule->spec, spec, sizeof(rule->spec));
+		hlist_add_head(&rule->node, head);
+	}
+	return rule;
+}
+
+void efx_rps_hash_del(struct efx_nic *efx, const struct efx_filter_spec *spec)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (WARN_ON(!head))
+		return;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec)) {
+			/* Someone already reused the entry.  We know that if
+			 * this check doesn't fire (i.e. filter_id == REMOVING)
+			 * then the REMOVING mark was put there by our caller,
+			 * because caller is holding a lock on filter table and
+			 * only holders of that lock set REMOVING.
+			 */
+			if (rule->filter_id != EFX_ARFS_FILTER_ID_REMOVING)
+				return;
+			hlist_del(node);
+			kfree(rule);
+			return;
+		}
+	}
+	/* We didn't find it. */
+	WARN_ON(1);
+}
+#endif
+
 /* RSS contexts.  We're using linked lists and crappy O(n) algorithms, because
  * (a) this is an infrequent control-plane operation and (b) n is small (max 64)
  */
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index a3140e16fcef..3f759ebdcf10 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -186,6 +186,27 @@ static inline void efx_filter_rfs_expire(struct work_struct *data) {}
 #endif
 bool efx_filter_is_mc_recipient(const struct efx_filter_spec *spec);
 
+bool efx_filter_spec_equal(const struct efx_filter_spec *left,
+			   const struct efx_filter_spec *right);
+u32 efx_filter_spec_hash(const struct efx_filter_spec *spec);
+
+#ifdef CONFIG_RFS_ACCEL
+bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
+			bool *force);
+
+struct efx_arfs_rule *efx_rps_hash_find(struct efx_nic *efx,
+					const struct efx_filter_spec *spec);
+
+/* @new is written to indicate if entry was newly added (true) or if an old
+ * entry was found and returned (false).
+ */
+struct efx_arfs_rule *efx_rps_hash_add(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec,
+				       bool *new);
+
+void efx_rps_hash_del(struct efx_nic *efx, const struct efx_filter_spec *spec);
+#endif
+
 /* RSS contexts */
 struct efx_rss_context *efx_alloc_rss_context_entry(struct efx_nic *efx);
 struct efx_rss_context *efx_find_rss_context_entry(struct efx_nic *efx, u32 id);
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index 7174ef5e5c5e..c72adf8b52ea 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -2905,18 +2905,45 @@ bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 {
 	struct efx_farch_filter_state *state = efx->filter_state;
 	struct efx_farch_filter_table *table;
-	bool ret = false;
+	bool ret = false, force = false;
+	u16 arfs_id;
 
 	down_write(&state->lock);
+	spin_lock_bh(&efx->rps_hash_lock);
 	table = &state->table[EFX_FARCH_FILTER_TABLE_RX_IP];
 	if (test_bit(index, table->used_bitmap) &&
-	    table->spec[index].priority == EFX_FILTER_PRI_HINT &&
-	    rps_may_expire_flow(efx->net_dev, table->spec[index].dmaq_id,
-				flow_id, 0)) {
-		efx_farch_filter_table_clear_entry(efx, table, index);
-		ret = true;
+	    table->spec[index].priority == EFX_FILTER_PRI_HINT) {
+		struct efx_arfs_rule *rule = NULL;
+		struct efx_filter_spec spec;
+
+		efx_farch_filter_to_gen_spec(&spec, &table->spec[index]);
+		if (!efx->rps_hash_table) {
+			/* In the absence of the table, we always returned 0 to
+			 * ARFS, so use the same to query it.
+			 */
+			arfs_id = 0;
+		} else {
+			rule = efx_rps_hash_find(efx, &spec);
+			if (!rule) {
+				/* ARFS table doesn't know of this filter, remove it */
+				force = true;
+			} else {
+				arfs_id = rule->arfs_id;
+				if (!efx_rps_check_rule(rule, index, &force))
+					goto out_unlock;
+			}
+		}
+		if (force || rps_may_expire_flow(efx->net_dev, spec.dmaq_id,
+						 flow_id, arfs_id)) {
+			if (rule)
+				rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+			efx_rps_hash_del(efx, &spec);
+			efx_farch_filter_table_clear_entry(efx, table, index);
+			ret = true;
+		}
 	}
-
+out_unlock:
+	spin_unlock_bh(&efx->rps_hash_lock);
 	up_write(&state->lock);
 	return ret;
 }
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index eea3808b3f25..65568925c3ef 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -734,6 +734,35 @@ struct efx_rss_context {
 };
 
 #ifdef CONFIG_RFS_ACCEL
+/* Order of these is important, since filter_id >= %EFX_ARFS_FILTER_ID_PENDING
+ * is used to test if filter does or will exist.
+ */
+#define EFX_ARFS_FILTER_ID_PENDING	-1
+#define EFX_ARFS_FILTER_ID_ERROR	-2
+#define EFX_ARFS_FILTER_ID_REMOVING	-3
+/**
+ * struct efx_arfs_rule - record of an ARFS filter and its IDs
+ * @node: linkage into hash table
+ * @spec: details of the filter (used as key for hash table).  Use efx->type to
+ *	determine which member to use.
+ * @rxq_index: channel to which the filter will steer traffic.
+ * @arfs_id: filter ID which was returned to ARFS
+ * @filter_id: index in software filter table.  May be
+ *	%EFX_ARFS_FILTER_ID_PENDING if filter was not inserted yet,
+ *	%EFX_ARFS_FILTER_ID_ERROR if filter insertion failed, or
+ *	%EFX_ARFS_FILTER_ID_REMOVING if expiry is currently removing the filter.
+ */
+struct efx_arfs_rule {
+	struct hlist_node node;
+	struct efx_filter_spec spec;
+	u16 rxq_index;
+	u16 arfs_id;
+	s32 filter_id;
+};
+
+/* Size chosen so that the table is one page (4kB) */
+#define EFX_ARFS_HASH_TABLE_SIZE	512
+
 /**
  * struct efx_async_filter_insertion - Request to asynchronously insert a filter
  * @net_dev: Reference to the netdevice
@@ -873,6 +902,10 @@ struct efx_async_filter_insertion {
  *	@rps_expire_channel's @rps_flow_id
  * @rps_slot_map: bitmap of in-flight entries in @rps_slot
  * @rps_slot: array of ARFS insertion requests for efx_filter_rfs_work()
+ * @rps_hash_lock: Protects ARFS filter mapping state (@rps_hash_table and
+ *	@rps_next_id).
+ * @rps_hash_table: Mapping between ARFS filters and their various IDs
+ * @rps_next_id: next arfs_id for an ARFS filter
  * @active_queues: Count of RX and TX queues that haven't been flushed and drained.
  * @rxq_flush_pending: Count of number of receive queues that need to be flushed.
  *	Decremented when the efx_flush_rx_queue() is called.
@@ -1029,6 +1062,9 @@ struct efx_nic {
 	unsigned int rps_expire_index;
 	unsigned long rps_slot_map;
 	struct efx_async_filter_insertion rps_slot[EFX_RPS_MAX_IN_FLIGHT];
+	spinlock_t rps_hash_lock;
+	struct hlist_head *rps_hash_table;
+	u32 rps_next_id;
 #endif
 
 	atomic_t active_queues;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 9c593c661cbf..64a94f242027 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -834,9 +834,29 @@ static void efx_filter_rfs_work(struct work_struct *data)
 	struct efx_nic *efx = netdev_priv(req->net_dev);
 	struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
 	int slot_idx = req - efx->rps_slot;
+	struct efx_arfs_rule *rule;
+	u16 arfs_id = 0;
 	int rc;
 
 	rc = efx->type->filter_insert(efx, &req->spec, true);
+	if (efx->rps_hash_table) {
+		spin_lock_bh(&efx->rps_hash_lock);
+		rule = efx_rps_hash_find(efx, &req->spec);
+		/* The rule might have already gone, if someone else's request
+		 * for the same spec was already worked and then expired before
+		 * we got around to our work.  In that case we have nothing
+		 * tying us to an arfs_id, meaning that as soon as the filter
+		 * is considered for expiry it will be removed.
+		 */
+		if (rule) {
+			if (rc < 0)
+				rule->filter_id = EFX_ARFS_FILTER_ID_ERROR;
+			else
+				rule->filter_id = rc;
+			arfs_id = rule->arfs_id;
+		}
+		spin_unlock_bh(&efx->rps_hash_lock);
+	}
 	if (rc >= 0) {
 		/* Remember this so we can check whether to expire the filter
 		 * later.
@@ -848,18 +868,18 @@ static void efx_filter_rfs_work(struct work_struct *data)
 
 		if (req->spec.ether_type == htons(ETH_P_IP))
 			netif_info(efx, rx_status, efx->net_dev,
-				   "steering %s %pI4:%u:%pI4:%u to queue %u [flow %u filter %d]\n",
+				   "steering %s %pI4:%u:%pI4:%u to queue %u [flow %u filter %d id %u]\n",
 				   (req->spec.ip_proto == IPPROTO_TCP) ? "TCP" : "UDP",
 				   req->spec.rem_host, ntohs(req->spec.rem_port),
 				   req->spec.loc_host, ntohs(req->spec.loc_port),
-				   req->rxq_index, req->flow_id, rc);
+				   req->rxq_index, req->flow_id, rc, arfs_id);
 		else
 			netif_info(efx, rx_status, efx->net_dev,
-				   "steering %s [%pI6]:%u:[%pI6]:%u to queue %u [flow %u filter %d]\n",
+				   "steering %s [%pI6]:%u:[%pI6]:%u to queue %u [flow %u filter %d id %u]\n",
 				   (req->spec.ip_proto == IPPROTO_TCP) ? "TCP" : "UDP",
 				   req->spec.rem_host, ntohs(req->spec.rem_port),
 				   req->spec.loc_host, ntohs(req->spec.loc_port),
-				   req->rxq_index, req->flow_id, rc);
+				   req->rxq_index, req->flow_id, rc, arfs_id);
 	}
 
 	/* Release references */
@@ -872,8 +892,10 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_async_filter_insertion *req;
+	struct efx_arfs_rule *rule;
 	struct flow_keys fk;
 	int slot_idx;
+	bool new;
 	int rc;
 
 	/* find a free slot */
@@ -926,12 +948,42 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 	req->spec.rem_port = fk.ports.src;
 	req->spec.loc_port = fk.ports.dst;
 
+	if (efx->rps_hash_table) {
+		/* Add it to ARFS hash table */
+		spin_lock(&efx->rps_hash_lock);
+		rule = efx_rps_hash_add(efx, &req->spec, &new);
+		if (!rule) {
+			rc = -ENOMEM;
+			goto out_unlock;
+		}
+		if (new)
+			rule->arfs_id = efx->rps_next_id++ % RPS_NO_FILTER;
+		rc = rule->arfs_id;
+		/* Skip if existing or pending filter already does the right thing */
+		if (!new && rule->rxq_index == rxq_index &&
+		    rule->filter_id >= EFX_ARFS_FILTER_ID_PENDING)
+			goto out_unlock;
+		rule->rxq_index = rxq_index;
+		rule->filter_id = EFX_ARFS_FILTER_ID_PENDING;
+		spin_unlock(&efx->rps_hash_lock);
+	} else {
+		/* Without an ARFS hash table, we just use arfs_id 0 for all
+		 * filters.  This means if multiple flows hash to the same
+		 * flow_id, all but the most recently touched will be eligible
+		 * for expiry.
+		 */
+		rc = 0;
+	}
+
+	/* Queue the request */
 	dev_hold(req->net_dev = net_dev);
 	INIT_WORK(&req->work, efx_filter_rfs_work);
 	req->rxq_index = rxq_index;
 	req->flow_id = flow_id;
 	schedule_work(&req->work);
-	return 0;
+	return rc;
+out_unlock:
+	spin_unlock(&efx->rps_hash_lock);
 out_clear:
 	clear_bit(slot_idx, &efx->rps_slot_map);
 	return rc;

^ permalink raw reply related
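
The ID association described in the sfc patch above can be sketched in userspace C. This is only a minimal illustration under stated assumptions, not the driver's code: every `demo_*` and `DEMO_*` name is hypothetical, the key struct is a cut-down stand-in for struct efx_filter_spec, an FNV-1a hash replaces jhash2, locking is omitted, and the ID assignment that the patch performs in efx_filter_rfs() is folded into the find-or-add helper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sentinel ordering matters: "filter_id >= DEMO_ID_PENDING" means the
 * filter exists or will exist, mirroring the comment in the patch.
 */
#define DEMO_ID_PENDING  (-1)
#define DEMO_ID_ERROR    (-2)
#define DEMO_ID_REMOVING (-3)
#define DEMO_TABLE_SIZE  512

/* Hypothetical key; 12 bytes with no padding, so memcmp() is safe. */
struct demo_spec {
	uint32_t rem_host, loc_host;
	uint16_t rem_port, loc_port;
};

struct demo_rule {
	struct demo_rule *next;   /* chain within one bucket */
	struct demo_spec spec;    /* hash table key */
	uint16_t arfs_id;         /* ID handed back to the caller */
	int32_t filter_id;        /* table index, or a DEMO_ID_* sentinel */
};

struct demo_table {
	struct demo_rule *bucket[DEMO_TABLE_SIZE];
	uint32_t next_id;
};

/* FNV-1a over the raw key bytes (an assumption; the driver uses jhash2). */
static uint32_t demo_hash(const struct demo_spec *s)
{
	const unsigned char *p = (const unsigned char *)s;
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < sizeof(*s); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Return the rule for @spec, creating it if needed.  @is_new reports
 * whether a fresh entry (with a fresh arfs_id) was created.
 */
struct demo_rule *demo_find_or_add(struct demo_table *t,
				   const struct demo_spec *spec, bool *is_new)
{
	struct demo_rule **head = &t->bucket[demo_hash(spec) % DEMO_TABLE_SIZE];
	struct demo_rule *rule;

	for (rule = *head; rule; rule = rule->next) {
		if (memcmp(&rule->spec, spec, sizeof(*spec)) == 0) {
			*is_new = false;
			return rule;
		}
	}
	rule = malloc(sizeof(*rule));
	*is_new = true;
	if (rule) {
		rule->spec = *spec;
		rule->arfs_id = (uint16_t)(t->next_id++ % 0xffff);
		rule->filter_id = DEMO_ID_PENDING;
		rule->next = *head;
		*head = rule;
	}
	return rule;
}

/* True if a filter is installed or an insertion is still queued. */
bool demo_filter_exists_or_pending(const struct demo_rule *rule)
{
	return rule->filter_id >= DEMO_ID_PENDING;
}
```

A caller would look up the rule by spec at expiry time and pass its arfs_id to the expiry query, instead of the constant 0 used before the patch.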

* [PATCH] net: phy: allow scanning busses with missing phys
From: Alexandre Belloni @ 2018-04-24 16:09 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, David S . Miller, Allan Nielsen,
	Thomas Petazzoni, netdev, linux-kernel, Alexandre Belloni

Some MDIO busses will error out when trying to read a phy address with no
phy present at that address. In that case, probing the bus will fail
because __mdiobus_register() is scanning the bus for all possible phy
addresses.

In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
this address and set the phy ID to 0xffffffff which is then properly
handled in get_phy_device().

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
---
 drivers/net/phy/phy_device.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index ac23322a32e1..9e4ba8e80a18 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -535,8 +535,17 @@ static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id,
 
 	/* Grab the bits from PHYIR1, and put them in the upper half */
 	phy_reg = mdiobus_read(bus, addr, MII_PHYSID1);
-	if (phy_reg < 0)
+	if (phy_reg < 0) {
+		/* if there is no device, return without an error so scanning
+		 * the bus works properly
+		 */
+		if (phy_reg == -EIO || phy_reg == -ENODEV) {
+			*phy_id = 0xffffffff;
+			return 0;
+		}
+
 		return -EIO;
+	}
 
 	*phy_id = (phy_reg & 0xffff) << 16;
 
-- 
2.17.0

^ permalink raw reply related
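
The behaviour proposed in the phy patch above can be tried out in userspace with a mocked bus. This is a rough sketch, not the kernel function: `demo_get_phy_id()`, `demo_bus_read()`, the function-pointer type and the returned register values are all hypothetical, and only the register numbers 0x02 and 0x03 match the standard MII PHY identifier registers.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_MII_PHYSID1 0x02	/* PHY identifier, upper 16 bits */
#define DEMO_MII_PHYSID2 0x03	/* PHY identifier, lower 16 bits */

/* Hypothetical bus accessor: register value (>= 0) or negative errno. */
typedef int (*demo_mdio_read_fn)(int addr, int reg);

int demo_get_phy_id(demo_mdio_read_fn read, int addr, uint32_t *phy_id)
{
	int reg = read(addr, DEMO_MII_PHYSID1);

	if (reg < 0) {
		/* No device at this address: report the "no phy" ID and
		 * succeed, so that a bus scan can carry on.
		 */
		if (reg == -EIO || reg == -ENODEV) {
			*phy_id = 0xffffffff;
			return 0;
		}
		return -EIO;
	}
	*phy_id = ((uint32_t)reg & 0xffff) << 16;

	reg = read(addr, DEMO_MII_PHYSID2);
	if (reg < 0)
		return -EIO;
	*phy_id |= (uint32_t)reg & 0xffff;
	return 0;
}

/* Mock bus: a phy at address 1, a timeout at address 7, and an
 * erroring read (-EIO) everywhere else.
 */
int demo_bus_read(int addr, int reg)
{
	if (addr == 7)
		return -ETIMEDOUT;
	if (addr != 1)
		return -EIO;
	return reg == DEMO_MII_PHYSID1 ? 0x0022 : 0x1622;
}
```

With this mock, address 1 yields a real ID, empty addresses yield 0xffffffff without an error, and any other failure is still reported as -EIO.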

* Re: simplify procfs code for seq_file instances
From: Christoph Hellwig @ 2018-04-24 16:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-rtc-u79uwXL29TY76Z2rM5mHXA, Alessandro Zummo,
	Alexandre Belloni, devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b,
	linux-afs-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Corey Minyard,
	linux-ide-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jfs-discussion-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Alexey Dobriyan,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA, Alexander Viro,
	Jiri Slaby, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ
In-Reply-To: <20180424081916.e94ca8463fb3c39ebc082bdd-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

On Tue, Apr 24, 2018 at 08:19:16AM -0700, Andrew Morton wrote:
> > > I want to ask if it is time to start using poorman function overloading
> > > with _b_c_e(). There are millions of allocation functions for example,
> > > all slightly different, and people will add more. Seeing /proc interfaces
> > > doubled like this is painful.
> > 
> > Function overloading is totally unacceptable.
> > 
> > And I very much disagree with a tradeoff that keeps 5000 lines of 
> > code vs a few new helpers.
> 
> OK, the curiosity and suspense are killing me.  What the heck is
> "function overloading with _b_c_e()"?

The way I understood Alexey was to have a proc_create macro
that can take different ops types.  Although the short cut for
__builtin_types_compatible_p would be _b_t_c or similar, so maybe
I misunderstood him.
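
For readers unfamiliar with the idea, a macro that dispatches on the type of its ops argument could look roughly like the sketch below. It relies on the GCC/Clang builtins __builtin_types_compatible_p and __builtin_choose_expr (C only); every `demo_*` name is hypothetical and nothing here is an actual or proposed procfs interface.

```c
/* Two unrelated "ops" types standing in for different proc ops. */
struct demo_seq_ops { int unused; };
struct demo_single_ops { int unused; };

/* Each backend returns a distinct tag so the dispatch is observable. */
int demo_create_seq(const char *name, const struct demo_seq_ops *ops)
{
	(void)name;
	(void)ops;
	return 1;
}

int demo_create_single(const char *name, const struct demo_single_ops *ops)
{
	(void)name;
	(void)ops;
	return 2;
}

/* Pick the backend at compile time from the pointee type of @ops.
 * The cast through const void * keeps the branch that is not chosen
 * free of incompatible-pointer diagnostics, since the compiler still
 * parses and type-checks it.
 */
#define demo_create(name, ops)						\
	__builtin_choose_expr(						\
		__builtin_types_compatible_p(__typeof__(*(ops)),	\
					     struct demo_seq_ops),	\
		demo_create_seq((name), (const void *)(ops)),		\
		demo_create_single((name), (const void *)(ops)))
```

Note that anything that is not a `struct demo_seq_ops` pointer falls through to the second backend here; a real macro would want a compile-time check on the fallback type as well.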

^ permalink raw reply

* Re: [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-04-24 16:03 UTC (permalink / raw)
  To: Georgios Amanakis, Cake List, netdev
In-Reply-To: <CACvFP_i_1BS2th952+6JnY5_u5LVOhzGLV5cKXcWyr-pc-UTcg@mail.gmail.com>

Georgios Amanakis <gamanakis@gmail.com> writes:

> On Tue, Apr 24, 2018 at 11:47 AM, Georgios Amanakis <gamanakis@gmail.com> wrote:
>>>
>>> Does anyone know if there is a way to do this so the module/builtin
>>> split doesn't bite us?
>>>
>> #ifdef CONFIG_NF_CONNTRACK ??

That is basically what we're doing. But it looks like there's an
IS_REACHABLE macro which does what we need. Will fix, thanks for
pointing it out :)

-Toke
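
The module/builtin split mentioned above can be reproduced outside the kernel. The macros below are a renamed userspace adaptation of the preprocessor trick in include/linux/kconfig.h (the `DEMO_`/`demo_` prefixes and `CONFIG_DEMO_CONNTRACK` are made up for this sketch, so treat it as an approximation of the real macros). It models a built-in caller (MODULE undefined) with the option set to 'm'.

```c
/* Option configured as a module: only the _MODULE symbol is defined,
 * and MODULE is not defined because the "caller" is built in.
 */
#define CONFIG_DEMO_CONNTRACK_MODULE 1

#define DEMO_ARG_PLACEHOLDER_1 0,
#define demo_take_second_arg(ignored, val, ...) val

/* Evaluates to 1 if @x expands to 1, else 0. */
#define demo_is_defined(x)		demo_is_defined_(x)
#define demo_is_defined_(val)		demo_is_defined__(DEMO_ARG_PLACEHOLDER_##val)
#define demo_is_defined__(junk)		demo_take_second_arg(junk 1, 0)

#define demo_and(x, y)			demo_and_(x, y)
#define demo_and_(x, y)			demo_and__(DEMO_ARG_PLACEHOLDER_##x, y)
#define demo_and__(junk, y)		demo_take_second_arg(junk y, 0)

#define demo_or(x, y)			demo_or_(x, y)
#define demo_or_(x, y)			demo_or__(DEMO_ARG_PLACEHOLDER_##x, y)
#define demo_or__(junk, y)		demo_take_second_arg(junk 1, y)

#define DEMO_IS_BUILTIN(option)	demo_is_defined(option)
#define DEMO_IS_MODULE(option)	demo_is_defined(option##_MODULE)

/* Enabled at all: built in or modular. */
#define DEMO_IS_ENABLED(option) \
	demo_or(DEMO_IS_BUILTIN(option), DEMO_IS_MODULE(option))

/* Callable from this code: built in, or modular while we are a module. */
#define DEMO_IS_REACHABLE(option) \
	demo_or(DEMO_IS_BUILTIN(option), \
		demo_and(DEMO_IS_MODULE(option), demo_is_defined(MODULE)))

int demo_conntrack_enabled(void)
{
	return DEMO_IS_ENABLED(CONFIG_DEMO_CONNTRACK);
}

int demo_conntrack_reachable(void)
{
	return DEMO_IS_REACHABLE(CONFIG_DEMO_CONNTRACK);
}
```

Here the option counts as enabled but not reachable, which is the case where a plain #ifdef or IS_ENABLED() test in built-in code would end in an unresolved symbol at link time.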

^ permalink raw reply

* Re: [PATCH net-next] Revert "net: init sk_cookie for inet socket"
From: Yafang Shao @ 2018-04-24 15:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <26914c29-d248-5197-9e3c-fc44a1f5a1a8@gmail.com>

On Tue, Apr 24, 2018 at 11:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 04/24/2018 08:12 AM, Yafang Shao wrote:
>> On Tue, Apr 24, 2018 at 8:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>
>>>
>>> On 04/24/2018 05:05 AM, Yafang Shao wrote:
>>>> This reverts commit <c6849a3ac17e> ("net: init sk_cookie for inet socket")
>>>>
>>>> Per discussion with Eric.
>>>>
>>>
>>> I suggest you include a bit more details, about cache line false sharing.
>>>
>>
>> Could we adjust the struct common to avoid this kind of cache line
>> false sharing?
>> I mean removing "atomic64_t  skc_cookie;" from struct sock_common and
>> placing it in struct inet_sock?
>
> The false sharing is not there, it is on net->cookie_gen
>

Yes.
This is the current issue.
Maybe we should adjust struct net as well.

Regarding sk_cookie, as it is only used by inet_sock now, maybe it is
better placed in struct inet_sock?

Thanks
Yafang

^ permalink raw reply

* Re: [PATCH net-next 2/2] net/ipv6: Fix missing rcu dereferences on from
From: Eric Dumazet @ 2018-04-24 15:56 UTC (permalink / raw)
  To: David Ahern, netdev
In-Reply-To: <ca16fd6f-f694-8a0f-e19e-26fd54b7f978@gmail.com>



On 04/24/2018 08:54 AM, Eric Dumazet wrote:
> 
> 
> On 04/23/2018 11:32 AM, David Ahern wrote:
>> kbuild test robot reported 2 uses of rt->from not properly accessed
>> using rcu_dereference:
>> 1. add rcu_dereference_protected to rt6_remove_exception_rt and make
>>    sure it is always called with rcu lock held.
>>
>> 2. change rt6_do_redirect to take a reference on 'from' when accessed
>>    the first time so it can be used the second time outside of the lock
>>
>> Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected")
>> Reported-by: kbuild test robot <lkp@intel.com>
>> Signed-off-by: David Ahern <dsahern@gmail.com>
>> ---
>>  net/ipv6/route.c | 15 ++++++++++-----
>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 354a5b8d016f..ac3e51631c65 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1541,11 +1541,13 @@ static struct rt6_info *rt6_find_cached_rt(struct fib6_info *rt,
>>  static int rt6_remove_exception_rt(struct rt6_info *rt)
>>  {
>>  	struct rt6_exception_bucket *bucket;
>> -	struct fib6_info *from = rt->from;
>>  	struct in6_addr *src_key = NULL;
>>  	struct rt6_exception *rt6_ex;
>> +	struct fib6_info *from;
>>  	int err;
>>  
>> +	from = rcu_dereference_protected(rt->from,
>> +					 lockdep_is_held(&rt6_exception_lock));
> 
> This does not make any sense.
> 
> We lock rt6_exception_lock a bit later in this function (line 1558)
> 
> If we really were holding rt6_exception_lock here we would dead lock.

I will send this fix :

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ac3e51631c659b5c5c8a93c17011cb7f3ad266e2..432c4bcc1111085671f32987e4673e47898085a3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1546,8 +1546,7 @@ static int rt6_remove_exception_rt(struct rt6_info *rt)
        struct fib6_info *from;
        int err;
 
-       from = rcu_dereference_protected(rt->from,
-                                        lockdep_is_held(&rt6_exception_lock));
+       from = rcu_dereference(rt->from);
        if (!from ||
            !(rt->rt6i_flags & RTF_CACHE))
                return -EINVAL;

^ permalink raw reply related
