Netdev List
 help / color / mirror / Atom feed
* [PATCH 3/4] net-next: dsa: b53: remove empty set_addr() stub
From: John Crispin @ 2016-09-19 13:28 UTC (permalink / raw)
  To: David S. Miller, Andrew Lunn, Florian Fainelli
  Cc: netdev, linux-kernel, John Crispin
In-Reply-To: <1474291683-44167-1-git-send-email-john@phrozen.org>

The set_addr() callback is now optional. Remove the empty stub that b53
has.

Signed-off-by: John Crispin <john@phrozen.org>
---
 drivers/net/dsa/b53/b53_common.c |    6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 0afc2e5..1a492c0 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -764,11 +764,6 @@ static int b53_get_sset_count(struct dsa_switch *ds)
 	return b53_get_mib_size(dev);
 }
 
-static int b53_set_addr(struct dsa_switch *ds, u8 *addr)
-{
-	return 0;
-}
-
 static int b53_setup(struct dsa_switch *ds)
 {
 	struct b53_device *dev = ds->priv;
@@ -1466,7 +1461,6 @@ static enum dsa_tag_protocol b53_get_tag_protocol(struct dsa_switch *ds)
 static struct dsa_switch_ops b53_switch_ops = {
 	.get_tag_protocol	= b53_get_tag_protocol,
 	.setup			= b53_setup,
-	.set_addr		= b53_set_addr,
 	.get_strings		= b53_get_strings,
 	.get_ethtool_stats	= b53_get_ethtool_stats,
 	.get_sset_count		= b53_get_sset_count,
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 0/4] net-next: dsa: set_addr should be optional
From: John Crispin @ 2016-09-19 13:27 UTC (permalink / raw)
  To: David S. Miller, Andrew Lunn, Florian Fainelli
  Cc: netdev, linux-kernel, John Crispin

The Marvell driver is the only one that actually sets the switches HW
address. All other drivers have an empty stub. fix this by making the
callback optional.

John Crispin (4):
  net-next: dsa: fix duplicate invocation of set_addr()
  net-next: dsa: make the set_addr() operation optional
  net-next: dsa: b53: remove empty set_addr() stub
  net-next: dsa: qca8k: remove empty set_addr() stub

 drivers/net/dsa/b53/b53_common.c |    6 ------
 drivers/net/dsa/qca8k.c          |    8 --------
 net/dsa/dsa.c                    |    8 +++++---
 net/dsa/dsa2.c                   |   12 +++++-------
 4 files changed, 10 insertions(+), 24 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* RE: [PATCH net] qed: Fix stack corruption on probe
From: David Laight @ 2016-09-19 13:19 UTC (permalink / raw)
  To: 'Yuval Mintz', davem@davemloft.net,
	netdev@vger.kernel.org
  Cc: Yuval Mintz
In-Reply-To: <1474186483-30831-1-git-send-email-Yuval.Mintz@qlogic.com>

From: Yuval Mintz
> Sent: 18 September 2016 09:15
> Commit fe56b9e6a8d95 ("qed: Add module with basic common support")
> has introduced a stack corruption during probe, where filling a
> local struct with data to be sent to management firmware is incorrectly
> filled; The data is written outside of the struct and corrupts
> the stack.
> 
> Fixes: fe56b9e6a8d95 ("qed: Add module with basic common support")
> Signed-off-by: Yuval Mintz <Yuval.Mintz@caviumnetworks.com>
> ---
> Hi Dave,
> 
> In case it isn't obvious at first glance, the corruption is due
> to the next line in the for-loop, which isn't changed by the patch.
> 
> Please consider applying this to `net'.
> 
> Thanks,
> Yuval
> ---
>  drivers/net/ethernet/qlogic/qed/qed_mcp.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> index a240f26..69f5b04 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> @@ -1153,8 +1153,8 @@ qed_mcp_send_drv_version(struct qed_hwfn *p_hwfn,
>  	p_drv_version = &union_data.drv_version;
>  	p_drv_version->version = p_ver->version;
> 
> -	for (i = 0; i < MCP_DRV_VER_STR_SIZE - 1; i += 4) {
> -		val = cpu_to_be32(p_ver->name[i]);
> +	for (i = 0; i < (MCP_DRV_VER_STR_SIZE - 4) / sizeof(u32); i++) {
> +		val = cpu_to_be32(p_ver->name[i * sizeof(u32)]);
>  		*(__be32 *)&p_drv_version->name[i * sizeof(u32)] = val;
>  	}

That change fails the 'sanity' test.
It looks like it is copying a string that might by byteswapped into host order.
I'm guessing that the target p_drv_version->name[] is char [].
The cpu_to_be32() call is only correct if p_ver->name[] is 32bit.
But you are (now) indexing it with the character offset.
So the data copied cannot be right.

I also suspect it should be be32_to_cpu().

	David

^ permalink raw reply

* Re: [PATCH] net: phy: Ensure the state machine is called when phy is UP
From: Nicolas Ferre @ 2016-09-19 13:15 UTC (permalink / raw)
  To: Alexandre Belloni, Andrew Lunn, Florian Fainelli, netdev
  Cc: David S . Miller, linux-kernel
In-Reply-To: <20160415224559.GH25196@piout.net>

Hi all,

I come back to this thread to re-start the conversation as I still have
the issue...


Le 16/04/2016 à 00:45, Alexandre Belloni a écrit :
> On 16/04/2016 at 00:30:26 +0200, Andrew Lunn wrote :
>> On Sat, Apr 16, 2016 at 12:17:11AM +0200, Alexandre Belloni wrote:
>>> On 16/04/2016 at 00:05:08 +0200, Andrew Lunn wrote :
>>>>> Trace without my patch:
>>>>> libphy: MACB_mii_bus: probed
>>>>> macb f8020000.ethernet eth0: Cadence GEM rev 0x00020120 at 0xf8020000 irq 27 (fc:c2:3d:0c:6e:05)
>>>>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: attached PHY driver [Micrel KSZ8081 or KSZ8091] (mii_bus:phy_addr=f8020000.etherne:01, irq=171)
>>>>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: PHY state change READY -> READY
>>>>> [...]
>>>>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: PHY state change READY -> READY
>>>>
>>>> Are there some state changes before this? How is it getting to state
>>>> READY? It would expect it to start in DOWN, from when the phy device
>>>> was created in phy_device_create().
>>>>
>>>
>>> No other changes. I forgot to mention that this is when booting with a
>>> cable plugged in. Unplugging and replugging the cable makes the link
>>> detection work fine even without the patch.
>>
>> Are you tftpbooting? I.e. has the boot loader already done an auto
>> negotiation?
>>
> 
> Yes.

Yes indeed: this is my use-case: load the kernel from U-Boot using tftp
and having the rootfs in NAND flash so, no NFS rootfs for me.

>> I've looked at the code and i still don't see how it gets to READY.
>> What i do see is that when you connect the phy to the MAC, the
>> interrupt handler is installed. So maybe there are some PHY interrupts
>> before the interface is opened? Could you put a print in
>> phy_interrupt().
>>
> 
> That is indeed the case, and I'm not sure why because
> 99f81afc139c6edd14d77a91ee91685a414a1c66 is trying to disable AN at
> boot.

I don't know what happens to the phy, but this patch does fix the issue
for me:
Tested-by: Nicolas Ferre <nicolas.ferre@atmel.com>

The other alternative that I'm considering seriously as I'm still
struggled with this is to simply remove the phy IRQ from my board DT.

Best regards,
-- 
Nicolas Ferre

^ permalink raw reply

* Re: [PATCH] net: skbuff: Fix length validation in skb_vlan_pop()
From: Shmulik Ladkani @ 2016-09-19 13:05 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: pravin shelar, Jiri Pirko, David S . Miller,
	Linux Kernel Network Developers
In-Reply-To: <57DFD8A1.1090103@iogearbox.net>

On Mon, 19 Sep 2016 14:22:57 +0200, daniel@iogearbox.net wrote:
> On 09/19/2016 08:15 AM, Shmulik Ladkani wrote:
> > On Sun, 18 Sep 2016 13:26:30 -0700, pshelar@ovn.org wrote:
> >> On Sun, Sep 18, 2016 at 3:09 AM, Shmulik Ladkani
> >> <shmulik.ladkani@gmail.com> wrote:
> >>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>> index 1e329d4112..cc2c004838 100644
> >>> --- a/net/core/skbuff.c
> >>> +++ b/net/core/skbuff.c
> >>> @@ -4537,7 +4537,7 @@ int skb_vlan_pop(struct sk_buff *skb)
> >>>          } else {
> >>>                  if (unlikely((skb->protocol != htons(ETH_P_8021Q) &&
> >>>                                skb->protocol != htons(ETH_P_8021AD)) ||
> >>> -                            skb->len < VLAN_ETH_HLEN))
> >>> +                            skb->mac_len < VLAN_ETH_HLEN))
> >>
> >> There is already check in __skb_vlan_pop() to validate skb for a vlan
> >> header. So it is safe to drop this check entirely.
> >
> > Seems validation in '__skb_vlan_pop' has slightly different semantics:
> >
> > 	unsigned int offset = skb->data - skb_mac_header(skb);
> >
> > 	__skb_push(skb, offset);
> > 	err = skb_ensure_writable(skb, VLAN_ETH_HLEN);
> >
> > this pushes 'data' back to mac_header, then makes sure there's sufficient
> > place in skb to _store_ VLAN_ETH_HLEN bytes (by pulling into linear part
> > if needed, or erroring if skb is too small).
> 
> Yes, but this skb_ensure_writable() is needed for doing the memmove anyway.

Had no intention dropping the skb_ensure_writable from __skb_vlan_pop :)

> > There's no guarantee the original mac header was sized VLAN_ETH_HLEN.
> 
> I'm wondering, what happens when you'd call this on tx path, when you'd
> change that to suggested skb->mac_len? Isn't that 0 in such case, thus
> such setups could fail then?

Thanks, I think you're right.

Seems __dev_queue_xmit needs a 'skb_reset_mac_len' right after call to
'skb_reset_mac_header' (or maybe call 'skb_reset_mac_len' from within
skb_reset_mac_header?).

Also, I'm okay with removing the excess 'skb->mac_len < VLAN_ETH_HLEN'
condition if it is agreed that the "tag exists but insufficient to pop"
semantic is no longer wanted.

Regards,
Shmulik

^ permalink raw reply

* [PATCHv2] sunrpc: fix write space race causing stalls
From: David Vrabel @ 2016-09-19 12:58 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA
  Cc: David Vrabel, Trond Myklebust, Anna Schumaker,
	netdev-u79uwXL29TY76Z2rM5mHXA

Write space becoming available may race with putting the task to sleep
in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
race does not work.

This (edited) partial trace illustrates the problem:

   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
   [2] xs_write_space <-xs_tcp_write_space
   [3] xprt_write_space <-xs_write_space
   [4] rpc_task_sleep: task:43546@5 ...
   [5] xs_write_space <-xs_tcp_write_space

[1] Task 43546 runs but is out of write space.

[2] Space becomes available, xs_write_space() clears the
    SOCKWQ_ASYNC_NOSPACE bit.

[3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
    this has not yet been queued and the wake up is lost.

[4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
    which queues task 43546.

[5] The call to sk->sk_write_space() at the end of xs_nospace() (which
    is supposed to handle the above race) does not call
    xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
    thus the task is not woken.

Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
so the second call to sk->sk_write_space() calls xprt_write_space().

Suggested-by: Trond Myklebust <trondmy-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>
Signed-off-by: David Vrabel <david.vrabel-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtsock.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index bf16883..e72581d 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -473,7 +473,16 @@ static int xs_nospace(struct rpc_task *task)
 	spin_unlock_bh(&xprt->transport_lock);
 
 	/* Race breaker in case memory is freed before above code is called */
-	sk->sk_write_space(sk);
+	if (ret == -EAGAIN) {
+		struct socket_wq *wq;
+
+		rcu_read_lock();
+		wq = rcu_dereference(sk->sk_wq);
+		set_bit(SOCKWQ_ASYNC_NOSPACE, &wq->flags);
+		rcu_read_unlock();
+
+		sk->sk_write_space(sk);
+	}
 	return ret;
 }
 
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 1/2] ptp_clock: allow for it to be optional
From: Eugenia Emantayev @ 2016-09-19 12:25 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: John Stultz, Thomas Gleixner, Richard Cochran, Josh Triplett,
	netdev, linux-kernel
In-Reply-To: <1474257070-4255-2-git-send-email-nicolas.pitre@linaro.org>

Reviewed-by: Eugenia Emantayev <eugenia@mellanox.com>

^ permalink raw reply

* Re: [PATCH 1/2] ptp_clock: allow for it to be optional
From: Jiri Benc @ 2016-09-19 12:25 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: John Stultz, Thomas Gleixner, Richard Cochran, Josh Triplett,
	netdev, linux-kernel
In-Reply-To: <1474257070-4255-2-git-send-email-nicolas.pitre@linaro.org>

On Sun, 18 Sep 2016 23:51:09 -0400, Nicolas Pitre wrote:
> And to make it possible for PTP to be configured out, the select statement
> in the Kconfig entry for those ethernet drivers is changed from selecting
> PTP_1588_CLOCK to PTP_1588_CLOCK_SELECTED whose purpose is to indicate the
> default Kconfig value for the PTP subsystem.

With this patch applied, the user is free to set a NIC driver as built
in and PTP_1588_CLOCK as a module, right? If so, that would lead to
non-functional PTP without any warning due to the use of IS_REACHABLE.
That doesn't sound right. Could easily cause hours of headache to
someone.

Or is this handled somehow?

Thanks,

 Jiri

^ permalink raw reply

* Re: [PATCH] net: skbuff: Fix length validation in skb_vlan_pop()
From: Daniel Borkmann @ 2016-09-19 12:22 UTC (permalink / raw)
  To: Shmulik Ladkani, pravin shelar
  Cc: Jiri Pirko, David S . Miller, Linux Kernel Network Developers
In-Reply-To: <20160919091538.56defd2b@pixies>

On 09/19/2016 08:15 AM, Shmulik Ladkani wrote:
> On Sun, 18 Sep 2016 13:26:30 -0700, pshelar@ovn.org wrote:
>> On Sun, Sep 18, 2016 at 3:09 AM, Shmulik Ladkani
>> <shmulik.ladkani@gmail.com> wrote:
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index 1e329d4112..cc2c004838 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -4537,7 +4537,7 @@ int skb_vlan_pop(struct sk_buff *skb)
>>>          } else {
>>>                  if (unlikely((skb->protocol != htons(ETH_P_8021Q) &&
>>>                                skb->protocol != htons(ETH_P_8021AD)) ||
>>> -                            skb->len < VLAN_ETH_HLEN))
>>> +                            skb->mac_len < VLAN_ETH_HLEN))
>>
>> There is already check in __skb_vlan_pop() to validate skb for a vlan
>> header. So it is safe to drop this check entirely.
>
> Seems validation in '__skb_vlan_pop' has slightly different semantics:
>
> 	unsigned int offset = skb->data - skb_mac_header(skb);
>
> 	__skb_push(skb, offset);
> 	err = skb_ensure_writable(skb, VLAN_ETH_HLEN);
>
> this pushes 'data' back to mac_header, then makes sure there's sufficient
> place in skb to _store_ VLAN_ETH_HLEN bytes (by pulling into linear part
> if needed, or erroring if skb is too small).

Yes, but this skb_ensure_writable() is needed for doing the memmove anyway.

> There's no guarantee the original mac header was sized VLAN_ETH_HLEN.

I'm wondering, what happens when you'd call this on tx path, when you'd
change that to suggested skb->mac_len? Isn't that 0 in such case, thus
such setups could fail then?

^ permalink raw reply

* Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking
From: Michal Hocko @ 2016-09-19 12:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, David S. Miller, linux-mm, cgroups,
	netdev, linux-kernel, kernel-team, Vladimir Davydov
In-Reply-To: <20160914194846.11153-3-hannes@cmpxchg.org>

[Fixup Vladimir's email]

same here I do not feel familiar with the code enough to give my ack but
Vladimir might be in a better position

On Wed 14-09-16 15:48:46, Johannes Weiner wrote:
> The cgroup core and the memory controller need to track socket
> ownership for different purposes, but the tracking sites being
> entirely different is kind of ugly.
> 
> Be a better citizen and rename the memory controller callbacks to
> match the cgroup core callbacks, then move them to the same place.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |  4 ++--
>  mm/memcontrol.c            | 19 +++++++++++--------
>  net/core/sock.c            |  6 +++---
>  net/ipv4/tcp.c             |  2 --
>  net/ipv4/tcp_ipv4.c        |  3 ---
>  5 files changed, 16 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0710143723bc..ca11b3e6dd65 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -773,8 +773,8 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
>  #endif	/* CONFIG_CGROUP_WRITEBACK */
>  
>  struct sock;
> -void sock_update_memcg(struct sock *sk);
> -void sock_release_memcg(struct sock *sk);
> +void mem_cgroup_sk_alloc(struct sock *sk);
> +void mem_cgroup_sk_free(struct sock *sk);
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
>  void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
>  #ifdef CONFIG_MEMCG
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 60bb830abc34..2caf1ee86e78 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2939,7 +2939,7 @@ static int memcg_update_tcp_limit(struct mem_cgroup *memcg, unsigned long limit)
>  		/*
>  		 * The active flag needs to be written after the static_key
>  		 * update. This is what guarantees that the socket activation
> -		 * function is the last one to run. See sock_update_memcg() for
> +		 * function is the last one to run. See mem_cgroup_sk_alloc() for
>  		 * details, and note that we don't mark any socket as belonging
>  		 * to this memcg until that flag is up.
>  		 *
> @@ -2948,7 +2948,7 @@ static int memcg_update_tcp_limit(struct mem_cgroup *memcg, unsigned long limit)
>  		 * as accounted, but the accounting functions are not patched in
>  		 * yet, we'll lose accounting.
>  		 *
> -		 * We never race with the readers in sock_update_memcg(),
> +		 * We never race with the readers in mem_cgroup_sk_alloc(),
>  		 * because when this value change, the code to process it is not
>  		 * patched in yet.
>  		 */
> @@ -5651,11 +5651,15 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
>  DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
>  EXPORT_SYMBOL(memcg_sockets_enabled_key);
>  
> -void sock_update_memcg(struct sock *sk)
> +void mem_cgroup_sk_alloc(struct sock *sk)
>  {
>  	struct mem_cgroup *memcg;
>  
> -	/* Socket cloning can throw us here with sk_cgrp already
> +	if (!mem_cgroup_sockets_enabled)
> +		return;
> +
> +	/*
> +	 * Socket cloning can throw us here with sk_memcg already
>  	 * filled. It won't however, necessarily happen from
>  	 * process context. So the test for root memcg given
>  	 * the current task's memcg won't help us in this case.
> @@ -5680,12 +5684,11 @@ void sock_update_memcg(struct sock *sk)
>  out:
>  	rcu_read_unlock();
>  }
> -EXPORT_SYMBOL(sock_update_memcg);
>  
> -void sock_release_memcg(struct sock *sk)
> +void mem_cgroup_sk_free(struct sock *sk)
>  {
> -	WARN_ON(!sk->sk_memcg);
> -	css_put(&sk->sk_memcg->css);
> +	if (sk->sk_memcg)
> +		css_put(&sk->sk_memcg->css);
>  }
>  
>  /**
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 038e660ef844..c73e28fc9c2a 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1363,6 +1363,7 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
>  	slab = prot->slab;
>  
>  	cgroup_sk_free(&sk->sk_cgrp_data);
> +	mem_cgroup_sk_free(sk);
>  	security_sk_free(sk);
>  	if (slab != NULL)
>  		kmem_cache_free(slab, sk);
> @@ -1399,6 +1400,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  		sock_net_set(sk, net);
>  		atomic_set(&sk->sk_wmem_alloc, 1);
>  
> +		mem_cgroup_sk_alloc(sk);
>  		cgroup_sk_alloc(&sk->sk_cgrp_data);
>  		sock_update_classid(&sk->sk_cgrp_data);
>  		sock_update_netprioidx(&sk->sk_cgrp_data);
> @@ -1545,6 +1547,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  		newsk->sk_incoming_cpu = raw_smp_processor_id();
>  		atomic64_set(&newsk->sk_cookie, 0);
>  
> +		mem_cgroup_sk_alloc(newsk);
>  		cgroup_sk_alloc(&newsk->sk_cgrp_data);
>  
>  		/*
> @@ -1569,9 +1572,6 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  		sk_set_socket(newsk, NULL);
>  		newsk->sk_wq = NULL;
>  
> -		if (mem_cgroup_sockets_enabled && sk->sk_memcg)
> -			sock_update_memcg(newsk);
> -
>  		if (newsk->sk_prot->sockets_allocated)
>  			sk_sockets_allocated_inc(newsk);
>  
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index a13fcb369f52..fc76ef51a5f4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -421,8 +421,6 @@ void tcp_init_sock(struct sock *sk)
>  	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>  
>  	local_bh_disable();
> -	if (mem_cgroup_sockets_enabled)
> -		sock_update_memcg(sk);
>  	sk_sockets_allocated_inc(sk);
>  	local_bh_enable();
>  }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 04b989328558..b8fc74a66299 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1872,9 +1872,6 @@ void tcp_v4_destroy_sock(struct sock *sk)
>  	local_bh_disable();
>  	sk_sockets_allocated_dec(sk);
>  	local_bh_enable();
> -
> -	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
> -		sock_release_memcg(sk);
>  }
>  EXPORT_SYMBOL(tcp_v4_destroy_sock);
>  
> -- 
> 2.9.3

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 2/3] cgroup: duplicate cgroup reference when cloning sockets
From: Michal Hocko @ 2016-09-19 12:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, David S. Miller, linux-mm, cgroups,
	netdev, linux-kernel, kernel-team, Vladimir Davydov
In-Reply-To: <20160914194846.11153-2-hannes@cmpxchg.org>

[Fixup Vladimir's email]

I am not familiar with this code path to give my ack, unfortunatelly.

On Wed 14-09-16 15:48:45, Johannes Weiner wrote:
> From: Johannes Weiner <jweiner@fb.com>
> 
> When a socket is cloned, the associated sock_cgroup_data is duplicated
> but not its reference on the cgroup. As a result, the cgroup reference
> count will underflow when both sockets are destroyed later on.
> 
> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
> Cc: <stable@vger.kernel.org> # 4.5+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  kernel/cgroup.c | 6 ++++++
>  net/core/sock.c | 5 ++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 0c4db7908264..b0d727d26fc7 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -6297,6 +6297,12 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>  	if (cgroup_sk_alloc_disabled)
>  		return;
>  
> +	/* Socket clone path */
> +	if (skcd->val) {
> +		cgroup_get(sock_cgroup_ptr(skcd));
> +		return;
> +	}
> +
>  	rcu_read_lock();
>  
>  	while (true) {
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 51a730485649..038e660ef844 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1340,7 +1340,6 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
>  		if (!try_module_get(prot->owner))
>  			goto out_free_sec;
>  		sk_tx_queue_clear(sk);
> -		cgroup_sk_alloc(&sk->sk_cgrp_data);
>  	}
>  
>  	return sk;
> @@ -1400,6 +1399,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  		sock_net_set(sk, net);
>  		atomic_set(&sk->sk_wmem_alloc, 1);
>  
> +		cgroup_sk_alloc(&sk->sk_cgrp_data);
>  		sock_update_classid(&sk->sk_cgrp_data);
>  		sock_update_netprioidx(&sk->sk_cgrp_data);
>  	}
> @@ -1544,6 +1544,9 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  		newsk->sk_priority = 0;
>  		newsk->sk_incoming_cpu = raw_smp_processor_id();
>  		atomic64_set(&newsk->sk_cookie, 0);
> +
> +		cgroup_sk_alloc(&newsk->sk_cgrp_data);
> +
>  		/*
>  		 * Before updating sk_refcnt, we must commit prior changes to memory
>  		 * (Documentation/RCU/rculist_nulls.txt for details)
> -- 
> 2.9.3

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 1/3] mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting
From: Michal Hocko @ 2016-09-19 12:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, David S. Miller, linux-mm, cgroups,
	netdev, linux-kernel, kernel-team, Vladimir Davydov
In-Reply-To: <20160914194846.11153-1-hannes@cmpxchg.org>

[Fixup Vladimir's email]

On Wed 14-09-16 15:48:44, Johannes Weiner wrote:
> From: Johannes Weiner <jweiner@fb.com>
> 
> During cgroup2 rollout into production, we started encountering css
> refcount underflows and css access crashes in the memory controller.
> Splitting the heavily shared css reference counter into logical users
> narrowed the imbalance down to the cgroup2 socket memory accounting.
> 
> The problem turns out to be the per-cpu charge cache. Cgroup1 had a
> separate socket counter, but the new cgroup2 socket accounting goes
> through the common charge path that uses a shared per-cpu cache for
> all memory that is being tracked. Those caches are safe against
> scheduling preemption, but not against interrupts - such as the newly
> added packet receive path. When cache draining is interrupted by
> network RX taking pages out of the cache, the resuming drain operation
> will put references of in-use pages, thus causing the imbalance.
> 
> Disable IRQs during all per-cpu charge cache operations.
> 
> Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
> Cc: <stable@vger.kernel.org> # 4.5+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 31 ++++++++++++++++++++++---------
>  1 file changed, 22 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7a8d6624758a..60bb830abc34 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1710,17 +1710,22 @@ static DEFINE_MUTEX(percpu_charge_mutex);
>  static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
>  	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
>  	bool ret = false;
>  
>  	if (nr_pages > CHARGE_BATCH)
>  		return ret;
>  
> -	stock = &get_cpu_var(memcg_stock);
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
>  	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
>  		stock->nr_pages -= nr_pages;
>  		ret = true;
>  	}
> -	put_cpu_var(memcg_stock);
> +
> +	local_irq_restore(flags);
> +
>  	return ret;
>  }
>  
> @@ -1741,15 +1746,18 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>  	stock->cached = NULL;
>  }
>  
> -/*
> - * This must be called under preempt disabled or must be called by
> - * a thread which is pinned to local cpu.
> - */
>  static void drain_local_stock(struct work_struct *dummy)
>  {
> -	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
>  	drain_stock(stock);
>  	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> +
> +	local_irq_restore(flags);
>  }
>  
>  /*
> @@ -1758,14 +1766,19 @@ static void drain_local_stock(struct work_struct *dummy)
>   */
>  static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> -	struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
>  
> +	stock = this_cpu_ptr(&memcg_stock);
>  	if (stock->cached != memcg) { /* reset if necessary */
>  		drain_stock(stock);
>  		stock->cached = memcg;
>  	}
>  	stock->nr_pages += nr_pages;
> -	put_cpu_var(memcg_stock);
> +
> +	local_irq_restore(flags);
>  }
>  
>  /*
> -- 
> 2.9.3

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH net-next 2/3] i40e: fix MSI-X vector redistribution if hw limit is reached
From: Stefan Assmann @ 2016-09-19 11:37 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: netdev, davem, jeffrey.t.kirsher, sassmann
In-Reply-To: <1474285071-28218-1-git-send-email-sassmann@kpanic.de>

The driver allocates 1 vector per CPU thread and the current hardware
limit for vectors is 129 per PF. On systems with 128 or more threads
this currently means all vectors are used by the PF leaving no room for
additional features like VMDq, iWARP, etc...
The code that should redistribute the vectors in this case is broken and
never triggers. Fixed the code so that it actually triggers if the
hardware limit is reached and adjust the number of queue pairs
accordingly.
Also the number of initially requested iWARP vectors was not properly
saved when the vector limit was reached, and therefore always zero.

Comparison with debug statement.
Before:
i40e 0000:2d:00.0: VMDq disabled, not enough MSI-X vectors
i40e 0000:2d:00.0: IWARP disabled, not enough MSI-X vectors
i40e 00.0 MSI-X vector distribution: PF 128, VMDq 0, FDSB 0, iWARP 0
After:
i40e 0000:2d:00.0: MSI-X vector limit reached, attempting to redistribute vectors
i40e 00.0 MSI-X vector distribution: PF 78, VMDq 8, FDSB 0, iWARP 42

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 38 +++++++++++++++++------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5f3f1c8..4727c02 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7665,6 +7665,8 @@ static int i40e_init_msix(struct i40e_pf *pf)
 #endif
 	/* can we reserve enough for iWARP? */
 	if (pf->flags & I40E_FLAG_IWARP_ENABLED) {
+		iwarp_requested = pf->num_iwarp_msix;
+
 		if (!vectors_left)
 			pf->num_iwarp_msix = 0;
 		else if (vectors_left < pf->num_iwarp_msix)
@@ -7706,21 +7708,6 @@ static int i40e_init_msix(struct i40e_pf *pf)
 		pf->msix_entries[i].entry = i;
 	v_actual = i40e_reserve_msix_vectors(pf, v_budget);
 
-	if (v_actual != v_budget) {
-		/* If we have limited resources, we will start with no vectors
-		 * for the special features and then allocate vectors to some
-		 * of these features based on the policy and at the end disable
-		 * the features that did not get any vectors.
-		 */
-		iwarp_requested = pf->num_iwarp_msix;
-		pf->num_iwarp_msix = 0;
-#ifdef I40E_FCOE
-		pf->num_fcoe_qps = 0;
-		pf->num_fcoe_msix = 0;
-#endif
-		pf->num_vmdq_msix = 0;
-	}
-
 	if (v_actual < I40E_MIN_MSIX) {
 		pf->flags &= ~I40E_FLAG_MSIX_ENABLED;
 		kfree(pf->msix_entries);
@@ -7734,9 +7721,16 @@ static int i40e_init_msix(struct i40e_pf *pf)
 		pf->num_lan_qps = 1;
 		pf->num_lan_msix = 1;
 
-	} else if (v_actual != v_budget) {
+	} else if (!vectors_left) {
+		/* If we have limited resources, we will start with no vectors
+		 * for the special features and then allocate vectors to some
+		 * of these features based on the policy and at the end disable
+		 * the features that did not get any vectors.
+		 */
 		int vec;
 
+		dev_info(&pf->pdev->dev,
+			 "MSI-X vector limit reached, attempting to redistribute vectors\n");
 		/* reserve the misc vector */
 		vec = v_actual - 1;
 
@@ -7744,6 +7738,10 @@ static int i40e_init_msix(struct i40e_pf *pf)
 		pf->num_vmdq_msix = 1;    /* force VMDqs to only one vector */
 		pf->num_vmdq_vsis = 1;
 		pf->num_vmdq_qps = 1;
+#ifdef I40E_FCOE
+		pf->num_fcoe_qps = 0;
+		pf->num_fcoe_msix = 0;
+#endif
 		pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
 
 		/* partition out the remaining vectors */
@@ -7779,6 +7777,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
 			pf->num_lan_msix = min_t(int,
 			       (vec - (pf->num_iwarp_msix + pf->num_vmdq_vsis)),
 							      pf->num_lan_msix);
+			pf->num_lan_qps = pf->num_lan_msix;
 #ifdef I40E_FCOE
 			/* give one vector to FCoE */
 			if (pf->flags & I40E_FLAG_FCOE_ENABLED) {
@@ -7808,6 +7807,13 @@ static int i40e_init_msix(struct i40e_pf *pf)
 		pf->flags &= ~I40E_FLAG_FCOE_ENABLED;
 	}
 #endif
+	i40e_debug(&pf->hw, I40E_DEBUG_INIT,
+		   "MSI-X vector distribution: PF %d, VMDq %d, FDSB %d, iWARP %d\n",
+		   pf->num_lan_msix,
+		   pf->num_vmdq_msix * pf->num_vmdq_vsis,
+		   pf->num_fdsb_msix,
+		   pf->num_iwarp_msix);
+
 	return v_actual;
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 3/3] i40e: fix sideband flow director vector allocation
From: Stefan Assmann @ 2016-09-19 11:37 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: netdev, davem, jeffrey.t.kirsher, sassmann
In-Reply-To: <1474285071-28218-1-git-send-email-sassmann@kpanic.de>

Currently if the MSI-X vector limit is reached the sideband flow
director gets disabled. A bit too early to make that decision, as
vectors may get re-distributed. So move the check further back.

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 4727c02..19341fc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7645,7 +7645,6 @@ static int i40e_init_msix(struct i40e_pf *pf)
 			vectors_left--;
 		} else {
 			pf->num_fdsb_msix = 0;
-			pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
 		}
 	}
 
@@ -7742,7 +7741,6 @@ static int i40e_init_msix(struct i40e_pf *pf)
 		pf->num_fcoe_qps = 0;
 		pf->num_fcoe_msix = 0;
 #endif
-		pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
 
 		/* partition out the remaining vectors */
 		switch (vec) {
@@ -7774,6 +7772,10 @@ static int i40e_init_msix(struct i40e_pf *pf)
 				pf->num_vmdq_vsis = min_t(int, (vec / 2),
 						  I40E_DEFAULT_NUM_VMDQ_VSI);
 			}
+			if (pf->flags & I40E_FLAG_FD_SB_ENABLED) {
+				pf->num_fdsb_msix = 1;
+				vec--;
+			}
 			pf->num_lan_msix = min_t(int,
 			       (vec - (pf->num_iwarp_msix + pf->num_vmdq_vsis)),
 							      pf->num_lan_msix);
@@ -7789,6 +7791,11 @@ static int i40e_init_msix(struct i40e_pf *pf)
 		}
 	}
 
+	if ((pf->flags & I40E_FLAG_FD_SB_ENABLED) &&
+	    (pf->num_fdsb_msix == 0)) {
+		dev_info(&pf->pdev->dev, "Sideband Flowdir disabled, not enough MSI-X vectors\n");
+		pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
+	}
 	if ((pf->flags & I40E_FLAG_VMDQ_ENABLED) &&
 	    (pf->num_vmdq_msix == 0)) {
 		dev_info(&pf->pdev->dev, "VMDq disabled, not enough MSI-X vectors\n");
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 0/3] i40e: fix MSI-X vector allocation for > 128 CPU threads
From: Stefan Assmann @ 2016-09-19 11:37 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: netdev, davem, jeffrey.t.kirsher, sassmann

Stefan Assmann (3):
  i40e: check if vectors are already depleted when doing VMDq allocation
  i40e: fix MSI-X vector redistribution if hw limit is reached
  i40e: fix sideband flow director vector allocation

 drivers/net/ethernet/intel/i40e/i40e_main.c | 76 ++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 29 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next 1/3] i40e: check if vectors are already depleted when doing VMDq allocation
From: Stefan Assmann @ 2016-09-19 11:37 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: netdev, davem, jeffrey.t.kirsher, sassmann
In-Reply-To: <1474285071-28218-1-git-send-email-sassmann@kpanic.de>

During MSI-X vector allocation for VMDq, a check for "no vectors left"
was missing, add it. This prevents more vectors to be allocated than
available.

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 61b0fc4..5f3f1c8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7678,18 +7678,23 @@ static int i40e_init_msix(struct i40e_pf *pf)
 		int vmdq_vecs_wanted = pf->num_vmdq_vsis * pf->num_vmdq_qps;
 		int vmdq_vecs = min_t(int, vectors_left, vmdq_vecs_wanted);
 
-		/* if we're short on vectors for what's desired, we limit
-		 * the queues per vmdq.  If this is still more than are
-		 * available, the user will need to change the number of
-		 * queues/vectors used by the PF later with the ethtool
-		 * channels command
-		 */
-		if (vmdq_vecs < vmdq_vecs_wanted)
-			pf->num_vmdq_qps = 1;
-		pf->num_vmdq_msix = pf->num_vmdq_qps;
+		if (!vectors_left) {
+			pf->num_vmdq_msix = 0;
+			pf->num_vmdq_qps = 0;
+		} else {
+			/* if we're short on vectors for what's desired, we limit
+			 * the queues per vmdq.  If this is still more than are
+			 * available, the user will need to change the number of
+			 * queues/vectors used by the PF later with the ethtool
+			 * channels command
+			 */
+			if (vmdq_vecs < vmdq_vecs_wanted)
+				pf->num_vmdq_qps = 1;
+			pf->num_vmdq_msix = pf->num_vmdq_qps;
 
-		v_budget += vmdq_vecs;
-		vectors_left -= vmdq_vecs;
+			v_budget += vmdq_vecs;
+			vectors_left -= vmdq_vecs;
+		}
 	}
 
 	pf->msix_entries = kcalloc(v_budget, sizeof(struct msix_entry),
-- 
2.7.4

^ permalink raw reply related

* Re: [v3 PATCH 0/2] rhashtable: rhashtable with duplicate objects
From: Johannes Berg @ 2016-09-19 11:32 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA, Thomas Graf,
	tom-BjP2VixgY4xUbtYUoyoikg, Ben Greear
In-Reply-To: <1474283023.6544.0.camel-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>

On Mon, 2016-09-19 at 13:03 +0200, Johannes Berg wrote:
> On Mon, 2016-09-19 at 18:58 +0800, Herbert Xu wrote:
> > 
> > v3 fixes a bug in the remove path that causes the element count
> > to decrease when it shouldn't, leading to a gigantic hash table
> > when it underflows.
> > 
> Ok, with the BUG_ON() thrown in, this works in the test that was
> failing before. I'll run the entire suite again over lunch.
> 

Ok, the entire test suite passed (with the BUG_ON, but hey).

Dave, let me know what you want to do (or have done, as it may be).

Thanks,
johannes

^ permalink raw reply

* Re: [PATCH net] xfrm: Fix memory leak of aead algorithm name
From: Steffen Klassert @ 2016-09-19 11:07 UTC (permalink / raw)
  To: Ilan Tayari; +Cc: Herbert Xu, netdev@vger.kernel.org
In-Reply-To: <AM4PR0501MB19404B000E4D3A2CF3CB80DEDBF50@AM4PR0501MB1940.eurprd05.prod.outlook.com>

On Sun, Sep 18, 2016 at 07:42:53AM +0000, Ilan Tayari wrote:
> commit 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
> introduced aead. The function attach_aead kmemdup()s the algorithm
> name during xfrm_state_construct().
> However this memory is never freed.
> Implementation has since been slightly modified in
> commit ee5c23176fcc ("xfrm: Clone states properly on migration")
> without resolving this leak.
> This patch adds a kfree() call for the aead algorithm name.
> 
> Fixes: 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
> Signed-off-by: Ilan Tayari <ilant@mellanox.com>

Applied to the ipsec tree, thanks Ilan!

^ permalink raw reply

* Re: [v3 PATCH 0/2] rhashtable: rhashtable with duplicate objects
From: Johannes Berg @ 2016-09-19 11:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S. Miller, netdev, linux-wireless, Thomas Graf, tom,
	Ben Greear
In-Reply-To: <20160919105854.GA12892@gondor.apana.org.au>

On Mon, 2016-09-19 at 18:58 +0800, Herbert Xu wrote:
> v3 fixes a bug in the remove path that causes the element count
> to decrease when it shouldn't, leading to a gigantic hash table
> when it underflows.
> 
Ok, with the BUG_ON() thrown in, this works in the test that was
failing before. I'll run the entire suite again over lunch.

johannes

^ permalink raw reply

* [v3 PATCH 2/2] mac80211: Use rhltable instead of rhashtable
From: Herbert Xu @ 2016-09-19 11:00 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller, netdev, linux-wireless,
	Thomas Graf, tom, Ben Greear
In-Reply-To: <20160919105854.GA12892@gondor.apana.org.au>

mac80211 currently uses rhashtable with insecure_elasticity set
to true.  The latter is because of duplicate objects.  What's
more, mac80211 walks the rhashtable chains by hand which is broken
as rhashtable may contain multiple tables due to resizing or
rehashing.

This patch fixes it by converting it to the newly added rhltable
interface which is designed for use with duplicate objects.

With rhltable a lookup returns a list of objects instead of a
single one.  This is then fed into the existing for_each_sta_info
macro.

This patch also deletes the sta_addr_hash function since rhashtable
defaults to jhash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/mac80211/ieee80211_i.h |    2 -
 net/mac80211/rx.c          |    7 +-----
 net/mac80211/sta_info.c    |   52 ++++++++++++++++++---------------------------
 net/mac80211/sta_info.h    |   19 ++++++----------
 net/mac80211/status.c      |    7 +-----
 5 files changed, 33 insertions(+), 54 deletions(-)

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index f56d342..1a52cd4 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1208,7 +1208,7 @@ struct ieee80211_local {
 	spinlock_t tim_lock;
 	unsigned long num_sta;
 	struct list_head sta_list;
-	struct rhashtable sta_hash;
+	struct rhltable sta_hash;
 	struct timer_list sta_cleanup;
 	int sta_generation;
 
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index 9dce3b1..5e26dc6 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -3940,7 +3940,7 @@ static void __ieee80211_rx_handle_packet(struct ieee80211_hw *hw,
 	__le16 fc;
 	struct ieee80211_rx_data rx;
 	struct ieee80211_sub_if_data *prev;
-	struct rhash_head *tmp;
+	struct rhlist_head *tmp;
 	int err = 0;
 
 	fc = ((struct ieee80211_hdr *)skb->data)->frame_control;
@@ -3983,13 +3983,10 @@ static void __ieee80211_rx_handle_packet(struct ieee80211_hw *hw,
 		goto out;
 	} else if (ieee80211_is_data(fc)) {
 		struct sta_info *sta, *prev_sta;
-		const struct bucket_table *tbl;
 
 		prev_sta = NULL;
 
-		tbl = rht_dereference_rcu(local->sta_hash.tbl, &local->sta_hash);
-
-		for_each_sta_info(local, tbl, hdr->addr2, sta, tmp) {
+		for_each_sta_info(local, hdr->addr2, sta, tmp) {
 			if (!prev_sta) {
 				prev_sta = sta;
 				continue;
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 19f14c9..198d0bd 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -67,12 +67,10 @@
 
 static const struct rhashtable_params sta_rht_params = {
 	.nelem_hint = 3, /* start small */
-	.insecure_elasticity = true, /* Disable chain-length checks. */
 	.automatic_shrinking = true,
 	.head_offset = offsetof(struct sta_info, hash_node),
 	.key_offset = offsetof(struct sta_info, addr),
 	.key_len = ETH_ALEN,
-	.hashfn = sta_addr_hash,
 	.max_size = CONFIG_MAC80211_STA_HASH_MAX_SIZE,
 };
 
@@ -80,8 +78,8 @@ static const struct rhashtable_params sta_rht_params = {
 static int sta_info_hash_del(struct ieee80211_local *local,
 			     struct sta_info *sta)
 {
-	return rhashtable_remove_fast(&local->sta_hash, &sta->hash_node,
-				      sta_rht_params);
+	return rhltable_remove(&local->sta_hash, &sta->hash_node,
+			       sta_rht_params);
 }
 
 static void __cleanup_single_sta(struct sta_info *sta)
@@ -157,19 +155,22 @@ static void cleanup_single_sta(struct sta_info *sta)
 	sta_info_free(local, sta);
 }
 
+struct rhlist_head *sta_info_hash_lookup(struct ieee80211_local *local,
+					 const u8 *addr)
+{
+	return rhltable_lookup(&local->sta_hash, addr, sta_rht_params);
+}
+
 /* protected by RCU */
 struct sta_info *sta_info_get(struct ieee80211_sub_if_data *sdata,
 			      const u8 *addr)
 {
 	struct ieee80211_local *local = sdata->local;
+	struct rhlist_head *tmp;
 	struct sta_info *sta;
-	struct rhash_head *tmp;
-	const struct bucket_table *tbl;
 
 	rcu_read_lock();
-	tbl = rht_dereference_rcu(local->sta_hash.tbl, &local->sta_hash);
-
-	for_each_sta_info(local, tbl, addr, sta, tmp) {
+	for_each_sta_info(local, addr, sta, tmp) {
 		if (sta->sdata == sdata) {
 			rcu_read_unlock();
 			/* this is safe as the caller must already hold
@@ -190,14 +191,11 @@ struct sta_info *sta_info_get_bss(struct ieee80211_sub_if_data *sdata,
 				  const u8 *addr)
 {
 	struct ieee80211_local *local = sdata->local;
+	struct rhlist_head *tmp;
 	struct sta_info *sta;
-	struct rhash_head *tmp;
-	const struct bucket_table *tbl;
 
 	rcu_read_lock();
-	tbl = rht_dereference_rcu(local->sta_hash.tbl, &local->sta_hash);
-
-	for_each_sta_info(local, tbl, addr, sta, tmp) {
+	for_each_sta_info(local, addr, sta, tmp) {
 		if (sta->sdata == sdata ||
 		    (sta->sdata->bss && sta->sdata->bss == sdata->bss)) {
 			rcu_read_unlock();
@@ -263,8 +261,8 @@ void sta_info_free(struct ieee80211_local *local, struct sta_info *sta)
 static int sta_info_hash_add(struct ieee80211_local *local,
 			     struct sta_info *sta)
 {
-	return rhashtable_insert_fast(&local->sta_hash, &sta->hash_node,
-				      sta_rht_params);
+	return rhltable_insert(&local->sta_hash, &sta->hash_node,
+			       sta_rht_params);
 }
 
 static void sta_deliver_ps_frames(struct work_struct *wk)
@@ -450,9 +448,9 @@ static int sta_info_insert_check(struct sta_info *sta)
 		    is_multicast_ether_addr(sta->sta.addr)))
 		return -EINVAL;
 
-	/* Strictly speaking this isn't necessary as we hold the mutex, but
-	 * the rhashtable code can't really deal with that distinction. We
-	 * do require the mutex for correctness though.
+	/* The RCU read lock is required by rhashtable due to
+	 * asynchronous resize/rehash.  We also require the mutex
+	 * for correctness.
 	 */
 	rcu_read_lock();
 	lockdep_assert_held(&sdata->local->sta_mtx);
@@ -1040,16 +1038,11 @@ static void sta_info_cleanup(unsigned long data)
 		  round_jiffies(jiffies + STA_INFO_CLEANUP_INTERVAL));
 }
 
-u32 sta_addr_hash(const void *key, u32 length, u32 seed)
-{
-	return jhash(key, ETH_ALEN, seed);
-}
-
 int sta_info_init(struct ieee80211_local *local)
 {
 	int err;
 
-	err = rhashtable_init(&local->sta_hash, &sta_rht_params);
+	err = rhltable_init(&local->sta_hash, &sta_rht_params);
 	if (err)
 		return err;
 
@@ -1065,7 +1058,7 @@ int sta_info_init(struct ieee80211_local *local)
 void sta_info_stop(struct ieee80211_local *local)
 {
 	del_timer_sync(&local->sta_cleanup);
-	rhashtable_destroy(&local->sta_hash);
+	rhltable_destroy(&local->sta_hash);
 }
 
 
@@ -1135,17 +1128,14 @@ struct ieee80211_sta *ieee80211_find_sta_by_ifaddr(struct ieee80211_hw *hw,
 						   const u8 *localaddr)
 {
 	struct ieee80211_local *local = hw_to_local(hw);
+	struct rhlist_head *tmp;
 	struct sta_info *sta;
-	struct rhash_head *tmp;
-	const struct bucket_table *tbl;
-
-	tbl = rht_dereference_rcu(local->sta_hash.tbl, &local->sta_hash);
 
 	/*
 	 * Just return a random station if localaddr is NULL
 	 * ... first in list.
 	 */
-	for_each_sta_info(local, tbl, addr, sta, tmp) {
+	for_each_sta_info(local, addr, sta, tmp) {
 		if (localaddr &&
 		    !ether_addr_equal(sta->sdata->vif.addr, localaddr))
 			continue;
diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
index 0556be3..d347ab5 100644
--- a/net/mac80211/sta_info.h
+++ b/net/mac80211/sta_info.h
@@ -452,7 +452,7 @@ struct sta_info {
 	/* General information, mostly static */
 	struct list_head list, free_list;
 	struct rcu_head rcu_head;
-	struct rhash_head hash_node;
+	struct rhlist_head hash_node;
 	u8 addr[ETH_ALEN];
 	struct ieee80211_local *local;
 	struct ieee80211_sub_if_data *sdata;
@@ -635,6 +635,9 @@ rcu_dereference_protected_tid_tx(struct sta_info *sta, int tid)
  */
 #define STA_INFO_CLEANUP_INTERVAL (10 * HZ)
 
+struct rhlist_head *sta_info_hash_lookup(struct ieee80211_local *local,
+					 const u8 *addr);
+
 /*
  * Get a STA info, must be under RCU read lock.
  */
@@ -644,17 +647,9 @@ struct sta_info *sta_info_get(struct ieee80211_sub_if_data *sdata,
 struct sta_info *sta_info_get_bss(struct ieee80211_sub_if_data *sdata,
 				  const u8 *addr);
 
-u32 sta_addr_hash(const void *key, u32 length, u32 seed);
-
-#define _sta_bucket_idx(_tbl, _a)					\
-	rht_bucket_index(_tbl, sta_addr_hash(_a, ETH_ALEN, (_tbl)->hash_rnd))
-
-#define for_each_sta_info(local, tbl, _addr, _sta, _tmp)		\
-	rht_for_each_entry_rcu(_sta, _tmp, tbl, 			\
-			       _sta_bucket_idx(tbl, _addr),		\
-			       hash_node)				\
-	/* compare address and run code only if it matches */		\
-	if (ether_addr_equal(_sta->addr, (_addr)))
+#define for_each_sta_info(local, _addr, _sta, _tmp)			\
+	rhl_for_each_entry_rcu(_sta, _tmp,				\
+			       sta_info_hash_lookup(local, _addr), hash_node)
 
 /*
  * Get STA info by index, BROKEN!
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index a2a6826..6361709 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -740,8 +740,8 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
 	__le16 fc;
 	struct ieee80211_supported_band *sband;
+	struct rhlist_head *tmp;
 	struct sta_info *sta;
-	struct rhash_head *tmp;
 	int retry_count;
 	int rates_idx;
 	bool send_to_cooked;
@@ -749,7 +749,6 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 	struct ieee80211_bar *bar;
 	int shift = 0;
 	int tid = IEEE80211_NUM_TIDS;
-	const struct bucket_table *tbl;
 
 	rates_idx = ieee80211_tx_get_rates(hw, info, &retry_count);
 
@@ -758,9 +757,7 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 	sband = local->hw.wiphy->bands[info->band];
 	fc = hdr->frame_control;
 
-	tbl = rht_dereference_rcu(local->sta_hash.tbl, &local->sta_hash);
-
-	for_each_sta_info(local, tbl, hdr->addr1, sta, tmp) {
+	for_each_sta_info(local, hdr->addr1, sta, tmp) {
 		/* skip wrong virtual interface */
 		if (!ether_addr_equal(hdr->addr2, sta->sdata->vif.addr))
 			continue;

^ permalink raw reply related

* [v3 PATCH 1/2] rhashtable: Add rhlist interface
From: Herbert Xu @ 2016-09-19 11:00 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA, Thomas Graf,
	tom-BjP2VixgY4xUbtYUoyoikg, Ben Greear
In-Reply-To: <20160919105854.GA12892@gondor.apana.org.au>

The insecure_elasticity setting is an ugly wart brought out by
users who need to insert duplicate objects (that is, distinct
objects with identical keys) into the same table.

In fact, those users have a much bigger problem.  Once those
duplicate objects are inserted, they don't have an interface to
find them (unless you count the walker interface which walks
over the entire table).

Some users have resorted to doing a manual walk over the hash
table which is of course broken because they don't handle the
potential existence of multiple hash tables.  The result is that
they will break sporadically when they encounter a hash table
resize/rehash.

This patch provides a way out for those users, at the expense
of an extra pointer per object.  Essentially each object is now
a list of objects carrying the same key.  The hash table will
only see the lists so nothing changes as far as rhashtable is
concerned.

To use this new interface, you need to insert a struct rhlist_head
into your objects instead of struct rhash_head.  While the hash
table is unchanged, for type-safety you'll need to use struct
rhltable instead of struct rhashtable.  All the existing interfaces
have been duplicated for rhlist, including the hash table walker.

One missing feature is nulls marking because AFAIK the only potential
user of it does not need duplicate objects.  Should anyone need
this it shouldn't be too hard to add.

Signed-off-by: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
---

 include/linux/rhashtable.h |  491 ++++++++++++++++++++++++++++++++++-----------
 lib/rhashtable.c           |  258 ++++++++++++++++++-----
 2 files changed, 583 insertions(+), 166 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index fd82584..5c132d3 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1,7 +1,7 @@
 /*
  * Resizable, Scalable, Concurrent Hash Table
  *
- * Copyright (c) 2015 Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
+ * Copyright (c) 2015-2016 Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
  * Copyright (c) 2014-2015 Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org>
  * Copyright (c) 2008-2014 Patrick McHardy <kaber-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
  *
@@ -53,6 +53,11 @@ struct rhash_head {
 	struct rhash_head __rcu		*next;
 };
 
+struct rhlist_head {
+	struct rhash_head		rhead;
+	struct rhlist_head __rcu	*next;
+};
+
 /**
  * struct bucket_table - Table of hash buckets
  * @size: Number of hash buckets
@@ -137,6 +142,7 @@ struct rhashtable_params {
  * @key_len: Key length for hashfn
  * @elasticity: Maximum chain length before rehash
  * @p: Configuration parameters
+ * @rhlist: True if this is an rhltable
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
  * @lock: Spin lock to protect walker list
@@ -147,12 +153,21 @@ struct rhashtable {
 	unsigned int			key_len;
 	unsigned int			elasticity;
 	struct rhashtable_params	p;
+	bool				rhlist;
 	struct work_struct		run_work;
 	struct mutex                    mutex;
 	spinlock_t			lock;
 };
 
 /**
+ * struct rhltable - Hash table with duplicate objects in a list
+ * @ht: Underlying rhtable
+ */
+struct rhltable {
+	struct rhashtable ht;
+};
+
+/**
  * struct rhashtable_walker - Hash table walker
  * @list: List entry on list of walkers
  * @tbl: The table that we were walking over
@@ -163,9 +178,10 @@ struct rhashtable_walker {
 };
 
 /**
- * struct rhashtable_iter - Hash table iterator, fits into netlink cb
+ * struct rhashtable_iter - Hash table iterator
  * @ht: Table to iterate through
  * @p: Current pointer
+ * @list: Current hash list pointer
  * @walker: Associated rhashtable walker
  * @slot: Current slot
  * @skip: Number of entries to skip in slot
@@ -173,6 +189,7 @@ struct rhashtable_walker {
 struct rhashtable_iter {
 	struct rhashtable *ht;
 	struct rhash_head *p;
+	struct rhlist_head *list;
 	struct rhashtable_walker walker;
 	unsigned int slot;
 	unsigned int skip;
@@ -339,13 +356,11 @@ static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
 
 int rhashtable_init(struct rhashtable *ht,
 		    const struct rhashtable_params *params);
+int rhltable_init(struct rhltable *hlt,
+		  const struct rhashtable_params *params);
 
-struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
-					    const void *key,
-					    struct rhash_head *obj,
-					    struct bucket_table *old_tbl,
-					    void **data);
-int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl);
+void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
+			     struct rhash_head *obj);
 
 void rhashtable_walk_enter(struct rhashtable *ht,
 			   struct rhashtable_iter *iter);
@@ -507,6 +522,31 @@ void rhashtable_destroy(struct rhashtable *ht);
 	rht_for_each_entry_rcu_continue(tpos, pos, (tbl)->buckets[hash],\
 					tbl, hash, member)
 
+/**
+ * rhl_for_each_rcu - iterate over rcu hash table list
+ * @pos:	the &struct rlist_head to use as a loop cursor.
+ * @list:	the head of the list
+ *
+ * This hash chain list-traversal primitive should be used on the
+ * list returned by rhltable_lookup.
+ */
+#define rhl_for_each_rcu(pos, list)					\
+	for (pos = list; pos; pos = rcu_dereference_raw(pos->next))
+
+/**
+ * rhl_for_each_entry_rcu - iterate over rcu hash table list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rlist_head to use as a loop cursor.
+ * @list:	the head of the list
+ * @member:	name of the &struct rlist_head within the hashable struct.
+ *
+ * This hash chain list-traversal primitive should be used on the
+ * list returned by rhltable_lookup.
+ */
+#define rhl_for_each_entry_rcu(tpos, pos, list, member)			\
+	for (pos = list; pos && rht_entry(tpos, pos, member);		\
+	     pos = rcu_dereference_raw(pos->next))
+
 static inline int rhashtable_compare(struct rhashtable_compare_arg *arg,
 				     const void *obj)
 {
@@ -516,18 +556,8 @@ static inline int rhashtable_compare(struct rhashtable_compare_arg *arg,
 	return memcmp(ptr + ht->p.key_offset, arg->key, ht->p.key_len);
 }
 
-/**
- * rhashtable_lookup_fast - search hash table, inlined version
- * @ht:		hash table
- * @key:	the pointer to the key
- * @params:	hash table parameters
- *
- * Computes the hash value for the key and traverses the bucket chain looking
- * for a entry with an identical key. The first matching entry is returned.
- *
- * Returns the first entry on which the compare function returned true.
- */
-static inline void *rhashtable_lookup_fast(
+/* Internal function, do not use. */
+static inline struct rhash_head *__rhashtable_lookup(
 	struct rhashtable *ht, const void *key,
 	const struct rhashtable_params params)
 {
@@ -539,8 +569,6 @@ static inline void *rhashtable_lookup_fast(
 	struct rhash_head *he;
 	unsigned int hash;
 
-	rcu_read_lock();
-
 	tbl = rht_dereference_rcu(ht->tbl, ht);
 restart:
 	hash = rht_key_hashfn(ht, tbl, key, params);
@@ -549,8 +577,7 @@ restart:
 		    params.obj_cmpfn(&arg, rht_obj(ht, he)) :
 		    rhashtable_compare(&arg, rht_obj(ht, he)))
 			continue;
-		rcu_read_unlock();
-		return rht_obj(ht, he);
+		return he;
 	}
 
 	/* Ensure we see any new tables. */
@@ -559,96 +586,165 @@ restart:
 	tbl = rht_dereference_rcu(tbl->future_tbl, ht);
 	if (unlikely(tbl))
 		goto restart;
-	rcu_read_unlock();
 
 	return NULL;
 }
 
+/**
+ * rhashtable_lookup - search hash table
+ * @ht:		hash table
+ * @key:	the pointer to the key
+ * @params:	hash table parameters
+ *
+ * Computes the hash value for the key and traverses the bucket chain looking
+ * for a entry with an identical key. The first matching entry is returned.
+ *
+ * This must only be called under the RCU read lock.
+ *
+ * Returns the first entry on which the compare function returned true.
+ */
+static inline void *rhashtable_lookup(
+	struct rhashtable *ht, const void *key,
+	const struct rhashtable_params params)
+{
+	struct rhash_head *he = __rhashtable_lookup(ht, key, params);
+
+	return he ? rht_obj(ht, he) : NULL;
+}
+
+/**
+ * rhashtable_lookup_fast - search hash table, without RCU read lock
+ * @ht:		hash table
+ * @key:	the pointer to the key
+ * @params:	hash table parameters
+ *
+ * Computes the hash value for the key and traverses the bucket chain looking
+ * for a entry with an identical key. The first matching entry is returned.
+ *
+ * Only use this function when you have other mechanisms guaranteeing
+ * that the object won't go away after the RCU read lock is released.
+ *
+ * Returns the first entry on which the compare function returned true.
+ */
+static inline void *rhashtable_lookup_fast(
+	struct rhashtable *ht, const void *key,
+	const struct rhashtable_params params)
+{
+	void *obj;
+
+	rcu_read_lock();
+	obj = rhashtable_lookup(ht, key, params);
+	rcu_read_unlock();
+
+	return obj;
+}
+
+/**
+ * rhltable_lookup - search hash list table
+ * @hlt:	hash table
+ * @key:	the pointer to the key
+ * @params:	hash table parameters
+ *
+ * Computes the hash value for the key and traverses the bucket chain looking
+ * for a entry with an identical key.  All matching entries are returned
+ * in a list.
+ *
+ * This must only be called under the RCU read lock.
+ *
+ * Returns the list of entries that match the given key.
+ */
+static inline struct rhlist_head *rhltable_lookup(
+	struct rhltable *hlt, const void *key,
+	const struct rhashtable_params params)
+{
+	struct rhash_head *he = __rhashtable_lookup(&hlt->ht, key, params);
+
+	return he ? container_of(he, struct rhlist_head, rhead) : NULL;
+}
+
 /* Internal function, please use rhashtable_insert_fast() instead. This
  * function returns the existing element already in hashes in there is a clash,
  * otherwise it returns an error via ERR_PTR().
  */
 static inline void *__rhashtable_insert_fast(
 	struct rhashtable *ht, const void *key, struct rhash_head *obj,
-	const struct rhashtable_params params)
+	const struct rhashtable_params params, bool rhlist)
 {
 	struct rhashtable_compare_arg arg = {
 		.ht = ht,
 		.key = key,
 	};
-	struct bucket_table *tbl, *new_tbl;
+	struct rhash_head __rcu **pprev;
+	struct bucket_table *tbl;
 	struct rhash_head *head;
 	spinlock_t *lock;
-	unsigned int elasticity;
 	unsigned int hash;
-	void *data = NULL;
-	int err;
+	int elasticity;
+	void *data;
 
-restart:
 	rcu_read_lock();
 
 	tbl = rht_dereference_rcu(ht->tbl, ht);
+	hash = rht_head_hashfn(ht, tbl, obj, params);
+	lock = rht_bucket_lock(tbl, hash);
+	spin_lock_bh(lock);
 
-	/* All insertions must grab the oldest table containing
-	 * the hashed bucket that is yet to be rehashed.
-	 */
-	for (;;) {
-		hash = rht_head_hashfn(ht, tbl, obj, params);
-		lock = rht_bucket_lock(tbl, hash);
-		spin_lock_bh(lock);
-
-		if (tbl->rehash <= hash)
-			break;
-
+	if (unlikely(rht_dereference_bucket(tbl->future_tbl, tbl, hash))) {
+slow_path:
 		spin_unlock_bh(lock);
-		tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+		rcu_read_unlock();
+		return rhashtable_insert_slow(ht, key, obj);
 	}
 
-	new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
-	if (unlikely(new_tbl)) {
-		tbl = rhashtable_insert_slow(ht, key, obj, new_tbl, &data);
-		if (!IS_ERR_OR_NULL(tbl))
-			goto slow_path;
+	elasticity = ht->elasticity;
+	pprev = &tbl->buckets[hash];
+	rht_for_each(head, tbl, hash) {
+		struct rhlist_head *plist;
+		struct rhlist_head *list;
+
+		elasticity--;
+		if (!key ||
+		    (params.obj_cmpfn ?
+		     params.obj_cmpfn(&arg, rht_obj(ht, head)) :
+		     rhashtable_compare(&arg, rht_obj(ht, head))))
+			continue;
+
+		data = rht_obj(ht, head);
 
-		err = PTR_ERR(tbl);
-		if (err == -EEXIST)
-			err = 0;
+		if (!rhlist)
+			goto out;
 
-		goto out;
-	}
 
-	err = -E2BIG;
-	if (unlikely(rht_grow_above_max(ht, tbl)))
-		goto out;
+		list = container_of(obj, struct rhlist_head, rhead);
+		plist = container_of(head, struct rhlist_head, rhead);
 
-	if (unlikely(rht_grow_above_100(ht, tbl))) {
-slow_path:
-		spin_unlock_bh(lock);
-		err = rhashtable_insert_rehash(ht, tbl);
-		rcu_read_unlock();
-		if (err)
-			return ERR_PTR(err);
+		RCU_INIT_POINTER(list->next, plist);
+		head = rht_dereference_bucket(head->next, tbl, hash);
+		RCU_INIT_POINTER(list->rhead.next, head);
+		rcu_assign_pointer(*pprev, obj);
 
-		goto restart;
+		goto good;
 	}
 
-	err = 0;
-	elasticity = ht->elasticity;
-	rht_for_each(head, tbl, hash) {
-		if (key &&
-		    unlikely(!(params.obj_cmpfn ?
-			       params.obj_cmpfn(&arg, rht_obj(ht, head)) :
-			       rhashtable_compare(&arg, rht_obj(ht, head))))) {
-			data = rht_obj(ht, head);
-			goto out;
-		}
-		if (!--elasticity)
-			goto slow_path;
-	}
+	if (elasticity <= 0)
+		goto slow_path;
+
+	data = ERR_PTR(-E2BIG);
+	if (unlikely(rht_grow_above_max(ht, tbl)))
+		goto out;
+
+	if (unlikely(rht_grow_above_100(ht, tbl)))
+		goto slow_path;
 
 	head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
 
 	RCU_INIT_POINTER(obj->next, head);
+	if (rhlist) {
+		struct rhlist_head *list;
+
+		list = container_of(obj, struct rhlist_head, rhead);
+		RCU_INIT_POINTER(list->next, NULL);
+	}
 
 	rcu_assign_pointer(tbl->buckets[hash], obj);
 
@@ -656,11 +752,14 @@ slow_path:
 	if (rht_grow_above_75(ht, tbl))
 		schedule_work(&ht->run_work);
 
+good:
+	data = NULL;
+
 out:
 	spin_unlock_bh(lock);
 	rcu_read_unlock();
 
-	return err ? ERR_PTR(err) : data;
+	return data;
 }
 
 /**
@@ -685,7 +784,7 @@ static inline int rhashtable_insert_fast(
 {
 	void *ret;
 
-	ret = __rhashtable_insert_fast(ht, NULL, obj, params);
+	ret = __rhashtable_insert_fast(ht, NULL, obj, params, false);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 
@@ -693,6 +792,58 @@ static inline int rhashtable_insert_fast(
 }
 
 /**
+ * rhltable_insert_key - insert object into hash list table
+ * @hlt:	hash list table
+ * @key:	the pointer to the key
+ * @list:	pointer to hash list head inside object
+ * @params:	hash table parameters
+ *
+ * Will take a per bucket spinlock to protect against mutual mutations
+ * on the same bucket. Multiple insertions may occur in parallel unless
+ * they map to the same bucket lock.
+ *
+ * It is safe to call this function from atomic context.
+ *
+ * Will trigger an automatic deferred table resizing if the size grows
+ * beyond the watermark indicated by grow_decision() which can be passed
+ * to rhashtable_init().
+ */
+static inline int rhltable_insert_key(
+	struct rhltable *hlt, const void *key, struct rhlist_head *list,
+	const struct rhashtable_params params)
+{
+	return PTR_ERR(__rhashtable_insert_fast(&hlt->ht, key, &list->rhead,
+						params, true));
+}
+
+/**
+ * rhltable_insert - insert object into hash list table
+ * @hlt:	hash list table
+ * @list:	pointer to hash list head inside object
+ * @params:	hash table parameters
+ *
+ * Will take a per bucket spinlock to protect against mutual mutations
+ * on the same bucket. Multiple insertions may occur in parallel unless
+ * they map to the same bucket lock.
+ *
+ * It is safe to call this function from atomic context.
+ *
+ * Will trigger an automatic deferred table resizing if the size grows
+ * beyond the watermark indicated by grow_decision() which can be passed
+ * to rhashtable_init().
+ */
+static inline int rhltable_insert(
+	struct rhltable *hlt, struct rhlist_head *list,
+	const struct rhashtable_params params)
+{
+	const char *key = rht_obj(&hlt->ht, &list->rhead);
+
+	key += params.key_offset;
+
+	return rhltable_insert_key(hlt, key, list, params);
+}
+
+/**
  * rhashtable_lookup_insert_fast - lookup and insert object into hash table
  * @ht:		hash table
  * @obj:	pointer to hash head inside object
@@ -722,7 +873,8 @@ static inline int rhashtable_lookup_insert_fast(
 
 	BUG_ON(ht->p.obj_hashfn);
 
-	ret = __rhashtable_insert_fast(ht, key + ht->p.key_offset, obj, params);
+	ret = __rhashtable_insert_fast(ht, key + ht->p.key_offset, obj, params,
+				       false);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 
@@ -759,7 +911,7 @@ static inline int rhashtable_lookup_insert_key(
 
 	BUG_ON(!ht->p.obj_hashfn || !key);
 
-	ret = __rhashtable_insert_fast(ht, key, obj, params);
+	ret = __rhashtable_insert_fast(ht, key, obj, params, false);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 
@@ -783,13 +935,14 @@ static inline void *rhashtable_lookup_get_insert_key(
 {
 	BUG_ON(!ht->p.obj_hashfn || !key);
 
-	return __rhashtable_insert_fast(ht, key, obj, params);
+	return __rhashtable_insert_fast(ht, key, obj, params, false);
 }
 
 /* Internal function, please use rhashtable_remove_fast() instead */
-static inline int __rhashtable_remove_fast(
+static inline int __rhashtable_remove_fast_one(
 	struct rhashtable *ht, struct bucket_table *tbl,
-	struct rhash_head *obj, const struct rhashtable_params params)
+	struct rhash_head *obj, const struct rhashtable_params params,
+	bool rhlist)
 {
 	struct rhash_head __rcu **pprev;
 	struct rhash_head *he;
@@ -804,39 +957,66 @@ static inline int __rhashtable_remove_fast(
 
 	pprev = &tbl->buckets[hash];
 	rht_for_each(he, tbl, hash) {
+		struct rhlist_head *list;
+
+		list = container_of(he, struct rhlist_head, rhead);
+
 		if (he != obj) {
+			struct rhlist_head __rcu **lpprev;
+
 			pprev = &he->next;
-			continue;
+
+			if (!rhlist)
+				continue;
+
+			do {
+				lpprev = &list->next;
+				list = rht_dereference_bucket(list->next,
+							      tbl, hash);
+			} while (list && obj != &list->rhead);
+
+			if (!list)
+				continue;
+
+			list = rht_dereference_bucket(list->next, tbl, hash);
+			RCU_INIT_POINTER(*lpprev, list);
+			err = 0;
+			break;
 		}
 
-		rcu_assign_pointer(*pprev, obj->next);
-		err = 0;
+		obj = rht_dereference_bucket(obj->next, tbl, hash);
+		err = 1;
+
+		if (rhlist) {
+			list = rht_dereference_bucket(list->next, tbl, hash);
+			if (list) {
+				RCU_INIT_POINTER(list->rhead.next, obj);
+				obj = &list->rhead;
+				err = 0;
+			}
+		}
+
+		rcu_assign_pointer(*pprev, obj);
 		break;
 	}
 
 	spin_unlock_bh(lock);
 
+	if (err > 0) {
+		atomic_dec(&ht->nelems);
+		if (unlikely(ht->p.automatic_shrinking &&
+			     rht_shrink_below_30(ht, tbl)))
+			schedule_work(&ht->run_work);
+		err = 0;
+	}
+
 	return err;
 }
 
-/**
- * rhashtable_remove_fast - remove object from hash table
- * @ht:		hash table
- * @obj:	pointer to hash head inside object
- * @params:	hash table parameters
- *
- * Since the hash chain is single linked, the removal operation needs to
- * walk the bucket chain upon removal. The removal operation is thus
- * considerable slow if the hash table is not correctly sized.
- *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
- *
- * Returns zero on success, -ENOENT if the entry could not be found.
- */
-static inline int rhashtable_remove_fast(
+/* Internal function, please use rhashtable_remove_fast() instead */
+static inline int __rhashtable_remove_fast(
 	struct rhashtable *ht, struct rhash_head *obj,
-	const struct rhashtable_params params)
+	const struct rhashtable_params params, bool rhlist)
 {
 	struct bucket_table *tbl;
 	int err;
@@ -850,24 +1030,60 @@ static inline int rhashtable_remove_fast(
 	 * visible then that guarantees the entry to still be in
 	 * the old tbl if it exists.
 	 */
-	while ((err = __rhashtable_remove_fast(ht, tbl, obj, params)) &&
+	while ((err = __rhashtable_remove_fast_one(ht, tbl, obj, params,
+						   rhlist)) &&
 	       (tbl = rht_dereference_rcu(tbl->future_tbl, ht)))
 		;
 
-	if (err)
-		goto out;
-
-	atomic_dec(&ht->nelems);
-	if (unlikely(ht->p.automatic_shrinking &&
-		     rht_shrink_below_30(ht, tbl)))
-		schedule_work(&ht->run_work);
-
-out:
 	rcu_read_unlock();
 
 	return err;
 }
 
+/**
+ * rhashtable_remove_fast - remove object from hash table
+ * @ht:		hash table
+ * @obj:	pointer to hash head inside object
+ * @params:	hash table parameters
+ *
+ * Since the hash chain is single linked, the removal operation needs to
+ * walk the bucket chain upon removal. The removal operation is thus
+ * considerable slow if the hash table is not correctly sized.
+ *
+ * Will automatically shrink the table via rhashtable_expand() if the
+ * shrink_decision function specified at rhashtable_init() returns true.
+ *
+ * Returns zero on success, -ENOENT if the entry could not be found.
+ */
+static inline int rhashtable_remove_fast(
+	struct rhashtable *ht, struct rhash_head *obj,
+	const struct rhashtable_params params)
+{
+	return __rhashtable_remove_fast(ht, obj, params, false);
+}
+
+/**
+ * rhltable_remove - remove object from hash list table
+ * @hlt:	hash list table
+ * @list:	pointer to hash list head inside object
+ * @params:	hash table parameters
+ *
+ * Since the hash chain is single linked, the removal operation needs to
+ * walk the bucket chain upon removal. The removal operation is thus
+ * considerable slow if the hash table is not correctly sized.
+ *
+ * Will automatically shrink the table via rhashtable_expand() if the
+ * shrink_decision function specified at rhashtable_init() returns true.
+ *
+ * Returns zero on success, -ENOENT if the entry could not be found.
+ */
+static inline int rhltable_remove(
+	struct rhltable *hlt, struct rhlist_head *list,
+	const struct rhashtable_params params)
+{
+	return __rhashtable_remove_fast(&hlt->ht, &list->rhead, params, true);
+}
+
 /* Internal function, please use rhashtable_replace_fast() instead */
 static inline int __rhashtable_replace_fast(
 	struct rhashtable *ht, struct bucket_table *tbl,
@@ -958,4 +1174,51 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
 	return 0;
 }
 
+/**
+ * rhltable_walk_enter - Initialise an iterator
+ * @hlt:	Table to walk over
+ * @iter:	Hash table Iterator
+ *
+ * This function prepares a hash table walk.
+ *
+ * Note that if you restart a walk after rhashtable_walk_stop you
+ * may see the same object twice.  Also, you may miss objects if
+ * there are removals in between rhashtable_walk_stop and the next
+ * call to rhashtable_walk_start.
+ *
+ * For a completely stable walk you should construct your own data
+ * structure outside the hash table.
+ *
+ * This function may sleep so you must not call it from interrupt
+ * context or with spin locks held.
+ *
+ * You must call rhashtable_walk_exit after this function returns.
+ */
+static inline void rhltable_walk_enter(struct rhltable *hlt,
+				       struct rhashtable_iter *iter)
+{
+	return rhashtable_walk_enter(&hlt->ht, iter);
+}
+
+/**
+ * rhltable_free_and_destroy - free elements and destroy hash list table
+ * @hlt:	the hash list table to destroy
+ * @free_fn:	callback to release resources of element
+ * @arg:	pointer passed to free_fn
+ *
+ * See documentation for rhashtable_free_and_destroy.
+ */
+static inline void rhltable_free_and_destroy(struct rhltable *hlt,
+					     void (*free_fn)(void *ptr,
+							     void *arg),
+					     void *arg)
+{
+	return rhashtable_free_and_destroy(&hlt->ht, free_fn, arg);
+}
+
+static inline void rhltable_destroy(struct rhltable *hlt)
+{
+	return rhltable_free_and_destroy(hlt, NULL, NULL);
+}
+
 #endif /* _LINUX_RHASHTABLE_H */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 06c2872..32d0ad0 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -378,22 +378,8 @@ static void rht_deferred_worker(struct work_struct *work)
 		schedule_work(&ht->run_work);
 }
 
-static bool rhashtable_check_elasticity(struct rhashtable *ht,
-					struct bucket_table *tbl,
-					unsigned int hash)
-{
-	unsigned int elasticity = ht->elasticity;
-	struct rhash_head *head;
-
-	rht_for_each(head, tbl, hash)
-		if (!--elasticity)
-			return true;
-
-	return false;
-}
-
-int rhashtable_insert_rehash(struct rhashtable *ht,
-			     struct bucket_table *tbl)
+static int rhashtable_insert_rehash(struct rhashtable *ht,
+				    struct bucket_table *tbl)
 {
 	struct bucket_table *old_tbl;
 	struct bucket_table *new_tbl;
@@ -439,57 +425,165 @@ fail:
 
 	return err;
 }
-EXPORT_SYMBOL_GPL(rhashtable_insert_rehash);
 
-struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
-					    const void *key,
-					    struct rhash_head *obj,
-					    struct bucket_table *tbl,
-					    void **data)
+static void *rhashtable_lookup_one(struct rhashtable *ht,
+				   struct bucket_table *tbl, unsigned int hash,
+				   const void *key, struct rhash_head *obj)
 {
+	struct rhashtable_compare_arg arg = {
+		.ht = ht,
+		.key = key,
+	};
+	struct rhash_head __rcu **pprev;
 	struct rhash_head *head;
-	unsigned int hash;
-	int err;
+	int elasticity;
 
-	tbl = rhashtable_last_table(ht, tbl);
-	hash = head_hashfn(ht, tbl, obj);
-	spin_lock_nested(rht_bucket_lock(tbl, hash), SINGLE_DEPTH_NESTING);
-
-	err = -EEXIST;
-	if (key) {
-		*data = rhashtable_lookup_fast(ht, key, ht->p);
-		if (*data)
-			goto exit;
+	elasticity = ht->elasticity;
+	pprev = &tbl->buckets[hash];
+	rht_for_each(head, tbl, hash) {
+		struct rhlist_head *list;
+		struct rhlist_head *plist;
+
+		elasticity--;
+		if (!key ||
+		    (ht->p.obj_cmpfn ?
+		     ht->p.obj_cmpfn(&arg, rht_obj(ht, head)) :
+		     rhashtable_compare(&arg, rht_obj(ht, head))))
+			continue;
+
+		if (!ht->rhlist)
+			return rht_obj(ht, head);
+
+		list = container_of(obj, struct rhlist_head, rhead);
+		plist = container_of(head, struct rhlist_head, rhead);
+
+		RCU_INIT_POINTER(list->next, plist);
+		head = rht_dereference_bucket(head->next, tbl, hash);
+		RCU_INIT_POINTER(list->rhead.next, head);
+		rcu_assign_pointer(*pprev, obj);
+
+		return NULL;
 	}
 
-	err = -E2BIG;
-	if (unlikely(rht_grow_above_max(ht, tbl)))
-		goto exit;
+	if (elasticity <= 0)
+		return ERR_PTR(-EAGAIN);
+
+	return ERR_PTR(-ENOENT);
+}
+
+static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
+						  struct bucket_table *tbl,
+						  unsigned int hash,
+						  struct rhash_head *obj,
+						  void *data)
+{
+	struct bucket_table *new_tbl;
+	struct rhash_head *head;
+
+	if (!IS_ERR_OR_NULL(data))
+		return ERR_PTR(-EEXIST);
 
-	err = -EAGAIN;
-	if (rhashtable_check_elasticity(ht, tbl, hash) ||
-	    rht_grow_above_100(ht, tbl))
-		goto exit;
+	if (PTR_ERR(data) != -EAGAIN && PTR_ERR(data) != -ENOENT)
+		return ERR_CAST(data);
 
-	err = 0;
+	new_tbl = rcu_dereference(tbl->future_tbl);
+	if (new_tbl)
+		return new_tbl;
+
+	if (PTR_ERR(data) != -ENOENT)
+		return ERR_CAST(data);
+
+	if (unlikely(rht_grow_above_max(ht, tbl)))
+		return ERR_PTR(-E2BIG);
+
+	if (unlikely(rht_grow_above_100(ht, tbl)))
+		return ERR_PTR(-EAGAIN);
 
 	head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
 
 	RCU_INIT_POINTER(obj->next, head);
+	if (ht->rhlist) {
+		struct rhlist_head *list;
+
+		list = container_of(obj, struct rhlist_head, rhead);
+		RCU_INIT_POINTER(list->next, NULL);
+	}
 
 	rcu_assign_pointer(tbl->buckets[hash], obj);
 
 	atomic_inc(&ht->nelems);
+	if (rht_grow_above_75(ht, tbl))
+		schedule_work(&ht->run_work);
 
-exit:
-	spin_unlock(rht_bucket_lock(tbl, hash));
+	return NULL;
+}
 
-	if (err == 0)
-		return NULL;
-	else if (err == -EAGAIN)
-		return tbl;
-	else
-		return ERR_PTR(err);
+static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
+				   struct rhash_head *obj)
+{
+	struct bucket_table *new_tbl;
+	struct bucket_table *tbl;
+	unsigned int hash;
+	spinlock_t *lock;
+	void *data;
+
+	tbl = rcu_dereference(ht->tbl);
+
+	/* All insertions must grab the oldest table containing
+	 * the hashed bucket that is yet to be rehashed.
+	 */
+	for (;;) {
+		hash = rht_head_hashfn(ht, tbl, obj, ht->p);
+		lock = rht_bucket_lock(tbl, hash);
+		spin_lock_bh(lock);
+
+		if (tbl->rehash <= hash)
+			break;
+
+		spin_unlock_bh(lock);
+		tbl = rcu_dereference(tbl->future_tbl);
+	}
+
+	data = rhashtable_lookup_one(ht, tbl, hash, key, obj);
+	new_tbl = rhashtable_insert_one(ht, tbl, hash, obj, data);
+	if (PTR_ERR(new_tbl) != -EEXIST)
+		data = ERR_CAST(new_tbl);
+
+	while (!IS_ERR_OR_NULL(new_tbl)) {
+		tbl = new_tbl;
+		hash = rht_head_hashfn(ht, tbl, obj, ht->p);
+		spin_lock_nested(rht_bucket_lock(tbl, hash),
+				 SINGLE_DEPTH_NESTING);
+
+		data = rhashtable_lookup_one(ht, tbl, hash, key, obj);
+		new_tbl = rhashtable_insert_one(ht, tbl, hash, obj, data);
+		if (PTR_ERR(new_tbl) != -EEXIST)
+			data = ERR_CAST(new_tbl);
+
+		spin_unlock(rht_bucket_lock(tbl, hash));
+	}
+
+	spin_unlock_bh(lock);
+
+	if (PTR_ERR(data) == -EAGAIN)
+		data = ERR_PTR(rhashtable_insert_rehash(ht, tbl) ?:
+			       -EAGAIN);
+
+	return data;
+}
+
+void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
+			     struct rhash_head *obj)
+{
+	void *data;
+
+	do {
+		rcu_read_lock();
+		data = rhashtable_try_insert(ht, key, obj);
+		rcu_read_unlock();
+	} while (PTR_ERR(data) == -EAGAIN);
+
+	return data;
 }
 EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
 
@@ -593,11 +687,16 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_start);
 void *rhashtable_walk_next(struct rhashtable_iter *iter)
 {
 	struct bucket_table *tbl = iter->walker.tbl;
+	struct rhlist_head *list = iter->list;
 	struct rhashtable *ht = iter->ht;
 	struct rhash_head *p = iter->p;
+	bool rhlist = ht->rhlist;
 
 	if (p) {
-		p = rht_dereference_bucket_rcu(p->next, tbl, iter->slot);
+		if (!rhlist || !(list = rcu_dereference(list->next))) {
+			p = rcu_dereference(p->next);
+			list = container_of(p, struct rhlist_head, rhead);
+		}
 		goto next;
 	}
 
@@ -605,6 +704,18 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 		int skip = iter->skip;
 
 		rht_for_each_rcu(p, tbl, iter->slot) {
+			if (rhlist) {
+				list = container_of(p, struct rhlist_head,
+						    rhead);
+				do {
+					if (!skip)
+						goto next;
+					skip--;
+					list = rcu_dereference(list->next);
+				} while (list);
+
+				continue;
+			}
 			if (!skip)
 				break;
 			skip--;
@@ -614,7 +725,8 @@ next:
 		if (!rht_is_a_nulls(p)) {
 			iter->skip++;
 			iter->p = p;
-			return rht_obj(ht, p);
+			iter->list = list;
+			return rht_obj(ht, rhlist ? &list->rhead : p);
 		}
 
 		iter->skip = 0;
@@ -803,6 +915,48 @@ int rhashtable_init(struct rhashtable *ht,
 EXPORT_SYMBOL_GPL(rhashtable_init);
 
 /**
+ * rhltable_init - initialize a new hash list table
+ * @hlt:	hash list table to be initialized
+ * @params:	configuration parameters
+ *
+ * Initializes a new hash list table.
+ *
+ * See documentation for rhashtable_init.
+ */
+int rhltable_init(struct rhltable *hlt, const struct rhashtable_params *params)
+{
+	int err;
+
+	/* No rhlist NULLs marking for now. */
+	if (params->nulls_base)
+		return -EINVAL;
+
+	err = rhashtable_init(&hlt->ht, params);
+	hlt->ht.rhlist = true;
+	return err;
+}
+EXPORT_SYMBOL_GPL(rhltable_init);
+
+static void rhashtable_free_one(struct rhashtable *ht, struct rhash_head *obj,
+				void (*free_fn)(void *ptr, void *arg),
+				void *arg)
+{
+	struct rhlist_head *list;
+
+	if (!ht->rhlist) {
+		free_fn(rht_obj(ht, obj), arg);
+		return;
+	}
+
+	list = container_of(obj, struct rhlist_head, rhead);
+	do {
+		obj = &list->rhead;
+		list = rht_dereference(list->next, ht);
+		free_fn(rht_obj(ht, obj), arg);
+	} while (list);
+}
+
+/**
  * rhashtable_free_and_destroy - free elements and destroy hash table
  * @ht:		the hash table to destroy
  * @free_fn:	callback to release resources of element
@@ -839,7 +993,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
 			     pos = next,
 			     next = !rht_is_a_nulls(pos) ?
 					rht_dereference(pos->next, ht) : NULL)
-				free_fn(rht_obj(ht, pos), arg);
+				rhashtable_free_one(ht, pos, free_fn, arg);
 		}
 	}
 

^ permalink raw reply related

* [v3 PATCH 0/2] rhashtable: rhashtable with duplicate objects
From: Herbert Xu @ 2016-09-19 10:58 UTC (permalink / raw)
  To: Johannes Berg
  Cc: David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA, Thomas Graf,
	tom-BjP2VixgY4xUbtYUoyoikg, Ben Greear
In-Reply-To: <20160919084056.GA11875-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>

v3 fixes a bug in the remove path that causes the element count
to decrease when it shouldn't, leading to a gigantic hash table
when it underflows.

v2 contains a reworked insertion slowpath to ensure that the
spinlock for the table we're inserting into is taken.

This series contains two patches.  The first adds the rhlist
interface and the second converts mac80211 to use it.  If this
works out I'll then proceed to convert the other insecure_elasticity
users over to this.

I've tested the rhlist code with test_rhashtable but I haven't
tested the mac80211 conversion.  So please give it a go and see
if it still works.

Thanks!
-- 
Email: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH net-next v6] gso: Support partial splitting at the frag_list pointer
From: Steffen Klassert @ 2016-09-19 10:58 UTC (permalink / raw)
  To: netdev; +Cc: Alexander Duyck, Eric Dumazet, Marcelo Ricardo Leitner

Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
gro may build buffers with a frag_list. This can hurt forwarding
because most NICs can't offload such packets, they need to be
segmented in software. This patch splits buffers with a frag_list
at the frag_list pointer into buffers that can be TSO offloaded.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---

Changes since v1:

- Use the assumption that all buffers in the chain excluding the last
  containing the same amount of data.

- Simplify some checks against gso partial.

- Fix the generation of IP IDs.

Changes since v2:

- Merge common code of gso partial and frag_list pointer splitting.

Changes since v3:

- Fix the checks for doing frag_list pointer splitting.

Changes since v4:

- Whitespace fix.
- Fix size calculations of the tail packet.

Changes since v5:

- Fix another size calculations of the tail packet.


 net/core/skbuff.c      | 51 +++++++++++++++++++++++++++++++++++++++-----------
 net/ipv4/af_inet.c     | 14 ++++++++++----
 net/ipv4/gre_offload.c |  6 ++++--
 net/ipv4/tcp_offload.c | 13 +++++++------
 net/ipv4/udp_offload.c |  6 ++++--
 net/ipv6/ip6_offload.c |  5 ++++-
 6 files changed, 69 insertions(+), 26 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1e329d4..7bf82a2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3097,11 +3097,31 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	sg = !!(features & NETIF_F_SG);
 	csum = !!can_checksum_protocol(features, proto);
 
-	/* GSO partial only requires that we trim off any excess that
-	 * doesn't fit into an MSS sized block, so take care of that
-	 * now.
-	 */
-	if (sg && csum && (features & NETIF_F_GSO_PARTIAL)) {
+	if (sg && csum && (mss != GSO_BY_FRAGS))  {
+		if (!(features & NETIF_F_GSO_PARTIAL)) {
+			struct sk_buff *iter;
+
+			if (!list_skb ||
+			    !net_gso_ok(features, skb_shinfo(head_skb)->gso_type))
+				goto normal;
+
+			/* Split the buffer at the frag_list pointer.
+			 * This is based on the assumption that all
+			 * buffers in the chain excluding the last
+			 * containing the same amount of data.
+			 */
+			skb_walk_frags(head_skb, iter) {
+				if (skb_headlen(iter))
+					goto normal;
+
+				len -= iter->len;
+			}
+		}
+
+		/* GSO partial only requires that we trim off any excess that
+		 * doesn't fit into an MSS sized block, so take care of that
+		 * now.
+		 */
 		partial_segs = len / mss;
 		if (partial_segs > 1)
 			mss *= partial_segs;
@@ -3109,6 +3129,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			partial_segs = 0;
 	}
 
+normal:
 	headroom = skb_headroom(head_skb);
 	pos = skb_headlen(head_skb);
 
@@ -3300,21 +3321,29 @@ perform_csum_check:
 	 */
 	segs->prev = tail;
 
-	/* Update GSO info on first skb in partial sequence. */
 	if (partial_segs) {
+		struct sk_buff *iter;
 		int type = skb_shinfo(head_skb)->gso_type;
+		unsigned short gso_size = skb_shinfo(head_skb)->gso_size;
 
 		/* Update type to add partial and then remove dodgy if set */
-		type |= SKB_GSO_PARTIAL;
+		type |= (features & NETIF_F_GSO_PARTIAL) / NETIF_F_GSO_PARTIAL * SKB_GSO_PARTIAL;
 		type &= ~SKB_GSO_DODGY;
 
 		/* Update GSO info and prepare to start updating headers on
 		 * our way back down the stack of protocols.
 		 */
-		skb_shinfo(segs)->gso_size = skb_shinfo(head_skb)->gso_size;
-		skb_shinfo(segs)->gso_segs = partial_segs;
-		skb_shinfo(segs)->gso_type = type;
-		SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset;
+		for (iter = segs; iter; iter = iter->next) {
+			skb_shinfo(iter)->gso_size = gso_size;
+			skb_shinfo(iter)->gso_segs = partial_segs;
+			skb_shinfo(iter)->gso_type = type;
+			SKB_GSO_CB(iter)->data_offset = skb_headroom(iter) + doffset;
+		}
+
+		if (tail->len - doffset <= gso_size)
+			skb_shinfo(tail)->gso_size = 0;
+		else if (tail != segs)
+			skb_shinfo(tail)->gso_segs = DIV_ROUND_UP(tail->len - doffset, gso_size);
 	}
 
 	/* Following permits correct backpressure, for protocols
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e94b47b..1effc98 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1192,7 +1192,7 @@ EXPORT_SYMBOL(inet_sk_rebuild_header);
 struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 				 netdev_features_t features)
 {
-	bool udpfrag = false, fixedid = false, encap;
+	bool udpfrag = false, fixedid = false, gso_partial, encap;
 	struct sk_buff *segs = ERR_PTR(-EINVAL);
 	const struct net_offload *ops;
 	unsigned int offset = 0;
@@ -1245,6 +1245,8 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 	if (IS_ERR_OR_NULL(segs))
 		goto out;
 
+	gso_partial = !!(skb_shinfo(segs)->gso_type & SKB_GSO_PARTIAL);
+
 	skb = segs;
 	do {
 		iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
@@ -1259,9 +1261,13 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 				iph->id = htons(id);
 				id += skb_shinfo(skb)->gso_segs;
 			}
-			tot_len = skb_shinfo(skb)->gso_size +
-				  SKB_GSO_CB(skb)->data_offset +
-				  skb->head - (unsigned char *)iph;
+
+			if (gso_partial)
+				tot_len = skb_shinfo(skb)->gso_size +
+					  SKB_GSO_CB(skb)->data_offset +
+					  skb->head - (unsigned char *)iph;
+			else
+				tot_len = skb->len - nhoff;
 		} else {
 			if (!fixedid)
 				iph->id = htons(id++);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index ecd1e09..96e0efe 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -24,7 +24,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 	__be16 protocol = skb->protocol;
 	u16 mac_len = skb->mac_len;
 	int gre_offset, outer_hlen;
-	bool need_csum, ufo;
+	bool need_csum, ufo, gso_partial;
 
 	if (!skb->encapsulation)
 		goto out;
@@ -69,6 +69,8 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 		goto out;
 	}
 
+	gso_partial = !!(skb_shinfo(segs)->gso_type & SKB_GSO_PARTIAL);
+
 	outer_hlen = skb_tnl_header_len(skb);
 	gre_offset = outer_hlen - tnl_hlen;
 	skb = segs;
@@ -96,7 +98,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 		greh = (struct gre_base_hdr *)skb_transport_header(skb);
 		pcsum = (__sum16 *)(greh + 1);
 
-		if (skb_is_gso(skb)) {
+		if (gso_partial) {
 			unsigned int partial_adj;
 
 			/* Adjust checksum to account for the fact that
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 5c59649..bc68da3 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -90,12 +90,6 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 		goto out;
 	}
 
-	/* GSO partial only requires splitting the frame into an MSS
-	 * multiple and possibly a remainder.  So update the mss now.
-	 */
-	if (features & NETIF_F_GSO_PARTIAL)
-		mss = skb->len - (skb->len % mss);
-
 	copy_destructor = gso_skb->destructor == tcp_wfree;
 	ooo_okay = gso_skb->ooo_okay;
 	/* All segments but the first should have ooo_okay cleared */
@@ -108,6 +102,13 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	/* Only first segment might have ooo_okay set */
 	segs->ooo_okay = ooo_okay;
 
+	/* GSO partial and frag_list segmentation only requires splitting
+	 * the frame into an MSS multiple and possibly a remainder, both
+	 * cases return a GSO skb. So update the mss now.
+	 */
+	if (skb_is_gso(segs))
+		mss *= skb_shinfo(segs)->gso_segs;
+
 	delta = htonl(oldlen + (thlen + mss));
 
 	skb = segs;
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 81f253b..f9333c9 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -21,7 +21,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 	__be16 new_protocol, bool is_ipv6)
 {
 	int tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb);
-	bool remcsum, need_csum, offload_csum, ufo;
+	bool remcsum, need_csum, offload_csum, ufo, gso_partial;
 	struct sk_buff *segs = ERR_PTR(-EINVAL);
 	struct udphdr *uh = udp_hdr(skb);
 	u16 mac_offset = skb->mac_header;
@@ -88,6 +88,8 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 		goto out;
 	}
 
+	gso_partial = !!(skb_shinfo(segs)->gso_type & SKB_GSO_PARTIAL);
+
 	outer_hlen = skb_tnl_header_len(skb);
 	udp_offset = outer_hlen - tnl_hlen;
 	skb = segs;
@@ -117,7 +119,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 		 * will be using a length value equal to only one MSS sized
 		 * segment instead of the entire frame.
 		 */
-		if (skb_is_gso(skb)) {
+		if (gso_partial) {
 			uh->len = htons(skb_shinfo(skb)->gso_size +
 					SKB_GSO_CB(skb)->data_offset +
 					skb->head - (unsigned char *)uh);
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 22e90e5..e7bfd55 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -69,6 +69,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 	int offset = 0;
 	bool encap, udpfrag;
 	int nhoff;
+	bool gso_partial;
 
 	skb_reset_network_header(skb);
 	nhoff = skb_network_header(skb) - skb_mac_header(skb);
@@ -101,9 +102,11 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 	if (IS_ERR(segs))
 		goto out;
 
+	gso_partial = !!(skb_shinfo(segs)->gso_type & SKB_GSO_PARTIAL);
+
 	for (skb = segs; skb; skb = skb->next) {
 		ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
-		if (skb_is_gso(skb))
+		if (gso_partial)
 			payload_len = skb_shinfo(skb)->gso_size +
 				      SKB_GSO_CB(skb)->data_offset +
 				      skb->head - (unsigned char *)(ipv6h + 1);
-- 
1.9.1

^ permalink raw reply related

* Re: [v2 PATCH 0/2] rhashtable: rhashtable with duplicate objects
From: Johannes Berg @ 2016-09-19 10:58 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA, Thomas Graf,
	tom-BjP2VixgY4xUbtYUoyoikg, Ben Greear
In-Reply-To: <20160919104801.GA12774-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>

On Mon, 2016-09-19 at 18:48 +0800, Herbert Xu wrote:
> On Mon, Sep 19, 2016 at 12:10:27PM +0200, Johannes Berg wrote:
> > 
> > Btw, for debug I put
> > 
> > BUG_ON(atomic_read(&ht->nelems) < 0);
> > 
> > after the atomic_dec() in __rhashtable_remove_fast_one(). That
> > makes
> > the kernel crash instantly on the buggy code, and I just have to
> > run a
> > single test ("wpas_ctrl_interface_add_many") to get there.
> 
> Aha I see the problem now.  The nelems logic on remove is broken.

I looked at it for a long time, but didn't see it :) But yeah, I've
come to the same conclusion by adding debugging of the chains etc.

> I'll send out a v3.

I'll test it when I have it :)

johannes

^ permalink raw reply

* Re: [net-next PATCH] mlx4: add missed recycle opportunity for XDP_TX on TX failure
From: Tariq Toukan @ 2016-09-19 10:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, tariqt
  Cc: tom, bblanco, rana.shahot, David S. Miller
In-Reply-To: <20160919073957.32059.59156.stgit@firesoul>


On 19/09/2016 10:40 AM, Jesper Dangaard Brouer wrote:
> Correct drop handling for XDP_TX on TX failure, were recently added in
> commit 95357907ae73 ("mlx4: fix XDP_TX is acting like XDP_PASS on TX
> ring full").
>
> The change missed an opportunity for recycling the RX page, instead of
> going through the page allocator, like the regular XDP_DROP action does.
> This patch cease the opportunity, by going through the XDP_DROP case.
>
> Fixes: 95357907ae73 ("mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c |    3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index c80073e4947f..919f0604d04a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -906,11 +906,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   							length, tx_index,
>   							&doorbell_pending))
>   					goto consumed;
> -				goto next; /* Drop on xmit failure */
> +				goto xdp_drop; /* Drop on xmit failure */
>   			default:
>   				bpf_warn_invalid_xdp_action(act);
>   			case XDP_ABORTED:
>   			case XDP_DROP:
> +			xdp_drop:
>   				if (mlx4_en_rx_recycle(ring, frags))
>   					goto consumed;
>   				goto next;
>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox