Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH V2 net 02/20] net/ena: fix error handling when probe fails
From: Matt Wilson @ 2016-12-05  4:09 UTC (permalink / raw)
  To: Netanel Belgazal
  Cc: linux-kernel, davem, netdev, dwmw, zorik, alex, saeed, msw,
	aliguori, nafea
In-Reply-To: <1480857578-5065-3-git-send-email-netanel@annapurnalabs.com>

On Sun, Dec 04, 2016 at 03:19:20PM +0200, Netanel Belgazal wrote:
> When driver fails in probe, it will release all resources, including
> adapter.
> In case of probe failure, ena_remove should not try to free the adapter
> resources.

Please word wrap your commit message around 75 columns.

> Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com>

Reviewed-by: Matt Wilson <msw@amazon.com>

> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 33a760e..397c9bc 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -3054,6 +3054,7 @@ static int ena_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  err_free_region:
>  	ena_release_bars(ena_dev, pdev);
>  err_free_ena_dev:
> +	pci_set_drvdata(pdev, NULL);
>  	vfree(ena_dev);
>  err_disable_device:
>  	pci_disable_device(pdev);

^ permalink raw reply

* Re: [PATCH V2 net 01/20] net/ena: remove ntuple filter support from device feature list
From: Matt Wilson @ 2016-12-05  4:08 UTC (permalink / raw)
  To: Netanel Belgazal
  Cc: linux-kernel, davem, netdev, dwmw, zorik, alex, saeed, msw,
	aliguori, nafea
In-Reply-To: <1480857578-5065-2-git-send-email-netanel@annapurnalabs.com>

On Sun, Dec 04, 2016 at 03:19:19PM +0200, Netanel Belgazal wrote:
> Remove NETIF_F_NTUPLE from netdev->features.
> The ENA device driver does not support ntuple filtering.
> 
> Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com>

Reviewed-by: Matt Wilson <msw@amazon.com>

> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index bfeaec5..33a760e 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -2729,7 +2729,6 @@ static void ena_set_dev_offloads(struct ena_com_dev_get_features_ctx *feat,
>  	netdev->features =
>  		dev_features |
>  		NETIF_F_SG |
> -		NETIF_F_NTUPLE |
>  		NETIF_F_RXHASH |
>  		NETIF_F_HIGHDMA;
>  

^ permalink raw reply

* Re: "af_unix: conditionally use freezable blocking calls in read" is wrong
From: Al Viro @ 2016-12-05  3:52 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, xiyou.wangcong
In-Reply-To: <20161204.214214.1868750446620130834.davem@davemloft.net>

On Sun, Dec 04, 2016 at 09:42:14PM -0500, David Miller wrote:
> > 	I've run into that converting AF_UNIX to generic_file_splice_read();
> > I can kludge around that ("freezable unless ->msg_iter is ITER_PIPE"), but
> > that only delays trouble.
> > 
> > 	Note that the only other user of freezable_schedule_timeout() is
> > a very different story - it's a kernel thread, which *does* have a guaranteed
> > locking environment.  Making such assumptions in unix_stream_recvmsg(),
> > OTOH, is insane...
> 
> We have to otherwise Android phones drain their batteries in 10
> minutes.
> 
> I'm not going to revert this and be responsible for that.
> 
> So you have to find a way to make the freezable calls legitimate.

Oh, well...  As I said, I can kludge around that - call from
generic_file_splice_read() can be distinguished by looking at the
->msg_iter->type; it still means unpleasantness for kernel_recvmsg()
users - in effect, it can only be called with locks held if you know that
the socket is not an AF_UNIX one.

BTW, how do they deal with plain pipes?

^ permalink raw reply

* [PATCH] ixgbe: initialize u64_stats_sync structures early at ixgbe_probe
From: Liwei Song @ 2016-12-05  3:40 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: intel-wired-lan, netdev, linux-kernel, liwei.song

Fix the following CallTrace:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 71 PID: 1 Comm: swapper/0 Not tainted 4.8.8-WR9.0.0.1_standard #11
Hardware name: Intel Corporation S2600WTT/S2600WTT,
BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
 00200086 00200086 eb5e1ab8 c144dd70 00000000 00000000 eb5e1af8 c10af89a
 c1d23de4 eb5e1af8 00000009 eb5d8600 eb5d8638 eb5e1af8 c10b14d8 00000009
 0000000a c1d32911 00000000 00000000 e44c826c eb5d8000 eb5e1b74 c10b214e
Call Trace:
 [<c144dd70>] dump_stack+0x5f/0x8f
 [<c10af89a>] register_lock_class+0x25a/0x4c0
 [<c10b14d8>] ? check_irq_usage+0x88/0xc0
 [<c10b214e>] __lock_acquire+0x5e/0x17a0
 [<c1abdb9b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
 [<c10cf14a>] ? rcu_read_lock_sched_held+0x8a/0x90
 [<c10b3c5f>] lock_acquire+0x9f/0x1f0
 [<c1922dcf>] ? dev_get_stats+0x5f/0x110
 [<c176e6b3>] ixgbe_get_stats64+0x113/0x320
 [<c1922dcf>] ? dev_get_stats+0x5f/0x110
 [<c1922dcf>] dev_get_stats+0x5f/0x110
 [<c1ab5415>] rtnl_fill_stats+0x40/0x105
 [<c193dd45>] rtnl_fill_ifinfo+0x4c5/0xd20
 [<c11c5115>] ? __kmalloc_node_track_caller+0x1a5/0x410
 [<c1917487>] ? __kmalloc_reserve.isra.42+0x27/0x80
 [<c191754f>] ? __alloc_skb+0x6f/0x270
 [<c1942291>] rtmsg_ifinfo_build_skb+0x71/0xd0
 [<c194230a>] rtmsg_ifinfo.part.23+0x1a/0x50
 [<c1923dad>] ? call_netdevice_notifiers_info+0x2d/0x60
 [<c194236b>] rtmsg_ifinfo+0x2b/0x40
 [<c192f997>] register_netdevice+0x3d7/0x4d0
 [<c192faa7>] register_netdev+0x17/0x30
 [<c177b83d>] ixgbe_probe+0x118d/0x1610
 [<c1498202>] local_pci_probe+0x32/0x80
 [<c1498172>] ? pci_match_device+0xd2/0x100
 [<c14991e0>] pci_device_probe+0xc0/0x110
 [<c1652cc5>] driver_probe_device+0x1c5/0x280
 [<c1498172>] ? pci_match_device+0xd2/0x100
 [<c1652e09>] __driver_attach+0x89/0x90
 [<c1652d80>] ? driver_probe_device+0x280/0x280
 [<c165114f>] bus_for_each_dev+0x4f/0x80
 [<c165269e>] driver_attach+0x1e/0x20
 [<c1652d80>] ? driver_probe_device+0x280/0x280
 [<c1652317>] bus_add_driver+0x1a7/0x220
 [<c1653a79>] driver_register+0x59/0xe0
 [<c1f897b8>] ? igb_init_module+0x49/0x49
 [<c1497b2a>] __pci_register_driver+0x4a/0x50
 [<c1f8985d>] ixgbe_init_module+0xa5/0xc4
 [<c1000485>] do_one_initcall+0x35/0x150
 [<c107e818>] ? parameq+0x18/0x70
 [<c1f395d8>] ? repair_env_string+0x12/0x51
 [<c107ead0>] ? parse_args+0x260/0x3b0
 [<c1074f73>] ? __usermodehelper_set_disable_depth+0x43/0x50
 [<c1f39e90>] kernel_init_freeable+0x19b/0x267
 [<c1f395c6>] ? set_debug_rodata+0xf/0xf
 [<c10b1e7b>] ? trace_hardirqs_on+0xb/0x10
 [<c1abdc02>] ? _raw_spin_unlock_irq+0x32/0x50
 [<c1085f0b>] ? finish_task_switch+0xab/0x1f0
 [<c1085ec9>] ? finish_task_switch+0x69/0x1f0
 [<c1ab6a30>] kernel_init+0x10/0x110
 [<c108bd65>] ? schedule_tail+0x25/0x80
 [<c1abe422>] ret_from_kernel_thread+0xe/0x24
 [<c1ab6a20>] ? rest_init+0x130/0x130

This CallTrace occurred on 32-bit kernel with CONFIG_PROVE_LOCKING
enabled.

This happens at ixgbe driver probe hardware stage, when comes to
ixgbe_get_stats64, the seqcount/seqlock still not initialize, although
this was initialize in TX/RX resources setup routin, but it was too late,
then lockdep give this Warning.

To fix this, move the u64_stats_init function to driver probe stage,
which before we get the status of seqcount and after the RX/TX ring
was finished init.

Signed-off-by: Liwei Song <liwei.song@windriver.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fee1f29..ab1d114 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5811,8 +5811,6 @@ int ixgbe_setup_tx_resources(struct ixgbe_ring *tx_ring)
 	if (!tx_ring->tx_buffer_info)
 		goto err;
 
-	u64_stats_init(&tx_ring->syncp);
-
 	/* round up to nearest 4K */
 	tx_ring->size = tx_ring->count * sizeof(union ixgbe_adv_tx_desc);
 	tx_ring->size = ALIGN(tx_ring->size, 4096);
@@ -5895,8 +5893,6 @@ int ixgbe_setup_rx_resources(struct ixgbe_ring *rx_ring)
 	if (!rx_ring->rx_buffer_info)
 		goto err;
 
-	u64_stats_init(&rx_ring->syncp);
-
 	/* Round up to nearest 4K */
 	rx_ring->size = rx_ring->count * sizeof(union ixgbe_adv_rx_desc);
 	rx_ring->size = ALIGN(rx_ring->size, 4096);
@@ -9686,6 +9682,12 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto err_sw_init;
 
+	for (i = 0; i < adapter->num_rx_queues; i++)
+		u64_stats_init(&adapter->rx_ring[i]->syncp);
+
+	for (i = 0; i < adapter->num_tx_queues; i++)
+		u64_stats_init(&adapter->tx_ring[i]->syncp);
+
 	/* WOL not supported for all devices */
 	adapter->wol = 0;
 	hw->eeprom.ops.read(hw, 0x2c, &adapter->eeprom_cap);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH] net: ping: check minimum size on ICMP header length
From: Lorenzo Colitti @ 2016-12-05  3:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, netdev@vger.kernel.org, Min Chong, Qidan He,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, lkml
In-Reply-To: <20161203005853.GA117599@beast>

On Sat, Dec 3, 2016 at 9:58 AM, Kees Cook <keescook@chromium.org> wrote:
> -       if (len > 0xFFFF)
> +       if (len > 0xFFFF || len < icmph_len)
>                 return -EMSGSIZE;

EMSGSIZE usually means the message is too long. Maybe use EINVAL?
That's what the code will return if the passed-in ICMP header is
invalid (e.g., is not an echo request).

^ permalink raw reply

* Re: [PATCH V2 net 00/20] Increase ENA driver version to 1.1.2
From: Matt Wilson @ 2016-12-05  3:30 UTC (permalink / raw)
  To: David Miller
  Cc: netanel, linux-kernel, netdev, dwmw, zorik, alex, saeed, msw,
	aliguori, nafea
In-Reply-To: <20161204.213743.1610654756480441815.davem@davemloft.net>

On Sun, Dec 04, 2016 at 09:37:43PM -0500, David Miller wrote:
> 
> It is not appropriate to submit so many patches at one time.

Indeed, https://www.kernel.org/doc/Documentation/SubmittingPatches
recommends submitting no more than 15 or so at once.

> Please keep your patch series to no more than about a dozen
> at a time.

How about 15 from SubmittingPatches? The first 15 in the series are
all important bugfixes. Should Netanel resubmit a series with just the
bugfixes and a new cover letter? Or are you willing to consider the
first 15 of this series as posted?

> Also, group your changes logically and tie an appropriately
> descriptive cover letter.
> 
> "Increase driver version to X.Y.Z" tells the reader absolutely
> nothing.  Someone reading that Subject line in the GIT logs
> will have no idea what the overall purpose of the patch series
> is and what it accomplishes.

You're right, the cover letter subject needs to be better. There is
only one commit submitted with the subject "increase driver version to
1.1.2." - Patch 20/20. It is logically like:

commit b8b2372de9cc00d5ed667c7b8db29b6cfbf037f5
Author: Manish Chopra <manish.chopra@qlogic.com>
Date:   Wed Aug 3 04:02:04 2016 -0400

    qlcnic: Update version to 5.3.65
    
    Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

[...]

commit ae33256c55d2fefcad8712e750b846461994a1af
Author: Bimmy Pujari <bimmy.pujari@intel.com>
Date:   Mon Jun 20 09:10:39 2016 -0700

    i40e/i40evf-bump version to 1.6.11
    
    Signed-off-by: Bimmy Pujari <bimmy.pujari@intel.com>
    Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[...]

commit 5264cc63ba10ebfa0e54e3e641cce2656c7a60e8
Author: Jacob Keller <jacob.e.keller@intel.com>
Date:   Tue Jun 7 16:09:02 2016 -0700

    fm10k: bump version number
    
    Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
    Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[...]

commit a58a3e68037647de78e3461194239a1104f76003
Author: Michael Chan <michael.chan@broadcom.com>
Date:   Fri Jul 1 18:46:20 2016 -0400

    bnxt_en: Update firmware spec. to 1.3.0.
    
    And update driver version to 1.3.0.
    
    Signed-off-by: Michael Chan <michael.chan@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

> You really need to describe the high level purpose of the patch set.
> Is it adding a new feature?  What is that feature?  Why are you
> adding that feature?  How is that feature implemented?  Why is
> it implemented that way?

The priority is to get bug fixes to the ENA driver in 4.9. Let's focus
on the first 15.

--msw

^ permalink raw reply

* [PATCH net] net: ep93xx_eth: Do not crash unloading module
From: Florian Fainelli @ 2016-12-05  3:22 UTC (permalink / raw)
  To: netdev; +Cc: davem, hsweeten, Florian Fainelli

When we unload the ep93xx_eth, whether we have opened the network
interface or not, we will either hit a kernel paging request error, or a
simple NULL pointer de-reference because:

- if ep93xx_open has been called, we have created a valid DMA mapping
  for ep->descs, when we call ep93xx_stop, we also call
  ep93xx_free_buffers, ep->descs now has a stale value

- if ep93xx_open has not been called, we have a NULL pointer for
  ep->descs, so performing any operation against that address just won't
  work

Fix this by adding a NULL pointer check for ep->descs which means that
ep93xx_free_buffers() was able to successfully tear down the descriptors
and free the DMA cookie as well.

Fixes: 1d22e05df818 ("[PATCH] Cirrus Logic ep93xx ethernet driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/cirrus/ep93xx_eth.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/cirrus/ep93xx_eth.c b/drivers/net/ethernet/cirrus/ep93xx_eth.c
index de9f7c97d916..9a161e981529 100644
--- a/drivers/net/ethernet/cirrus/ep93xx_eth.c
+++ b/drivers/net/ethernet/cirrus/ep93xx_eth.c
@@ -468,6 +468,9 @@ static void ep93xx_free_buffers(struct ep93xx_priv *ep)
 	struct device *dev = ep->dev->dev.parent;
 	int i;
 
+	if (!ep->descs)
+		return;
+
 	for (i = 0; i < RX_QUEUE_ENTRIES; i++) {
 		dma_addr_t d;
 
@@ -490,6 +493,7 @@ static void ep93xx_free_buffers(struct ep93xx_priv *ep)
 
 	dma_free_coherent(dev, sizeof(struct ep93xx_descs), ep->descs,
 							ep->descs_dma_addr);
+	ep->descs = NULL;
 }
 
 static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
-- 
2.9.3

^ permalink raw reply related

* Re: [Patch net-next] act_mirred: fix a typo in get_dev
From: Hadar Hen Zion @ 2016-12-04 13:12 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: Jiri Pirko
In-Reply-To: <1480790161-8097-1-git-send-email-xiyou.wangcong@gmail.com>



On 12/3/2016 8:36 PM, Cong Wang wrote:
> Cc: Hadar Hen Zion <hadarh@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>   net/sched/act_mirred.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index bb09ba3..2d9fa6e 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -321,7 +321,7 @@ static int tcf_mirred_device(const struct tc_action *a, struct net *net,
>   	int ifindex = tcf_mirred_ifindex(a);
>   
>   	*mirred_dev = __dev_get_by_index(net, ifindex);
> -	if (!mirred_dev)
> +	if (!*mirred_dev)
>   		return -EINVAL;
>   	return 0;
>   }
Thank you for this fix! good catch.
I know it's already applied.

Hadar

^ permalink raw reply

* Re: [RFC PATCH 2/2] Documentation: devictree: Add macb mdio bindings
From: Harini Katakam @ 2016-12-05  3:03 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Rob Herring, Harini Katakam, Nicolas Ferre, David Miller,
	Pawel Moll, Mark Rutland,
	ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org,
	Kumar Gala, Boris Brezillon,
	alexandre.belloni-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	michals-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org
In-Reply-To: <bfbc84ce-4975-3213-3b46-8c394c717ea9-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Hi Florian,

On Sun, Dec 4, 2016 at 4:10 AM, Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Le 12/03/16 à 13:35, Rob Herring a écrit :
>> On Mon, Nov 28, 2016 at 03:19:27PM +0530, Harini Katakam wrote:
>>> +- reg: Address and length of the register set of MAC to be used
>>> +- clock-names: Tuple listing input clock names.
>>> +    Required elements: 'pclk', 'hclk'
>>> +    Optional elements: 'tx_clk'
>>> +- clocks: Phandles to input clocks.
>
> You are also missing mandatory properties:
>
> #address-cells = <1> and #size-cells = <0>
>
> Where is patch 1? Can you make sure you have the same recipient list for
> both patches in this series so we can review both the binding and driver?
>

Thanks for review, I'll update.

I did send the cover letter, patch 1 and 2 to the same recipient list.
I can see them on the mailing list. The first patch is called:
[RFC PATCH 1/2] net: macb: Add MDIO driver for accessing multiple PHY devices
I hope you can find it.

Regards,
Harini
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 2/2] Documentation: devictree: Add macb mdio bindings
From: Harini Katakam @ 2016-12-05  2:55 UTC (permalink / raw)
  To: Rob Herring
  Cc: Harini Katakam, Nicolas Ferre, David Miller, Pawel Moll,
	Mark Rutland,
	ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org,
	Kumar Gala, Boris Brezillon,
	alexandre.belloni-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	michals-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org
In-Reply-To: <20161203213553.f3agwvaseufglnq6@rob-hp-laptop>

Hi Rob,


Thanks for the review.
On Sun, Dec 4, 2016 at 3:05 AM, Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Mon, Nov 28, 2016 at 03:19:27PM +0530, Harini Katakam wrote:
<snip>
>> +Required properties:
>> +- compatible: Should be "cdns,macb-mdio"
>
> Only one version ever? This needs more specific compatible strings.
>

This is part of the Cadence MAC in a way.
I can use revision number from the Cadence spec I was working with.
But it is not necessarily specific that version.

I'll take care of the other comments in the next version.

Regards,
Harini
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: "af_unix: conditionally use freezable blocking calls in read" is wrong
From: David Miller @ 2016-12-05  2:42 UTC (permalink / raw)
  To: viro; +Cc: netdev, xiyou.wangcong
In-Reply-To: <20161204210455.GI1555@ZenIV.linux.org.uk>

From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Sun, 4 Dec 2016 21:04:55 +0000

> 	Could we please kill that kludge?  "af_unix: use freezable blocking
> calls in read" had been wrong to start with; having a method make assumptions
> of that sort ("nobody will call me while holding locks I hadn't thought of")
> is asking for serious trouble.  splice is just a place where lockdep has
> caught that - we *can't* assume that nobody will ever call kernel_recvmsg()
> while holding some locks.
> 
> 	I've run into that converting AF_UNIX to generic_file_splice_read();
> I can kludge around that ("freezable unless ->msg_iter is ITER_PIPE"), but
> that only delays trouble.
> 
> 	Note that the only other user of freezable_schedule_timeout() is
> a very different story - it's a kernel thread, which *does* have a guaranteed
> locking environment.  Making such assumptions in unix_stream_recvmsg(),
> OTOH, is insane...

We have to otherwise Android phones drain their batteries in 10
minutes.

I'm not going to revert this and be responsible for that.

So you have to find a way to make the freezable calls legitimate.

^ permalink raw reply

* Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc
From: Eric Dumazet @ 2016-12-05  2:45 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev, Yuchung Cheng
In-Reply-To: <20161203.203708.151248945683979245.davem@davemloft.net>

On Sat, Dec 3, 2016 at 5:37 PM, David Miller <davem@davemloft.net> wrote:

>
> Sorry, just noticed by visual inspection.  I expected the
> struct sock part to show up in the same patch as the one
> that removed it from tcp_sock and adjusted the users.
>
> I'll re-review this series, thanks.

Yes, I wanted to have after patch 7, the final cache line disposition
of struct sock.
(Quite critical for future bisections if needed, or performance tests
I mentioned)

I could have used a 'unsigned long _temp_padding', but just chose the
final name for the field.

Thanks.

^ permalink raw reply

* [RFC] udp: some improvements on RX path.
From: Eric Dumazet @ 2016-12-05  2:43 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev

We currently access 3 cache lines from an skb in receive queue while
holding receive queue lock :

First cache line (contains ->next / prev pointers )
2nd cache line (skb->peeked)
3rd cache line (skb->truesize)

I believe we could get rid of skb->peeked completely.

I will cook a patch, but basically the idea is that the last owner of a
skb (right before skb->users becomes 0) can have the 'ownership' and
thus increase stats.

The 3rd cache line miss is easily avoided by the following patch.

But I also want to work on the idea I gave few days back, having a
separate queue and use splice to transfer the 'softirq queue' into
a calm queue in a different cache line.

I expect a 50 % performance increase under load, maybe 1.5 Mpps.

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 16d88ba9ff1c..37d4e8da6482 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1191,7 +1191,13 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 /* Note: called with sk_receive_queue.lock held */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-	udp_rmem_release(sk, skb->truesize, 1);
+	/* HACK HACK HACK :
+	 * Instead of using skb->truesize here, find a copy of it in skb->dev.
+	 * This avoids a cache line miss in this path,
+	 * while sk_receive_queue lock is held.
+	 * Look at __udp_enqueue_schedule_skb() to find where this copy is done.
+	 */
+	udp_rmem_release(sk, (int)(unsigned long)skb->dev, 1);
 }
 EXPORT_SYMBOL(udp_skb_destructor);

@@ -1201,6 +1207,11 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	int rmem, delta, amt, err = -ENOMEM;
 	int size = skb->truesize;

+	/* help udp_skb_destructor() to get skb->truesize from skb->dev
+	 * without a cache line miss.
+	 */
+	skb->dev = (struct net_device *)(unsigned long)size;
+
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
 	 */
@@ -1233,7 +1244,6 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	/* no need to setup a destructor, we will explicitly release the
 	 * forward allocated memory on dequeue
 	 */
-	skb->dev = NULL;
 	sock_skb_set_dropcount(sk, skb);

 	__skb_queue_tail(list, skb);

^ permalink raw reply related

* Re: [PATCH V2 net 00/20] Increase ENA driver version to 1.1.2
From: David Miller @ 2016-12-05  2:37 UTC (permalink / raw)
  To: netanel
  Cc: linux-kernel, netdev, dwmw, zorik, alex, saeed, msw, aliguori,
	nafea
In-Reply-To: <1480857578-5065-1-git-send-email-netanel@annapurnalabs.com>

It is not appropriate to submit so many patches at one time.

Please keep your patch series to no more than about a dozen
at a time.

Also, group your changes logically and tie an appropriately
descriptive cover letter.

"Increase driver version to X.Y.Z" tells the reader absolutely
nothing.  Someone reading that Subject line in the GIT logs
will have no idea what the overall purpose of the patch series
is and what it accomplishes.

You really need to describe the high level purpose of the patch set.
Is it adding a new feature?  What is that feature?  Why are you
adding that feature?  How is that feature implemented?  Why is
it implemented that way?

^ permalink raw reply

* Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc
From: David Miller @ 2016-12-04  1:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: edumazet, netdev, ycheng
In-Reply-To: <1480814031.18162.439.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 03 Dec 2016 17:13:51 -0800

> On Sat, 2016-12-03 at 19:16 -0500, David Miller wrote:
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Sat,  3 Dec 2016 11:14:57 -0800
>> 
>> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> > index d8be083ab0b0..fc5848dad7a4 100644
>> > --- a/include/linux/tcp.h
>> > +++ b/include/linux/tcp.h
>> > @@ -186,7 +186,6 @@ struct tcp_sock {
>> >  	u32	tsoffset;	/* timestamp offset */
>> >  
>> >  	struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
>> > -	unsigned long	tsq_flags;
>> >  
>> >  	/* Data for direct copy to user */
>> >  	struct {
>> 
>> Hmmm, did you forget to "git add include/net/sock.h" before making
>> this commit?
> 
> sk_tsq_flags was added in prior patch in the series ( 7/8 net:
> reorganize struct sock for better data locality)
> 
> What is the problem with this part ?

Sorry, just noticed by visual inspection.  I expected the
struct sock part to show up in the same patch as the one
that removed it from tcp_sock and adjusted the users.

I'll re-review this series, thanks.

^ permalink raw reply

* Re: [PATCH v2 main-v4.9-rc7] net/ipv6: allow sysctl to change link-local address generation mode
From: Roopa Prabhu @ 2016-12-05  2:08 UTC (permalink / raw)
  To: Felix Jia; +Cc: netdev, Carl Smith
In-Reply-To: <20161204223136.12119-1-felix.jia@alliedtelesis.co.nz>

On 12/4/16, 2:31 PM, Felix Jia wrote:
> Removed the rtnl lock and switch to use RCU lock to iterate through
> the netdev list.
>
> The address generation mode for IPv6 link-local can only be configured
> by netlink messages. This patch adds the ability to change the address
> generation mode via sysctl.
>
> An possible improvement is to remove the addrgenmode variable from the
> idev structure and use the systcl storage for the flag.
>
> The patch is based from v4.9-rc7 in mainline.
>
> Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz>
> Cc: Carl Smith <carl.smith@alliedtelesis.co.nz>
> ---
>  include/linux/ipv6.h      |  1 +
>  include/uapi/linux/ipv6.h |  1 +
>  net/ipv6/addrconf.c       | 73 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 74 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index a064997..0d9e5d4 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -64,6 +64,7 @@ struct ipv6_devconf {
>  	} stable_secret;
>  	__s32		use_oif_addrs_only;
>  	__s32		keep_addr_on_down;
> +	__s32		addrgenmode;
>  
>  	struct ctl_table_header *sysctl_header;
>  };
> diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
> index 8c27723..0524e2c 100644
> --- a/include/uapi/linux/ipv6.h
> +++ b/include/uapi/linux/ipv6.h
> @@ -178,6 +178,7 @@ enum {
>  	DEVCONF_DROP_UNSOLICITED_NA,
>  	DEVCONF_KEEP_ADDR_ON_DOWN,
>  	DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
> +	DEVCONF_ADDRGENMODE,
>  	DEVCONF_MAX
>  };
>  
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 4bc5ba3..2b83cc7 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -238,6 +238,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
>  	.use_oif_addrs_only	= 0,
>  	.ignore_routes_with_linkdown = 0,
>  	.keep_addr_on_down	= 0,
> +	.addrgenmode = IN6_ADDR_GEN_MODE_EUI64,
>  };
>  
>  static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
> @@ -284,6 +285,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
>  	.use_oif_addrs_only	= 0,
>  	.ignore_routes_with_linkdown = 0,
>  	.keep_addr_on_down	= 0,
> +	.addrgenmode = IN6_ADDR_GEN_MODE_EUI64,
>  };
>  
>  /* Check if a valid qdisc is available */
> @@ -378,7 +380,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
>  	if (ndev->cnf.stable_secret.initialized)
>  		ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
>  	else
> -		ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_EUI64;
> +		ndev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
>  
>  	ndev->cnf.mtu6 = dev->mtu;
>  	ndev->nd_parms = neigh_parms_alloc(dev, &nd_tbl);
> @@ -4950,6 +4952,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
>  	array[DEVCONF_DROP_UNICAST_IN_L2_MULTICAST] = cnf->drop_unicast_in_l2_multicast;
>  	array[DEVCONF_DROP_UNSOLICITED_NA] = cnf->drop_unsolicited_na;
>  	array[DEVCONF_KEEP_ADDR_ON_DOWN] = cnf->keep_addr_on_down;
> +	array[DEVCONF_ADDRGENMODE] = cnf->addrgenmode;
>  }
>  
>  static inline size_t inet6_ifla6_size(void)
> @@ -5496,6 +5499,67 @@ int addrconf_sysctl_mtu(struct ctl_table *ctl, int write,
>  	return proc_dointvec_minmax(&lctl, write, buffer, lenp, ppos);
>  }
>  
> +static void addrconf_addrgenmode_change(struct net *net)
> +{
> +	struct net_device *dev;
> +	struct inet6_dev *idev;
> +
> +	rcu_read_lock();
> +	for_each_netdev_rcu(net, dev) {
> +		idev = __in6_dev_get(dev);
> +		if (idev) {
> +			idev->cnf.addrgenmode = ipv6_devconf_dflt.addrgenmode;
> +			idev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
> +			addrconf_dev_config(idev->dev);
> +		}
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static int addrconf_sysctl_addrgenmode(struct ctl_table *ctl, int write,
> +								void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int ret;
> +	int new_val;
> +	struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
> +	struct net *net = (struct net *)ctl->extra2;
> +
> +	if (write) { /* sysctl write request */
> +		ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> +		new_val = *((int *)ctl->data);
> +
>
unless I missed it, I don't see a check for valid values for new_val.
The netlink attribute  is checked  for valid values in the existing equivalent netlink code.

^ permalink raw reply

* Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
From: Pravin Shelar @ 2016-12-05  0:58 UTC (permalink / raw)
  To: Jiri Benc; +Cc: Jarno Rajahalme, Linux Kernel Network Developers, Eric Garver
In-Reply-To: <20161202104202.426b2c80@griffin>

On Fri, Dec 2, 2016 at 1:42 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Thu, 1 Dec 2016 12:31:09 -0800, Pravin Shelar wrote:
>> On Wed, Nov 30, 2016 at 6:30 AM, Jiri Benc <jbenc@redhat.com> wrote:
>> > I'm not opposed to changing this but I'm afraid it needs much deeper
>> > review. Because with this in place, no core kernel functions that
>> > depend on skb->protocol may be called from within openvswitch.
>> >
>> Can you give specific example where it does not work?
>
> I can't, I haven't reviewed the usage. I'm just saying that the stack
> does not expect skb->protocol being ETH_P_8021Q for e.g. IPv4 packets.
> It may not be relevant for the calls used by openvswitch but we should
> be sure about that. Especially defragmentation and conntrack is worth
> looking at.
>
> Again, I'm not saying this is wrong nor that there is an actual
> problem. I'm just pointing out that openvswitch has different
> expectations about skb wrt. vlans than the rest of the kernel and we
> should be reasonably sure the behavior is correct when passing between
> the two.
>
I agree that conntrack does not expect skb-protocol to be vlan
protocol. We could accelerate vlan if there is vlan header in packet
itself. That would make the packet consistent across upcalls.

>> skb-protocol value is set by the caller, so it should not be
>> arbitrary. is it missing in any case?
>
> It's not set exactly by the caller, because that's what this patch is
> removing. It is set by whoever handed over the packet to openvswitch.
> The point is we don't know *what* it is set to. It may as well be
> ETH_P_8021Q, breaking the conditions here. It should not happen in
> practice but still, it seems weird to depend on the fact that the
> packet coming to ovs has never skb->protocol equal to ETH_P_8021Q nor
> ETH_P_8021AD.
>

We are kind of dependent on this atleast for L3 packets injected back
by vswitchd. For rest of entry points I think we have to trust the
networking stack would set skb-protocol to correct value. If that is
not true in some case, it is bug and we will need to fix it.

^ permalink raw reply

* Re: [PATCH v2 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active
From: Saeed Mahameed @ 2016-12-05  0:54 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Linux Netdev List, Alexei Starovoitov, Brenden Blanco,
	Daniel Borkmann, David Miller, Jesper Dangaard Brouer,
	Saeed Mahameed, Tariq Toukan, Kernel Team
In-Reply-To: <1480821446-4122277-4-git-send-email-kafai@fb.com>

On Sun, Dec 4, 2016 at 5:17 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> Reserve XDP_PACKET_HEADROOM and honor bpf_xdp_adjust_head()
> when XDP prog is active.  This patch only affects the code
> path when XDP is active.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

Hi Martin, Sorry for the late review, i have some comments below

>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 17 +++++++++++++++--
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 23 +++++++++++++++++------
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c     |  9 +++++----
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  3 ++-
>  4 files changed, 39 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 311c14153b8b..094a13b52cf6 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -51,7 +51,8 @@
>  #include "mlx4_en.h"
>  #include "en_port.h"
>
> -#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
> +#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) - \
> +                                  XDP_PACKET_HEADROOM))
>
>  int mlx4_en_setup_tc(struct net_device *dev, u8 up)
>  {
> @@ -1551,6 +1552,7 @@ int mlx4_en_start_port(struct net_device *dev)
>         struct mlx4_en_tx_ring *tx_ring;
>         int rx_index = 0;
>         int err = 0;
> +       int mtu;
>         int i, t;
>         int j;
>         u8 mc_list[16] = {0};
> @@ -1684,8 +1686,12 @@ int mlx4_en_start_port(struct net_device *dev)
>         }
>
>         /* Configure port */
> +       mtu = priv->rx_skb_size + ETH_FCS_LEN;
> +       if (priv->tx_ring_num[TX_XDP])
> +               mtu += XDP_PACKET_HEADROOM;
> +

Why would the physical MTU care for the headroom you preserve for XDP prog?
This is the wire MTU, it shouldn't be changed, please keep it as
before, any preservation you make in packets buffers are needed only
for FWD case or modify case (HW or wire should not care about them).

>         err = mlx4_SET_PORT_general(mdev->dev, priv->port,
> -                                   priv->rx_skb_size + ETH_FCS_LEN,
> +                                   mtu,
>                                     priv->prof->tx_pause,
>                                     priv->prof->tx_ppp,
>                                     priv->prof->rx_pause,
> @@ -2255,6 +2261,13 @@ static bool mlx4_en_check_xdp_mtu(struct net_device *dev, int mtu)
>  {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
>
> +       if (mtu + XDP_PACKET_HEADROOM > priv->max_mtu) {
> +               en_err(priv,
> +                      "Device max mtu:%d does not allow %d bytes reserved headroom for XDP prog\n",
> +                      priv->max_mtu, XDP_PACKET_HEADROOM);
> +               return false;
> +       }
> +
>         if (mtu > MLX4_EN_MAX_XDP_MTU) {
>                 en_err(priv, "mtu:%d > max:%d when XDP prog is attached\n",
>                        mtu, MLX4_EN_MAX_XDP_MTU);
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 23e9d04d1ef4..324771ac929e 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -96,7 +96,6 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>         struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
>         const struct mlx4_en_frag_info *frag_info;
>         struct page *page;
> -       dma_addr_t dma;
>         int i;
>
>         for (i = 0; i < priv->num_frags; i++) {
> @@ -115,9 +114,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>
>         for (i = 0; i < priv->num_frags; i++) {
>                 frags[i] = ring_alloc[i];
> -               dma = ring_alloc[i].dma + ring_alloc[i].page_offset;
> +               frags[i].page_offset += priv->frag_info[i].rx_headroom;

I don't see any need for headroom on frag_info other that frag0 (which
where the packet starts).
What is the meaning of a headroom of a frag in a middle of a packet ?

if you agree with me then, you can use XDP_PACKET_HEADROOM as is where
needed (i.e frag0 page offset) and remove
"priv->frag_info[i].rx_headroom"

...

After going through the code a little bit i see that this code is
shared between XDP and common path, and you didn't want to add boolean
conditions.

Ok i see what you did here.

Maybe we can pass headroom as a function parameter and split frag0
handling from the rest ?
If it is too much then i am ok with the code as it is,

> +               rx_desc->data[i].addr = cpu_to_be64(frags[i].dma +
> +                                                   frags[i].page_offset);
>                 ring_alloc[i] = page_alloc[i];
> -               rx_desc->data[i].addr = cpu_to_be64(dma);
>         }
>
>         return 0;
> @@ -250,7 +250,8 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
>
>         if (ring->page_cache.index > 0) {
>                 frags[0] = ring->page_cache.buf[--ring->page_cache.index];
> -               rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
> +               rx_desc->data[0].addr = cpu_to_be64(frags[0].dma +
> +                                                   frags[0].page_offset);
>                 return 0;
>         }
>
> @@ -889,6 +890,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                 if (xdp_prog) {
>                         struct xdp_buff xdp;
>                         dma_addr_t dma;
> +                       void *pg_addr, *orig_data;
>                         u32 act;
>
>                         dma = be64_to_cpu(rx_desc->data[0].addr);
> @@ -896,11 +898,18 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                                                 priv->frag_info[0].frag_size,
>                                                 DMA_FROM_DEVICE);
>
> -                       xdp.data = page_address(frags[0].page) +
> -                                                       frags[0].page_offset;
> +                       pg_addr = page_address(frags[0].page);
> +                       orig_data = pg_addr + frags[0].page_offset;
> +                       xdp.data = orig_data;
>                         xdp.data_end = xdp.data + length;
>
>                         act = bpf_prog_run_xdp(xdp_prog, &xdp);
> +
> +                       if (xdp.data != orig_data) {
> +                               length = xdp.data_end - xdp.data;
> +                               frags[0].page_offset = xdp.data - pg_addr;
> +                       }
> +
>

is this needed only for XDP FWD case ?
is this the only way to detect that the user modified the packet
headers (comparing pointers, before and after) ?

if the answer is yes, it should be faster to unconditionally reset
packet offset and lenght on XDP_FWD :
case XDP_FWD:
   length = xdp.data_end - xdp.data;
   frags[0].page_offset = xdp.data - pg_addr;


>                         switch (act) {
>                         case XDP_PASS:
>                                 break;
> @@ -1180,6 +1189,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
>                  */
>                 priv->frag_info[0].frag_stride = PAGE_SIZE;
>                 priv->frag_info[0].dma_dir = PCI_DMA_BIDIRECTIONAL;
> +               priv->frag_info[0].rx_headroom = XDP_PACKET_HEADROOM;
>                 i = 1;
>         } else {
>                 int buf_size = 0;
> @@ -1194,6 +1204,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
>                                 ALIGN(priv->frag_info[i].frag_size,
>                                       SMP_CACHE_BYTES);
>                         priv->frag_info[i].dma_dir = PCI_DMA_FROMDEVICE;
> +                       priv->frag_info[i].rx_headroom = 0;

IMHO, redundant. as you see here frag0 and other frags handling are
separated, maybe we can do the same in mlx4_en_alloc_frags.

>                         buf_size += priv->frag_info[i].frag_size;
>                         i++;
>                 }
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..9e5f38cefe5f 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -354,7 +354,7 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
>         struct mlx4_en_rx_alloc frame = {
>                 .page = tx_info->page,
>                 .dma = tx_info->map0_dma,
> -               .page_offset = 0,
> +               .page_offset = XDP_PACKET_HEADROOM,
>                 .page_size = PAGE_SIZE,
>         };
>
> @@ -1132,7 +1132,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>         tx_info->page = frame->page;
>         frame->page = NULL;
>         tx_info->map0_dma = dma;
> -       tx_info->map0_byte_count = length;
> +       tx_info->map0_byte_count = length + frame->page_offset;

Didn't you already take care of lenght by the following code:
                       if (xdp.data != orig_data) {
                               length = xdp.data_end - xdp.data;
                               frags[0].page_offset = xdp.data - pg_addr;
                        }

and here  frame->page_offset is not really page offset, it can only be
XDP_PACKET_HEADROOM.

>         tx_info->nr_txbb = nr_txbb;
>         tx_info->nr_bytes = max_t(unsigned int, length, ETH_ZLEN);
>         tx_info->data_offset = (void *)data - (void *)tx_desc;
> @@ -1141,9 +1141,10 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>         tx_info->linear = 1;
>         tx_info->inl = 0;
>
> -       dma_sync_single_for_device(priv->ddev, dma, length, PCI_DMA_TODEVICE);
> +       dma_sync_single_range_for_device(priv->ddev, dma, frame->page_offset,
> +                                        length, PCI_DMA_TODEVICE);
>
> -       data->addr = cpu_to_be64(dma);
> +       data->addr = cpu_to_be64(dma + frame->page_offset);
>         data->lkey = ring->mr_key;
>         dma_wmb();
>         data->byte_count = cpu_to_be32(length);
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 20a936428f4a..ba1c6cd0cc79 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -475,7 +475,8 @@ struct mlx4_en_frag_info {
>         u16 frag_prefix_size;
>         u32 frag_stride;
>         enum dma_data_direction dma_dir;
> -       int order;
> +       u16 order;
> +       u16 rx_headroom;
>  };
>
>  #ifdef CONFIG_MLX4_EN_DCB
> --
> 2.5.1
>

^ permalink raw reply

* Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
From: Eric Dumazet @ 2016-12-05  0:52 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexei Starovoitov, Martin KaFai Lau, Linux Netdev List,
	Brenden Blanco, Daniel Borkmann, David Miller, Saeed Mahameed,
	Tariq Toukan, Kernel Team
In-Reply-To: <CALzJLG_82MNOyuq+y63tR2SDmKo3ZQA5XAgT1_r15B8_V19xKg@mail.gmail.com>

On Mon, 2016-12-05 at 01:31 +0200, Saeed Mahameed wrote:

> Alexei, we should start considering PPC archs for XDP use cases,
> demanding page per packet on those archs is a little bit heavy
> requirement

Well, 'little' is an understatement ;)

Note that PPC had serious problems before  commit bd68a2a854ad5a85f0
("net: set SK_MEM_QUANTUM to 4096")

So I suspect one page per frame will likely be a huge problem
for hosts dealing with 10^5 or more TCP sockets.

Either skb->truesize is set to 64KB and TCP window must be really tiny,
or skb->truesize is set to ~2KB and OOM is waiting to happen.

^ permalink raw reply

* Re: [PATCH v1 net-next 1/5] net: dsa: mv88e6xxx: Reserved Management frames to CPU
From: Vivien Didelot @ 2016-12-05  0:23 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: David Miller, netdev
In-Reply-To: <20161204224054.GA24118@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

>> You can have several implementations in the same file (e.g. global1.c),
>> so again the only value is the function name, not the struct member.
>
> The structure member have g1_ has a lot of value.
>
>         if (chip->info->ops->set_cpu_port) {
>                 err = chip->info->ops->set_cpu_port(chip, upstream_port);
>                 if (err)
>                         return err;
>         }
>
> Where to i need to go look for set_cpu_port? I have no idea.

In your chip's ops definition, as for any ops structure. Same as for
your example right below which is unfortunately not a solution per-se.

>
>         if (chip->info->ops->g1_set_cpu_port) {
>                 err = chip->info->ops->g1_set_cpu_port(chip, upstream_port);
>                 if (err)
>                         return err;
>         }
>
> Humm, the hint tells me it is in global1.c. And i also know that all
> of them are in global1.c.

Until a new chip relocates a feature somewhere else.

Then you'll have to rename the structure member(s) because you have a
policy saying "no prefix means different set of registers".


       Vivien

^ permalink raw reply

* Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
From: Pravin Shelar @ 2016-12-05  0:22 UTC (permalink / raw)
  To: Jiri Benc; +Cc: Jarno Rajahalme, Linux Kernel Network Developers, Eric Garver
In-Reply-To: <20161202102509.065df1e8@griffin>

On Fri, Dec 2, 2016 at 1:25 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Thu, 1 Dec 2016 11:50:00 -0800, Pravin Shelar wrote:
>> This is not changing any behavior compared to current OVS vlan checks.
>> Single vlan header is not considered for MTU check.
>
> It is changing it.
>
> Consider the case when there's an interface with MTU 1500 forwarding to
> an interface with MTU 1496. Obviously, full-sized vlan frames
> ingressing on the first interface are not forwardable to the second
> one. Yet, if the vlan tag is accelerated (and thus not counted in
> skb->len), is_skb_forwardable happily returns true because of the check
>
>         len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
>         if (skb->len <= len)
>
ok, This case would be allowed due to this patch. But core linux stack
and bridge is using this check then why not just use same forwarding
check in OVS too, this make it consistent with core networking
forwarding expectations.

^ permalink raw reply

* Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
From: Saeed Mahameed @ 2016-12-04 23:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Martin KaFai Lau, Linux Netdev List, Brenden Blanco,
	Daniel Borkmann, David Miller, Saeed Mahameed, Tariq Toukan,
	Kernel Team
In-Reply-To: <58421775.6090905@fb.com>

On Sat, Dec 3, 2016 at 2:53 AM, Alexei Starovoitov <ast@fb.com> wrote:
> On 12/2/16 4:38 PM, Eric Dumazet wrote:
>>
>> On Fri, 2016-12-02 at 15:23 -0800, Martin KaFai Lau wrote:
>>>
>>> When XDP prog is attached, it is currently limiting
>>> MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN) which is 1514
>>> in x86.
>>>
>>> AFAICT, since mlx4 is doing one page per packet for XDP,
>>> we can at least raise the MTU limitation up to
>>> PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this patch is
>>> doing.  It will be useful in the next patch which allows
>>> XDP program to extend the packet by adding new header(s).
>>>
>>> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>>> ---
>>
>>
>> Have you tested your patch on a host with PAGE_SIZE = 64 KB ?
>>
>> Looks XDP really kills arches with bigger pages :(
>
>
> I'm afraid xdp mlx[45] support was not tested on arches
> with 64k pages at all. Not just this patch.

Yep, in mlx5 page per packet became the default, with or without XDP,
unlike mlx4.
currently we allow 64KB pages per packet! which is wrong and need to be fixed.

I will get to this task soon.

> I think people who care about such archs should test?

We do test mlx5 and mlx4 on PPC arch. other than we require more
memory than we need, we don't see any issues. and we don't test XDP on
those archs.

> Note page per packet is not a hard requirement for all drivers
> and all archs. For mlx[45] it was the easiest and the most
> convenient way to achieve desired performance.
> If there are ways to do the same performance differently,
> I'm all ears :)
>

when bigger pages, i.e  PAGE_SIZE > 8K, my current low hanging fruit
options for mlx5 are
1. start sharing pages for multi packets.
2. Go back to the SKB allocator (allocate ring of SKBs on advance
rather than page per packet/s).

this means that default RX memory scheme will be different than XDP's
on such ARCHs (XDP wil still use page per packet)

Alexei, we should start considering PPC archs for XDP use cases,
demanding page per packet on those archs is a little bit heavy
requirement

^ permalink raw reply

* Re: [PATCH] mlx4: Use kernel sizeof and alloc styles
From: Joe Perches @ 2016-12-04 22:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yishai Hadas, Tariq Toukan, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1480885139.18162.484.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]

On Sun, 2016-12-04 at 12:58 -0800, Eric Dumazet wrote:
> On Sun, 2016-12-04 at 12:11 -0800, Joe Perches wrote:
> > Convert sizeof foo to sizeof(foo) and allocations with multiplications
> > to the appropriate kcalloc/kmalloc_array styles.
> > 
> > Signed-off-by: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
> > ---
> 
> Gah.
> 
> This is one of the hotest NIC driver on linux at this moment, 
> with XDP and other efforts going on.
> 
> Some kmalloc() are becoming kmalloc_node() in some dev branches, and
> there is no kmalloc_array_node() yet.

Well that kmalloc_array_node, like this patch, is pretty trivial to add.
Something like the attached for kmalloc_array_node and kcalloc_node.

> This kind of patch is making rebases/backports very painful.

That's really not an issue for me.

> Could we wait ~6 months before doing such cleanup/changes please ?

This is certainly a trivial patch that could be
done at almost any time.

> If you believe a bug needs a fix, please send a patch to address it.
> 
> Thanks.

No worries.

[-- Attachment #2: slab.diff --]
[-- Type: text/x-patch, Size: 1494 bytes --]

 include/linux/slab.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..d98c07713c03 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -647,6 +647,37 @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
 	return kmalloc_node(size, flags | __GFP_ZERO, node);
 }
 
+/**
+ * kmalloc_array_node - allocate memory for an array
+ * from a particular memory node.
+ * @n: number of elements.
+ * @size: element size.
+ * @flags: the type of memory to allocate (see kmalloc).
+ * @node: memory node from which to allocate
+ */
+static inline void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
+				       int node)
+{
+	if (size != 0 && n > SIZE_MAX / size)
+		return NULL;
+	if (__builtin_constant_p(n) && __builtin_constant_p(size))
+		return kmalloc_node(n * size, flags, node);
+	return __kmalloc_node(n * size, flags, node);
+}
+
+/**
+ * kcalloc_node - allocate memory for an array from a particular memory node.
+ * The memory is set to zero.
+ * @n: number of elements.
+ * @size: element size.
+ * @flags: the type of memory to allocate (see kmalloc).
+ * @node: memory node from which to allocate
+ */
+static inline void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
+{
+	return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
+}
+
 unsigned int kmem_cache_size(struct kmem_cache *s);
 void __init kmem_cache_init_late(void);
 

^ permalink raw reply related

* Re: [PATCH v1 net-next 1/5] net: dsa: mv88e6xxx: Reserved Management frames to CPU
From: Andrew Lunn @ 2016-12-04 22:40 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: David Miller, netdev
In-Reply-To: <87oa0r9wei.fsf@ketchup.i-did-not-set--mail-host-address--so-tickle-me>

> You can have several implementations in the same file (e.g. global1.c),
> so again the only value is the function name, not the struct member.

The structure member have g1_ has a lot of value.

        if (chip->info->ops->set_cpu_port) {
                err = chip->info->ops->set_cpu_port(chip, upstream_port);
                if (err)
                        return err;
        }

Where to i need to go look for set_cpu_port? I have no idea.

        if (chip->info->ops->g1_set_cpu_port) {
                err = chip->info->ops->g1_set_cpu_port(chip, upstream_port);
                if (err)
                        return err;
        }

Humm, the hint tells me it is in global1.c. And i also know that all
of them are in global1.c.

These ops do make the code simpler. But the downside is it makes it
harder to find the actual code, now that it is spread over multiple
files. And these hits help negate the downside a little.

   Andrew

^ permalink raw reply

* [PATCH] net: calxeda: xgmac: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes @ 2016-12-04 22:37 UTC (permalink / raw)
  To: davem, jarod; +Cc: netdev, linux-kernel, Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
---
 drivers/net/ethernet/calxeda/xgmac.c |   17 ++++++++---------
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/calxeda/xgmac.c b/drivers/net/ethernet/calxeda/xgmac.c
index 6e72366..ce7de6f 100644
--- a/drivers/net/ethernet/calxeda/xgmac.c
+++ b/drivers/net/ethernet/calxeda/xgmac.c
@@ -1530,15 +1530,14 @@ static int xgmac_set_features(struct net_device *dev, netdev_features_t features
 	.ndo_set_features = xgmac_set_features,
 };
 
-static int xgmac_ethtool_getsettings(struct net_device *dev,
-					  struct ethtool_cmd *cmd)
+static int xgmac_ethtool_get_link_ksettings(struct net_device *dev,
+					    struct ethtool_link_ksettings *cmd)
 {
-	cmd->autoneg = 0;
-	cmd->duplex = DUPLEX_FULL;
-	ethtool_cmd_speed_set(cmd, 10000);
-	cmd->supported = 0;
-	cmd->advertising = 0;
-	cmd->transceiver = XCVR_INTERNAL;
+	cmd->base.autoneg = 0;
+	cmd->base.duplex = DUPLEX_FULL;
+	cmd->base.speed = 10000;
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported, 0);
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising, 0);
 	return 0;
 }
 
@@ -1681,7 +1680,6 @@ static int xgmac_set_wol(struct net_device *dev,
 }
 
 static const struct ethtool_ops xgmac_ethtool_ops = {
-	.get_settings = xgmac_ethtool_getsettings,
 	.get_link = ethtool_op_get_link,
 	.get_pauseparam = xgmac_get_pauseparam,
 	.set_pauseparam = xgmac_set_pauseparam,
@@ -1690,6 +1688,7 @@ static int xgmac_set_wol(struct net_device *dev,
 	.get_wol = xgmac_get_wol,
 	.set_wol = xgmac_set_wol,
 	.get_sset_count = xgmac_get_sset_count,
+	.get_link_ksettings = xgmac_ethtool_get_link_ksettings,
 };
 
 /**
-- 
1.7.4.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox