Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net][v2] bpf: fix states equal logic for varlen access
From: Daniel Borkmann @ 2016-11-29 19:09 UTC (permalink / raw)
  To: Josef Bacik, davem, netdev, ast, jannh, kernel-team
In-Reply-To: <1480440429-2531-1-git-send-email-jbacik@fb.com>

On 11/29/2016 06:27 PM, Josef Bacik wrote:
> If we have a branch that looks something like this
>
> int foo = map->value;
> if (condition) {
>    foo += blah;
> } else {
>    foo = bar;
> }
> map->array[foo] = baz;
>
> We will incorrectly assume that the !condition branch is equal to the condition
> branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
> adjust this logic to only do this if we didn't do a varlen access after we
> processed the !condition branch, otherwise we have different ranges and need to
> check the other branch as well.
>
> Fixes: 484611357c19 ("bpf: allow access into map value arrays")
> Reported-by: Jann Horn <jannh@google.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [PATCH net-next] bpf: add test for the verifier equal logic bug
From: Daniel Borkmann @ 2016-11-29 19:06 UTC (permalink / raw)
  To: Josef Bacik, davem, netdev, ast, jannh, kernel-team
In-Reply-To: <1480440919-3252-1-git-send-email-jbacik@fb.com>

On 11/29/2016 06:35 PM, Josef Bacik wrote:
> This is a test to verify that
>
> bpf: fix states equal logic for varlen access
>
> actually fixed the problem.  The problem was if the register we added to our map
> register was UNKNOWN in both the false and true branches and the only thing that
> changed was the range then we'd incorrectly assume that the true branch was
> valid, which it really wasnt.  This tests this case and properly fails without
> my fix in place and passes with it in place.
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks a lot for the test case! They are always useful to have ... which
just reminds me: it seems we didn't add anything for f23cc643f9ba ("bpf:
fix range arithmetic for bpf map access"). ;-)

^ permalink raw reply

* Re: [PATCH net-netx] net: lan78xx: add LAN7801 MAC-only support
From: Florian Fainelli @ 2016-11-29 18:58 UTC (permalink / raw)
  To: Woojung.Huh, davem, andrew; +Cc: netdev, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D4096FC85@CHN-SV-EXMX02.mchp-main.com>

On 11/29/2016 10:49 AM, Woojung.Huh@microchip.com wrote:
>>> +	/* LED2/PME_N/IRQ_N/RGMII_ID pin to IRQ_N mode */
>>> +	buf = phy_read_mmd_indirect(phydev, 32784, 3);
>>> +	buf &= ~0x1800;
>>> +	buf |= 0x0800;
>>> +	phy_write_mmd_indirect(phydev, 32784, 3, buf);
>>
>> Using decimal numbers for register addresses is a bit unusual.
> 
> OK. Will change it.
> 
>>> +
>>> +	/* RGMII MAC TXC Delay Enable */
>>> +	ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
>>> +				MAC_RGMII_ID_TXC_DELAY_EN_);
>>> +
>>> +	/* RGMII TX DLL Tune Adjust */
>>> +	ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
>>> +
>>> +	*interface = PHY_INTERFACE_MODE_RGMII_TXID;
>>> +}
>>> +
>>> +static void ksz9031rnx_pre_config(struct lan78xx_net *dev,
>>> +				  struct phy_device *phydev,
>>> +				  phy_interface_t *interface)
>>> +{
>>> +	/* Micrel9301RNX PHY configuration */
>>> +	/* RGMII Control Signal Pad Skew */
>>> +	phy_write_mmd_indirect(phydev, 4, 2, 0x0077);
>>> +	/* RGMII RX Data Pad Skew */
>>> +	phy_write_mmd_indirect(phydev, 5, 2, 0x7777);
>>> +	/* RGMII RX Clock Pad Skew */
>>> +	phy_write_mmd_indirect(phydev, 8, 2, 0x1FF);
>>> +
>>> +	*interface = PHY_INTERFACE_MODE_RGMII_RXID;
>>> +}
>>
>> This should really belong in the respective PHY drivers for these PHYs,
>> is there a particular reason you decided to do this here?
> 
> Main reason is because these are MAC dependent.
> In case these are in PHY driver, I expect some parameters/defines may need to be passed to it.
> Trying to keep PHY driver as generic as possible.

There are two ways to get these settings propagated to the PHY driver:

- using a board fixup which is going to be invoked during
drv->config_init() time

- specifying a phydev->dev_flags and reading it from the PHY driver to
act upon and configure the PHY based on that value, there are only
32-bits available though, and you need to make sure they are not
conflicting with other potential users in tree

My preference would go with 1, since you could just register it in your
PHY driver and re-use the code you are proposing to include here.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] bpf: add test for the verifier equal logic bug
From: Alexei Starovoitov @ 2016-11-29 18:54 UTC (permalink / raw)
  To: Josef Bacik; +Cc: davem, netdev, ast, jannh, daniel, kernel-team
In-Reply-To: <1480440919-3252-1-git-send-email-jbacik@fb.com>

On Tue, Nov 29, 2016 at 12:35:19PM -0500, Josef Bacik wrote:
> This is a test to verify that
> 
> bpf: fix states equal logic for varlen access
> 
> actually fixed the problem.  The problem was if the register we added to our map
> register was UNKNOWN in both the false and true branches and the only thing that
> changed was the range then we'd incorrectly assume that the true branch was
> valid, which it really wasnt.  This tests this case and properly fails without
> my fix in place and passes with it in place.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Awesome. thanks for the test!
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net][v2] bpf: fix states equal logic for varlen access
From: Alexei Starovoitov @ 2016-11-29 18:52 UTC (permalink / raw)
  To: Josef Bacik; +Cc: davem, netdev, ast, jannh, daniel, kernel-team
In-Reply-To: <1480440429-2531-1-git-send-email-jbacik@fb.com>

On Tue, Nov 29, 2016 at 12:27:09PM -0500, Josef Bacik wrote:
> If we have a branch that looks something like this
> 
> int foo = map->value;
> if (condition) {
>   foo += blah;
> } else {
>   foo = bar;
> }
> map->array[foo] = baz;
> 
> We will incorrectly assume that the !condition branch is equal to the condition
> branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
> adjust this logic to only do this if we didn't do a varlen access after we
> processed the !condition branch, otherwise we have different ranges and need to
> check the other branch as well.
> 
> Fixes: 484611357c19 ("bpf: allow access into map value arrays")
> Reported-by: Jann Horn <jannh@google.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Thank you!
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: bpf debug info
From: Alexei Starovoitov @ 2016-11-29 18:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, netdev, Brenden Blanco, Thomas Graf, Wangnan,
	He Kuang, kernel-team
In-Reply-To: <20161129153818.51104fab@jkicinski-Precision-T1700>

On Tue, Nov 29, 2016 at 03:38:18PM +0000, Jakub Kicinski wrote:
> 
> >>> [...]
> > > So next step is to improve verifier messages to be more human friendly.
> > > The step after is to introduce BPF_COMMENT pseudo instruction
> > > that will be ignored by the interpreter yet it will contain the text
> > > of original source code. Then llvm-objdump step won't be necessary.
> > > The bpf loader will load both instructions and pieces of C sources.
> > > Then verifier errors should be even easier to read and humans
> > > can easily understand the purpose of the program.  
> > 
> > So the BPF_COMMENT pseudo insn will get stripped away from the insn array
> > after verification step, so we don't need to hold/account for this mem? I
> > assume in it's ->imm member it will just hold offset into text blob?
> 
> Associating any form of opaque data with programs always makes me
> worried about opening a side channel of communication with a specialized
> user space implementations/compilers.  But I guess if the BPF_COMMENTs
> are stripped in the verifier as Daniel assumes drivers and JITs will
> never see it.

yes. the idea that it's a comment. It can contain any text,
not only C code, but any other language.
It's definitely going to be stripped before JITs and kernel will
not make any safety or translation decisions based on such comment.

> Just to clarify, however - is there any reason why pushing the source
> code into the kernel is necessary?  Or is it just for convenience?
> Provided the user space loader has access to the debug info it should
> have no problems matching the verifier output to code lines?

correct. just for convenience. The user space has to keep .o around,
since it can crash, would have to reload and so on.
Only for some script that ssh-es into servers and wants to see
what is being loaded, it might help to dump full asm and these comments
along with prog_digest that Daniel is working on in parallel.
Alternatively instead of doing BPF_COMMENT we can load the whole .o
as-is into bpffs as a blob. Later (based on digest) the kernel can
dump such .o back for user space to run objdump on. It all can be
done without kernel involvement. Like tc command can copy .o and so on.
But not everything is using tc.
Another alternative is to do a decompiler from bpf binary
into meaningful C code. It's not trivial and names will be lost.
bpf_comment approach is pretty cheap from kernel point of view
and greatly helps visibility when users don't cheat with debug info.

^ permalink raw reply

* RE: [PATCH net-netx] net: lan78xx: add LAN7801 MAC-only support
From: Woojung.Huh @ 2016-11-29 18:49 UTC (permalink / raw)
  To: f.fainelli, davem, andrew; +Cc: netdev, UNGLinuxDriver
In-Reply-To: <007d0bc0-378c-da93-d730-28f7b5b5073a@gmail.com>

> > +	/* LED2/PME_N/IRQ_N/RGMII_ID pin to IRQ_N mode */
> > +	buf = phy_read_mmd_indirect(phydev, 32784, 3);
> > +	buf &= ~0x1800;
> > +	buf |= 0x0800;
> > +	phy_write_mmd_indirect(phydev, 32784, 3, buf);
> 
> Using decimal numbers for register addresses is a bit unusual.

OK. Will change it.

> > +
> > +	/* RGMII MAC TXC Delay Enable */
> > +	ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
> > +				MAC_RGMII_ID_TXC_DELAY_EN_);
> > +
> > +	/* RGMII TX DLL Tune Adjust */
> > +	ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
> > +
> > +	*interface = PHY_INTERFACE_MODE_RGMII_TXID;
> > +}
> > +
> > +static void ksz9031rnx_pre_config(struct lan78xx_net *dev,
> > +				  struct phy_device *phydev,
> > +				  phy_interface_t *interface)
> > +{
> > +	/* Micrel9301RNX PHY configuration */
> > +	/* RGMII Control Signal Pad Skew */
> > +	phy_write_mmd_indirect(phydev, 4, 2, 0x0077);
> > +	/* RGMII RX Data Pad Skew */
> > +	phy_write_mmd_indirect(phydev, 5, 2, 0x7777);
> > +	/* RGMII RX Clock Pad Skew */
> > +	phy_write_mmd_indirect(phydev, 8, 2, 0x1FF);
> > +
> > +	*interface = PHY_INTERFACE_MODE_RGMII_RXID;
> > +}
> 
> This should really belong in the respective PHY drivers for these PHYs,
> is there a particular reason you decided to do this here?

Main reason is because these are MAC dependent.
In case these are in PHY driver, I expect some parameters/defines may need to be passed to it.
Trying to keep PHY driver as generic as possible.

Thanks for your comments.
- Woojung

^ permalink raw reply

* Re: [PATCH v2 net-next 10/11] qede: Add basic XDP support
From: Daniel Borkmann @ 2016-11-29 18:28 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Yuval Mintz, davem, netdev, alexei.starovoitov
In-Reply-To: <20161129171020.6b8b552d@jkicinski-Precision-T1700>

On 11/29/2016 06:10 PM, Jakub Kicinski wrote:
> On Tue, 29 Nov 2016 16:48:50 +0100, Daniel Borkmann wrote:
>> On 11/29/2016 03:47 PM, Yuval Mintz wrote:
>>> Add support for the ndo_xdp callback. This patch would support XDP_PASS,
>>> XDP_DROP and XDP_ABORTED commands.
>>>
>>> This also adds a per Rx queue statistic which counts number of packets
>>> which didn't reach the stack [due to XDP].
>>>
>>> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
>> [...]
>>> @@ -1560,6 +1593,7 @@ static int qede_rx_process_cqe(struct qede_dev *edev,
>>>    			       struct qede_fastpath *fp,
>>>    			       struct qede_rx_queue *rxq)
>>>    {
>>> +	struct bpf_prog *xdp_prog = READ_ONCE(rxq->xdp_prog);
>>>    	struct eth_fast_path_rx_reg_cqe *fp_cqe;
>>>    	u16 len, pad, bd_cons_idx, parse_flag;
>>>    	enum eth_rx_cqe_type cqe_type;
>>> @@ -1596,6 +1630,11 @@ static int qede_rx_process_cqe(struct qede_dev *edev,
>>>    	len = le16_to_cpu(fp_cqe->len_on_first_bd);
>>>    	pad = fp_cqe->placement_offset;
>>>
>>> +	/* Run eBPF program if one is attached */
>>> +	if (xdp_prog)
>>> +		if (!qede_rx_xdp(edev, fp, rxq, xdp_prog, bd, fp_cqe))
>>> +			return 1;
>>> +
>>
>> You also need to wrap this under rcu_read_lock() (at least I haven't seen
>> it in your patches) for same reasons as stated in 326fe02d1ed6 ("net/mlx4_en:
>> protect ring->xdp_prog with rcu_read_lock"), as otherwise xdp_prog could
>> disappear underneath you. mlx4 and nfp does it correctly, looks like mlx5
>> doesn't.
>
> My understanding was that Yuval is always doing full stop()/start() so
> there should be no RX packets in flight while the XDP prog is being
> changed.  But thinking about it again, perhaps is worth adding the

Ohh, true, thanks for pointing this out. I guess I got confused by
the READ_ONCE() then.

> optimization to forego the full qede_reload() in qede_xdp_set() if there
> is a program already loaded and just do the xchg()+put() (and add RCU
> protection on the fast path)?

Would be worth it as a follow-up later on, yes.

^ permalink raw reply

* [PATCH RFC v3] ethtool: implement helper to get flow_type value
From: Jacob Keller @ 2016-11-29 18:45 UTC (permalink / raw)
  To: netdev, Intel Wired LAN; +Cc: David Miller, Jakub Kicinski, Jacob Keller

Often a driver wants to store the flow type and thus it must mask the
extra fields. This is a task that could grow more complex as more flags
are added in the future. Add a helper function that masks the flags for
marking additional fields.

Modify drivers in drivers/net/ethernet that currently check for FLOW_EXT
and FLOW_MAC_EXT to use the helper. Currently this is only the mellanox
drivers.

I chose not to modify other drivers as I'm actually unsure whether we
should always mask the flow type even for drivers which don't recognize
the newer flags. On the one hand, today's drivers (generally)
automatically fail when a new flag is used because they won't mask it
and their checks against flow_type will not match. On the other hand, it
means another place that you have to update when you begin implementing
a flag.

An alternative is to have the driver store a set of flags that it knows
about, and then have ethtool core do the check for us to discard frames.
I haven't implemented this quite yet.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c         | 4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c | 6 +++---
 include/uapi/linux/ethtool.h                            | 5 +++++
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 487a58f9c192..d8f9839ce2a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1270,7 +1270,7 @@ static int mlx4_en_validate_flow(struct net_device *dev,
 			return -EINVAL;
 	}
 
-	switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+	switch (ethtool_get_flow_spec_type(cmd->fs.flow_type)) {
 	case TCP_V4_FLOW:
 	case UDP_V4_FLOW:
 		if (cmd->fs.m_u.tcp_ip4_spec.tos)
@@ -1493,7 +1493,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	if (err)
 		return err;
 
-	switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+	switch (ethtool_get_flow_spec_type(cmd->fs.flow_type)) {
 	case ETHER_FLOW:
 		spec_l2 = kzalloc(sizeof(*spec_l2), GFP_KERNEL);
 		if (!spec_l2)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
index 3691451c728c..066e6c5cf38b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
@@ -63,7 +63,7 @@ static struct mlx5e_ethtool_table *get_flow_table(struct mlx5e_priv *priv,
 	int table_size;
 	int prio;
 
-	switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+	switch (ethtool_get_flow_spec_type(fs->flow_type)) {
 	case TCP_V4_FLOW:
 	case UDP_V4_FLOW:
 		max_tuples = ETHTOOL_NUM_L3_L4_FTS;
@@ -147,7 +147,7 @@ static int set_flow_attrs(u32 *match_c, u32 *match_v,
 					     outer_headers);
 	void *outer_headers_v = MLX5_ADDR_OF(fte_match_param, match_v,
 					     outer_headers);
-	u32 flow_type = fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
+	u32 flow_type = ethtool_get_flow_spec_type(fs->flow_type);
 	struct ethtool_tcpip4_spec *l4_mask;
 	struct ethtool_tcpip4_spec *l4_val;
 	struct ethtool_usrip4_spec *l3_mask;
@@ -393,7 +393,7 @@ static int validate_flow(struct mlx5e_priv *priv,
 	    fs->ring_cookie != RX_CLS_FLOW_DISC)
 		return -EINVAL;
 
-	switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+	switch (ethtool_get_flow_spec_type(fs->flow_type)) {
 	case ETHER_FLOW:
 		eth_mask = &fs->m_u.ether_spec;
 		if (!is_zero_ether_addr(eth_mask->h_dest))
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index f0db7788f887..ed650eef9e54 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1583,6 +1583,11 @@ static inline int ethtool_validate_duplex(__u8 duplex)
 #define	FLOW_EXT	0x80000000
 #define	FLOW_MAC_EXT	0x40000000
 
+static inline __u32 ethtool_get_flow_spec_type(__u32 flow_type)
+{
+	return flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
+}
+
 /* L3-L4 network traffic flow hash options */
 #define	RXH_L2DA	(1 << 1)
 #define	RXH_VLAN	(1 << 2)
-- 
2.11.0.rc2.152.g4d04e67

^ permalink raw reply related

* Re: [PATCH RFC v2] ethtool: implement helper to get flow_type value
From: Keller, Jacob E @ 2016-11-29 18:44 UTC (permalink / raw)
  To: kubakici@wp.pl
  Cc: netdev@vger.kernel.org, davem@davemloft.net,
	intel-wired-lan@lists.osuosl.org
In-Reply-To: <20161129152144.484bea62@jkicinski-Precision-T1700>

On Tue, 2016-11-29 at 15:21 +0000, Jakub Kicinski wrote:
> On Mon, 28 Nov 2016 15:03:43 -0800, Jacob Keller wrote:
> > +static inline __u32 ethtool_get_flow_spec_type(__u32 flow_type)
> > +{
> > +	return flow_type & (FLOW_EXT | FLOW_MAC_EXT);
> 
> I don't have anything of substance to say but I think you are missing
> a
> negation (~) here compared to the code you are replacing ;)

HAH! Yes you are right. I made a mistake when copying this out of my
driver header file.

Will fix. Sorry for the thrash, and thanks for catching my mistake.

Regards,
Jake

^ permalink raw reply

* Re: [PATCH v2 net-next 10/11] qede: Add basic XDP support
From: Daniel Borkmann @ 2016-11-29 18:42 UTC (permalink / raw)
  To: Mintz, Yuval, Jakub Kicinski
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	alexei.starovoitov@gmail.com
In-Reply-To: <BL2PR07MB23061A22119C747AA7E9A2FA8D8D0@BL2PR07MB2306.namprd07.prod.outlook.com>

On 11/29/2016 06:51 PM, Mintz, Yuval wrote:
>>> You also need to wrap this under rcu_read_lock() (at least I haven't seen
>>> it in your patches) for same reasons as stated in 326fe02d1ed6 ("net/mlx4_en:
>>> protect ring->xdp_prog with rcu_read_lock"), as otherwise xdp_prog could
>>> disappear underneath you. mlx4 and nfp does it correctly, looks like mlx5
>>> doesn't.
>
>> My understanding was that Yuval is always doing full stop()/start() so
>> there should be no RX packets in flight while the XDP prog is being
>> changed.  But thinking about it again, perhaps is worth adding the
>> optimization to forego the full qede_reload() in qede_xdp_set() if there
>> is a program already loaded and just do the xchg()+put() (and add RCU
>> protection on the fast path)?
>
> Yeps. That the current state of the code.
> I'll surely pursue this later, but I don't think all this added complexity
> is required for the initial submission.
>
> BTW, one of the problems [or perhaps 'lack of motivation' is a better term]
> I had with the program switching scenario was that there was no sample
> application that did it.
> If it's really an interesting [basic] scenario, perhaps it's worthy to add
> a sample user app. that will repeatedly switch the attached eBPF?

Fwiw, I'm still waiting for Stephen to process his queue, and then I have
a patch for iproute2 to add a minimal initial front-end that can be useful
for experimenting/testing. The atomic switching scenario w/o stop()/start()
would definitely be useful when you need to fix an issue or modify behavior
in your currently loaded program on the fly when you cannot afford small
downtime.

^ permalink raw reply

* Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: David Miller @ 2016-11-29 18:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tariqt
In-Reply-To: <1480088780.8455.543.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Nov 2016 07:46:20 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> mlx4 stats are chaotic because a deferred work queue is responsible
> to update them every 250 ms.
> 
> Even sampling stats every one second with "sar -n DEV 1" gives
> variations like the following :
 ...
> This patch allows rx/tx bytes/packets counters being folded at the
> time we need stats.
> 
> We now can fetch stats every 1 ms if we want to check NIC behavior
> on a small time window. It is also easier to detect anomalies.
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>

Applied to net-next, thanks Eric.

^ permalink raw reply

* [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
From: Robert Shearman @ 2016-11-29 17:46 UTC (permalink / raw)
  To: davem; +Cc: netdev, Alexander Duyck, Robert Shearman

With certain distributions of routes it can take a long time to add
and delete further routes at scale. For example, with a route for
10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
is update_suffix:

      40.39%  [kernel]                      [k] update_suffix
       8.02%  libnl-3.so.200.19.0           [.] nl_hash_table_lookup

With these changes, the time is reduced to 4s for the same scale and
distribution of routes.

The issue is that update_suffix does an O(n) walk on the children of a
node and the with a dense distribtion of routes the number of children
in a node tends towards the number of nodes in the tree.

In the add case it isn't necessary to walk all the other children to
update the largest suffix length of the node (which is what
update_suffix is doing) since we already know what the largest suffix
length of all the other children is and we only need to update it if
the new node/leaf has a larger suffix. However, it is currently called
from resize which is called on any update to rebalance the trie.

Therefore improve the scaling by moving the responsibility of
recalculating the node's largest suffix length out of the resize
function into its callers. For fib_insert_node this can be taken care
of by extending put_child to not only update the largest suffix length
in the parent node, but to propagate it all the way up the trie as
required, using a new function, node_push_suffix, based on
leaf_push_suffix, but renamed to reflect its purpose and made safe if
the node has no parent.

No changes are required to inflate/halve since the maximum suffix
length of the sub-trie starting from the node's parent isn't changed,
and the suffix lengths of the halved/inflated nodes are updated
through the creation of the new nodes and put_child called to add the
children to them.

In both fib_table_flush and fib_table_flush_external, given that a
walk of the entire trie is being done there's a choice of optimising
either for the case of a small number of leafs being flushed by
updating the suffix on a change and propagating it up, or optimising
for large numbers of leafs being flushed by deferring the updating of
the largest suffix length to the walk up to the parent node. I opted
for the latter to keep the algorithm linear in complexity in the case
of flushing all leafs and because it is close to the status quo.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 net/ipv4/fib_trie.c | 86 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 62 insertions(+), 24 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 026f309c51e9..701cae8af44a 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn, struct key_vector *n)
 	return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n);
 }

+static void node_push_suffix(struct key_vector *tn, struct key_vector *l)
+{
+	while (tn->slen < l->slen) {
+		tn->slen = l->slen;
+		tn = node_parent(tn);
+		if (!tn)
+			break;
+	}
+}
+
 /* Add a child at position i overwriting the old value.
  * Update the value of full_children and empty_children.
+ *
+ * The suffix length of the parent node and the rest of the tree is
+ * updated (if required) when adding/replacing a node, but is caller's
+ * responsibility on removal.
  */
 static void put_child(struct key_vector *tn, unsigned long i,
 		      struct key_vector *n)
@@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long i,
 	else if (!wasfull && isfull)
 		tn_info(tn)->full_children++;

-	if (n && (tn->slen < n->slen))
-		tn->slen = n->slen;
+	if (n)
+		node_push_suffix(tn, n);

 	rcu_assign_pointer(tn->tnode[i], n);
 }
@@ -919,34 +933,35 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 	if (max_work != MAX_WORK)
 		return tp;

-	/* push the suffix length to the parent node */
-	if (tn->slen > tn->pos) {
-		unsigned char slen = update_suffix(tn);
-
-		if (slen > tp->slen)
-			tp->slen = slen;
-	}
-
 	return tp;
 }

-static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
+static void node_set_suffix(struct key_vector *tp, unsigned char slen)
 {
-	while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
-		if (update_suffix(tp) > l->slen)
+	if (slen > tp->slen)
+		tp->slen = slen;
+}
+
+static void node_pull_suffix(struct key_vector *tn)
+{
+	struct key_vector *tp;
+	unsigned char slen;
+
+	slen = update_suffix(tn);
+	tp = node_parent(tn);
+	while ((tp->slen > tp->pos) && (tp->slen > slen)) {
+		if (update_suffix(tp) > slen)
 			break;
 		tp = node_parent(tp);
 	}
 }

-static void leaf_push_suffix(struct key_vector *tn, struct key_vector *l)
+static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
 {
-	/* if this is a new leaf then tn will be NULL and we can sort
-	 * out parent suffix lengths as a part of trie_rebalance
-	 */
-	while (tn->slen < l->slen) {
-		tn->slen = l->slen;
-		tn = node_parent(tn);
+	while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
+		if (update_suffix(tp) > l->slen)
+			break;
+		tp = node_parent(tp);
 	}
 }

@@ -1107,7 +1122,7 @@ static int fib_insert_alias(struct trie *t, struct key_vector *tp,
 	/* if we added to the tail node then we need to update slen */
 	if (l->slen < new->fa_slen) {
 		l->slen = new->fa_slen;
-		leaf_push_suffix(tp, l);
+		node_push_suffix(tp, l);
 	}

 	return 0;
@@ -1495,12 +1510,15 @@ static void fib_remove_alias(struct trie *t, struct key_vector *tp,
 	/* remove the fib_alias from the list */
 	hlist_del_rcu(&old->fa_list);

-	/* if we emptied the list this leaf will be freed and we can sort
-	 * out parent suffix lengths as a part of trie_rebalance
-	 */
+	/* if we emptied the list this leaf will be freed */
 	if (hlist_empty(&l->leaf)) {
 		put_child_root(tp, l->key, NULL);
 		node_free(l);
+		/* only need to update suffixes if this alias was
+		 * possibly the one with the largest suffix in the parent
+		 */
+		if (tp->slen == old->fa_slen)
+			node_pull_suffix(tp);
 		trie_rebalance(t, tp);
 		return;
 	}
@@ -1783,6 +1801,16 @@ void fib_table_flush_external(struct fib_table *tb)
 			if (IS_TRIE(pn))
 				break;

+			/* push the suffix length to the parent node,
+			 * since a previous leaf removal may have
+			 * caused it to change
+			 */
+			if (pn->slen > pn->pos) {
+				unsigned char slen = update_suffix(pn);
+
+				node_set_suffix(node_parent(pn), slen);
+			}
+
 			/* resize completed node */
 			pn = resize(t, pn);
 			cindex = get_index(pkey, pn);
@@ -1849,6 +1877,16 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
 			if (IS_TRIE(pn))
 				break;

+			/* push the suffix length to the parent node,
+			 * since a previous leaf removal may have
+			 * caused it to change
+			 */
+			if (pn->slen > pn->pos) {
+				unsigned char slen = update_suffix(pn);
+
+				node_set_suffix(node_parent(pn), slen);
+			}
+
 			/* resize completed node */
 			pn = resize(t, pn);
 			cindex = get_index(pkey, pn);
-- 
2.1.4

^ permalink raw reply related

* [patch net / RFC] net: fec: increase frame size limitation to actually available buffer
From: Nikita Yushchenko @ 2016-11-29 18:35 UTC (permalink / raw)
  To: David S. Miller, Fugang Duan, Troy Kisky, Andrew Lunn,
	Eric Nelson, Philippe Reynes, Johannes Berg, netdev
  Cc: Chris Healy, Fabio Estevam, linux-kernel, Nikita Yushchenko

Fec driver uses Rx buffers of 2k, but programs hardware to limit
incoming frames to 1522 bytes. This raises issues when FEC device
is used with DSA (since DSA tag can make frame larger), and also
disallows manual sending and receiving larger frames.

This patch removes the limitation, allowing Rx size up to entire buffer.
At the same time possible Tx size is increased as well, because hardware
uses the same register field to limit Rx and Tx.

Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
---
 drivers/net/ethernet/freescale/fec_main.c | 33 +++++++++++--------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 73ac35780611..c09789a71024 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -171,30 +171,12 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 #endif
 #endif /* CONFIG_M5272 */
 
-/* The FEC stores dest/src/type/vlan, data, and checksum for receive packets.
- */
-#define PKT_MAXBUF_SIZE		1522
-#define PKT_MINBUF_SIZE		64
-#define PKT_MAXBLR_SIZE		1536
-
 /* FEC receive acceleration */
 #define FEC_RACC_IPDIS		(1 << 1)
 #define FEC_RACC_PRODIS		(1 << 2)
 #define FEC_RACC_SHIFT16	BIT(7)
 #define FEC_RACC_OPTIONS	(FEC_RACC_IPDIS | FEC_RACC_PRODIS)
 
-/*
- * The 5270/5271/5280/5282/532x RX control register also contains maximum frame
- * size bits. Other FEC hardware does not, so we need to take that into
- * account when setting it.
- */
-#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
-    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM)
-#define	OPT_FRAME_SIZE	(PKT_MAXBUF_SIZE << 16)
-#else
-#define	OPT_FRAME_SIZE	0
-#endif
-
 /* FEC MII MMFR bits definition */
 #define FEC_MMFR_ST		(1 << 30)
 #define FEC_MMFR_OP_READ	(2 << 28)
@@ -847,7 +829,8 @@ static void fec_enet_enable_ring(struct net_device *ndev)
 	for (i = 0; i < fep->num_rx_queues; i++) {
 		rxq = fep->rx_queue[i];
 		writel(rxq->bd.dma, fep->hwp + FEC_R_DES_START(i));
-		writel(PKT_MAXBLR_SIZE, fep->hwp + FEC_R_BUFF_SIZE(i));
+		writel(FEC_ENET_RX_FRSIZE - fep->rx_align,
+		       fep->hwp + FEC_R_BUFF_SIZE(i));
 
 		/* enable DMA1/2 */
 		if (i)
@@ -895,9 +878,17 @@ fec_restart(struct net_device *ndev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	u32 val;
 	u32 temp_mac[2];
-	u32 rcntl = OPT_FRAME_SIZE | 0x04;
+	u32 rcntl = 0x04;
 	u32 ecntl = 0x2; /* ETHEREN */
 
+	/* The 5270/5271/5280/5282/532x RX control register also contains
+	 * maximum frame * size bits. Other FEC hardware does not.
+	 */
+#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
+    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM)
+	rcntl |= (FEC_ENET_RX_FRSIZE - fep->rx_align) << 16;
+#endif
+
 	/* Whack a reset.  We should wait for this.
 	 * For i.MX6SX SOC, enet use AXI bus, we use disable MAC
 	 * instead of reset MAC itself.
@@ -953,7 +944,7 @@ fec_restart(struct net_device *ndev)
 		else
 			val &= ~FEC_RACC_OPTIONS;
 		writel(val, fep->hwp + FEC_RACC);
-		writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
+		writel(FEC_ENET_RX_FRSIZE - fep->rx_align, fep->hwp + FEC_FTRL);
 	}
 #endif
 
-- 
2.1.4

^ permalink raw reply related

* [mm PATCH 3/3] mm: Add documentation for page fragment APIs
From: Alexander Duyck @ 2016-11-29 18:23 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: netdev, edumazet, davem, jeffrey.t.kirsher, linux-kernel
In-Reply-To: <20161129182010.13445.31256.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This is a first pass at trying to add documentation for the page_frag APIs.
They may still change over time but for now I thought I would try to get
these documented so that as more network drivers and stack calls make use
of them we have one central spot to document how they are meant to be used.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 Documentation/vm/page_frags |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 Documentation/vm/page_frags

diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags
new file mode 100644
index 000000000000..a6714565dbf9
--- /dev/null
+++ b/Documentation/vm/page_frags
@@ -0,0 +1,42 @@
+Page fragments
+--------------
+
+A page fragment is an arbitrary-length arbitrary-offset area of memory
+which resides within a 0 or higher order compound page.  Multiple
+fragments within that page are individually refcounted, in the page's
+reference counter.
+
+The page_frag functions, page_frag_alloc and page_frag_free, provide a
+simple allocation framework for page fragments.  This is used by the
+network stack and network device drivers to provide a backing region of
+memory for use as either an sk_buff->head, or to be used in the "frags"
+portion of skb_shared_info.
+
+In order to make use of the page fragment APIs a backing page fragment
+cache is needed.  This provides a central point for the fragment allocation
+and tracks allows multiple calls to make use of a cached page.  The
+advantage to doing this is that multiple calls to get_page can be avoided
+which can be expensive at allocation time.  However due to the nature of
+this caching it is required that any calls to the cache be protected by
+either a per-cpu limitation, or a per-cpu limitation and forcing interrupts
+to be disabled when executing the fragment allocation.
+
+The network stack uses two separate caches per CPU to handle fragment
+allocation.  The netdev_alloc_cache is used by callers making use of the
+__netdev_alloc_frag and __netdev_alloc_skb calls.  The napi_alloc_cache is
+used by callers of the __napi_alloc_frag and __napi_alloc_skb calls.  The
+main difference between these two calls is the context in which they may be
+called.  The "netdev" prefixed functions are usable in any context as these
+functions will disable interrupts, while the "napi" prefixed functions are
+only usable within the softirq context.
+
+Many network device drivers use a similar methodology for allocating page
+fragments, but the page fragments are cached at the ring or descriptor
+level.  In order to enable these cases it is necessary to provide a generic
+way of tearing down a page cache.  For this reason __page_frag_cache_drain
+was implemented.  It allows for freeing multiple references from a single
+page via a single call.  The advantage to doing this is that it allows for
+cleaning up the multiple references that were added to a page in order to
+avoid calling get_page per allocation.
+
+Alexander Duyck, Nov 29, 2016.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [mm PATCH 2/3] mm: Rename __page_frag functions to __page_frag_cache, drop order from drain
From: Alexander Duyck @ 2016-11-29 18:23 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: netdev, edumazet, davem, jeffrey.t.kirsher, linux-kernel
In-Reply-To: <20161129182010.13445.31256.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch does two things.

First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or refilling
the cache, not the frags themselves.

Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c |    6 +++---
 include/linux/gfp.h                       |    3 +--
 mm/page_alloc.c                           |   13 +++++++------
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 5e66cdeb7ee3..7363503eab80 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3962,8 +3962,8 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
 				     PAGE_SIZE,
 				     DMA_FROM_DEVICE,
 				     DMA_ATTR_SKIP_CPU_SYNC);
-		__page_frag_drain(buffer_info->page, 0,
-				  buffer_info->pagecnt_bias);
+		__page_frag_cache_drain(buffer_info->page,
+					buffer_info->pagecnt_bias);
 
 		buffer_info->page = NULL;
 	}
@@ -6987,7 +6987,7 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
 		dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
 				     PAGE_SIZE, DMA_FROM_DEVICE,
 				     DMA_ATTR_SKIP_CPU_SYNC);
-		__page_frag_drain(page, 0, rx_buffer->pagecnt_bias);
+		__page_frag_cache_drain(page, rx_buffer->pagecnt_bias);
 	}
 
 	/* clear contents of rx_buffer */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6238c74e0a01..884080404e24 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -506,8 +506,7 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 struct page_frag_cache;
-extern void __page_frag_drain(struct page *page, unsigned int order,
-			      unsigned int count);
+extern void __page_frag_cache_drain(struct page *page, unsigned int count);
 extern void *page_frag_alloc(struct page_frag_cache *nc,
 			     unsigned int fragsz, gfp_t gfp_mask);
 extern void page_frag_free(void *addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4218795a4694..9559f52e740d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3896,8 +3896,8 @@ void free_pages(unsigned long addr, unsigned int order)
  * drivers to provide a backing region of memory for use as either an
  * sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
  */
-static struct page *__page_frag_refill(struct page_frag_cache *nc,
-				       gfp_t gfp_mask)
+static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
+					     gfp_t gfp_mask)
 {
 	struct page *page = NULL;
 	gfp_t gfp = gfp_mask;
@@ -3917,19 +3917,20 @@ static struct page *__page_frag_refill(struct page_frag_cache *nc,
 	return page;
 }
 
-void __page_frag_drain(struct page *page, unsigned int order,
-		       unsigned int count)
+void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
 	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
 
 	if (page_ref_sub_and_test(page, count)) {
+		unsigned int order = compound_order(page);
+
 		if (order == 0)
 			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
 	}
 }
-EXPORT_SYMBOL(__page_frag_drain);
+EXPORT_SYMBOL(__page_frag_cache_drain);
 
 void *page_frag_alloc(struct page_frag_cache *nc,
 		      unsigned int fragsz, gfp_t gfp_mask)
@@ -3940,7 +3941,7 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 
 	if (unlikely(!nc->va)) {
 refill:
-		page = __page_frag_refill(nc, gfp_mask);
+		page = __page_frag_cache_refill(nc, gfp_mask);
 		if (!page)
 			return NULL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [mm PATCH 1/3] mm: Rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
From: Alexander Duyck @ 2016-11-29 18:23 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: netdev, edumazet, davem, jeffrey.t.kirsher, linux-kernel
In-Reply-To: <20161129182010.13445.31256.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch renames the page frag functions to be more consistent with other
APIs.  Specifically we place the name page_frag first in the name and then
have either an alloc or free call name that we append as the suffix.  This
makes it a bit clearer in terms of naming.

In addition we drop the leading double underscores since we are technically
no longer a backing interface and instead the front end that is called from
the networking APIs.

The last bit I changed is I rebased page_frag_free to actually function
very similar to the function free_pages, the only real difference now is
the fact that we have to get the page order by calling compound page
instead of having it passed as a part of the function call.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/gfp.h    |    6 +++---
 include/linux/skbuff.h |    2 +-
 mm/page_alloc.c        |   20 ++++++++++++--------
 net/core/skbuff.c      |    8 ++++----
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..6238c74e0a01 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -508,9 +508,9 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 struct page_frag_cache;
 extern void __page_frag_drain(struct page *page, unsigned int order,
 			      unsigned int count);
-extern void *__alloc_page_frag(struct page_frag_cache *nc,
-			       unsigned int fragsz, gfp_t gfp_mask);
-extern void __free_page_frag(void *addr);
+extern void *page_frag_alloc(struct page_frag_cache *nc,
+			     unsigned int fragsz, gfp_t gfp_mask);
+extern void page_frag_free(void *addr);
 
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9c535fbccf2c..95799826a1e7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2471,7 +2471,7 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 
 static inline void skb_free_frag(void *addr)
 {
-	__free_page_frag(addr);
+	page_frag_free(addr);
 }
 
 void *napi_alloc_frag(unsigned int fragsz);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb668eab5ee4..4218795a4694 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3931,8 +3931,8 @@ void __page_frag_drain(struct page *page, unsigned int order,
 }
 EXPORT_SYMBOL(__page_frag_drain);
 
-void *__alloc_page_frag(struct page_frag_cache *nc,
-			unsigned int fragsz, gfp_t gfp_mask)
+void *page_frag_alloc(struct page_frag_cache *nc,
+		      unsigned int fragsz, gfp_t gfp_mask)
 {
 	unsigned int size = PAGE_SIZE;
 	struct page *page;
@@ -3983,19 +3983,23 @@ void *__alloc_page_frag(struct page_frag_cache *nc,
 
 	return nc->va + offset;
 }
-EXPORT_SYMBOL(__alloc_page_frag);
+EXPORT_SYMBOL(page_frag_alloc);
 
 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.
  */
-void __free_page_frag(void *addr)
+void page_frag_free(void *addr)
 {
-	struct page *page = virt_to_head_page(addr);
+	struct page *page;
+
+	if (addr != 0) {
+		VM_BUG_ON(!virt_addr_valid(addr));
+		page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page)))
-		__free_pages_ok(page, compound_order(page));
+		__free_pages(page, compound_order(page));
+	}
 }
-EXPORT_SYMBOL(__free_page_frag);
+EXPORT_SYMBOL(page_frag_free);
 
 static void *make_alloc_exact(unsigned long addr, unsigned int order,
 		size_t size)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4c96cb18c214..6cf779a9ad4c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -369,7 +369,7 @@ static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 
 	local_irq_save(flags);
 	nc = this_cpu_ptr(&netdev_alloc_cache);
-	data = __alloc_page_frag(nc, fragsz, gfp_mask);
+	data = page_frag_alloc(nc, fragsz, gfp_mask);
 	local_irq_restore(flags);
 	return data;
 }
@@ -391,7 +391,7 @@ static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
-	return __alloc_page_frag(&nc->page, fragsz, gfp_mask);
+	return page_frag_alloc(&nc->page, fragsz, gfp_mask);
 }
 
 void *napi_alloc_frag(unsigned int fragsz)
@@ -441,7 +441,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	local_irq_save(flags);
 
 	nc = this_cpu_ptr(&netdev_alloc_cache);
-	data = __alloc_page_frag(nc, len, gfp_mask);
+	data = page_frag_alloc(nc, len, gfp_mask);
 	pfmemalloc = nc->pfmemalloc;
 
 	local_irq_restore(flags);
@@ -505,7 +505,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = __alloc_page_frag(&nc->page, len, gfp_mask);
+	data = page_frag_alloc(&nc->page, len, gfp_mask);
 	if (unlikely(!data))
 		return NULL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [mm PATCH 0/3] Page fragment updates
From: Alexander Duyck @ 2016-11-29 18:23 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: netdev, edumazet, davem, jeffrey.t.kirsher, linux-kernel

This patch series takes care of a few cleanups for the page fragments API.

First we do some renames so that things are much more consistent.  First we
move the page_frag_ portion of the name to the front of the functions
names.  Secondly we split out the cache specific functions from the other
page fragment functions by adding the word "cache" to the name.

Second I did some minor clean-up on the function calls so that they are
more inline with the existing __free_pages calls in terms of how they
operate.

Finally I added a bit of documentation that will hopefully help to explain
some of this.  I plan to revisit this later as we get things more ironed
out in the near future with the changes planned for the DMA setup to
support eXpress Data Path.

---

Alexander Duyck (3):
      mm: Rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
      mm: Rename __page_frag functions to __page_frag_cache, drop order from drain
      mm: Add documentation for page fragment APIs

 Documentation/vm/page_frags               |   42 +++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/igb/igb_main.c |    6 ++--
 include/linux/gfp.h                       |    9 +++---
 include/linux/skbuff.h                    |    2 +
 mm/page_alloc.c                           |   33 +++++++++++++----------
 net/core/skbuff.c                         |    8 +++---
 6 files changed, 73 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/vm/page_frags

--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Add 88E6176 device tree support
From: Andrew Lunn @ 2016-11-29 18:20 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Uwe Kleine-König, Rob Herring, Frank Rowand,
	Andreas Färber, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Michal Hrusecki, Tomas Hlavacek, Bed??icha Ko??atu,
	Florian Fainelli, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <87oa0yb29b.fsf-BOtmAI95c/+pXNIQCVAXCG0Lkn3mC4nZ0tOlhedn3YvkypF1WZHjJXhe7Zk3YmMvjmZSf7Nhrd8@public.gmane.org>

On Tue, Nov 29, 2016 at 12:54:24PM -0500, Vivien Didelot wrote:
> Hi,
> 
> Uwe Kleine-König <uwe-rXY34ruvC2xidJT2blvkqNi2O/JbrIOy@public.gmane.org> writes:
> 
> > Also it seems wrong to write "marvell,mv88e6085" (only) if I know the
> > hardware is really a "marvell,mv88e6176".
> 
> I agree. It might be complex for a user to dig into the driver in order
> to figure out how the switch ID is read and which compatible to choose.
> 
> I've sent a patch to change this https://lkml.org/lkml/2016/6/8/1198 but
> Andrew had a stronger opinion on compatible strings, which makes sense.
> 
> >> Linus has said he does not like ARM devices because of all the busses
> >> which are not enumerable. Here we have a device which with a little
> >> bit of help we can enumerate. So we should. 
> >
> > If you write
> >
> > 	compatible = "marvell,mv88e6176", "marvell,mv88e6085";
> >
> > you can still enumerate in the same way as before.
> 
> So we don't break the existing DTS files, I like this.
> 
> The driver already prints info about the detected switch. Instead of
> failing at probe, which seems against the notion of compatible and
> breaks the existing behavior, it could report the eventual mismatch?

I'm still against it.

Say we let the driver probe and warn when the compatible string is
wrong. Is this likely to get fixed? Probably not, it is just a
warning, people ignore them.

A few years later, we have accumulated a number of broken device
trees. And suddenly we really do have a Marvell device with a broken
ID in its port register, or more likely, the revision number was not
incremented but there was incompatible register changes. We suddenly
do have to look at the compatible string. But we know many are actually
wrong. How do we know which to trust? We basically have to say, if the
compatible is "marvell,mv88e6042" we trust it, all the others we don't
trust and ignore.

Isn't that madness?

What we have today is verified correct. If this hypothetical
"marvell,mv88e6042" does happen, then no problems, we add a compatible
string for it, and it works.

We should probably add a comment to the mv88e6xxx_of_match array,
explaining how it currently works.

      Andrew
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 2/2] tcp: allow to turn tcp timestamp randomization off
From: Eric Dumazet @ 2016-11-29 18:07 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev
In-Reply-To: <1480434342-12367-2-git-send-email-fw@strlen.de>

On Tue, 2016-11-29 at 16:45 +0100, Florian Westphal wrote:
> Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
> flows, we can deduct the relative qdisc delays (think of fq pacing).
> This should work even if we have one flow per remote peer."
> 
> Having random per flow (or host) offsets doesn't allow that anymore so add
> a way to turn this off.
> 
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Excellent, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH net-next 1/2] tcp: randomize tcp timestamp offsets for each connection
From: Eric Dumazet @ 2016-11-29 18:06 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Mirja Kühlewind
In-Reply-To: <1480434342-12367-1-git-send-email-fw@strlen.de>

On Tue, 2016-11-29 at 16:45 +0100, Florian Westphal wrote:
> jiffies based timestamps allow for easy inference of number of devices
> behind NAT translators and also makes tracking of hosts simpler.
> 
> commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
> added the main infrastructure that is needed for per-connection ts
> randomization, in particular writing/reading the on-wire tcp header
> format takes the offset into account so rest of stack can use normal
> tcp_time_stamp (jiffies).
> 
> So only two items are left:
>  - add a tsoffset for request sockets
>  - extend the tcp isn generator to also return another 32bit number
>    in addition to the ISN.
> 
> Re-use of ISN generator also means timestamps are still monotonically
> increasing for same connection quadruple, i.e. PAWS will still work.
> 
> Includes fixes from Eric Dumazet.
> 
> Cc: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Very nice work, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH net-next v3 2/2] net: phy: bcm7xxx: Plug in support for reading PHY error counters
From: Florian Fainelli @ 2016-11-29 17:57 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju, Florian Fainelli
In-Reply-To: <20161129175718.20213-1-f.fainelli@gmail.com>

Broadcom BCM7xxx internal PHYs can leverage the Broadcom PHY library
module PHY error counters helper functions, just implement the
appropriate PHY driver function calls to do so. We need to allocate some
storage space for our PHY statistics, and provide it to the Broadcom PHY
library, so do this in a specific probe function, and slightly wrap the
get_stats function call.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/bcm7xxx.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index 5b3be4c67be8..aae00bde5980 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -45,6 +45,10 @@
 #define AFE_VDAC_OTHERS_0		MISC_ADDR(0x39, 3)
 #define AFE_HPF_TRIM_OTHERS		MISC_ADDR(0x3a, 0)
 
+struct bcm7xxx_phy_priv {
+	u64	*stats;
+};
+
 static void r_rc_cal_reset(struct phy_device *phydev)
 {
 	/* Reset R_CAL/RC_CAL Engine */
@@ -350,6 +354,33 @@ static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
 	return genphy_restart_aneg(phydev);
 }
 
+static void bcm7xxx_28nm_get_phy_stats(struct phy_device *phydev,
+				       struct ethtool_stats *stats, u64 *data)
+{
+	struct bcm7xxx_phy_priv *priv = phydev->priv;
+
+	bcm_phy_get_stats(phydev, priv->stats, stats, data);
+}
+
+static int bcm7xxx_28nm_probe(struct phy_device *phydev)
+{
+	struct bcm7xxx_phy_priv *priv;
+
+	priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	phydev->priv = priv;
+
+	priv->stats = devm_kcalloc(&phydev->mdio.dev,
+				   bcm_phy_get_sset_count(phydev), sizeof(u64),
+				   GFP_KERNEL);
+	if (!priv->stats)
+		return -ENOMEM;
+
+	return 0;
+}
+
 #define BCM7XXX_28NM_GPHY(_oui, _name)					\
 {									\
 	.phy_id		= (_oui),					\
@@ -364,6 +395,10 @@ static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
 	.resume		= bcm7xxx_28nm_resume,				\
 	.get_tunable	= bcm7xxx_28nm_get_tunable,			\
 	.set_tunable	= bcm7xxx_28nm_set_tunable,			\
+	.get_sset_count	= bcm_phy_get_sset_count,			\
+	.get_strings	= bcm_phy_get_strings,				\
+	.get_stats	= bcm7xxx_28nm_get_phy_stats,			\
+	.probe		= bcm7xxx_28nm_probe,				\
 }
 
 #define BCM7XXX_40NM_EPHY(_oui, _name)					\
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v3 1/2] net: phy: broadcom: Add support code for reading PHY counters
From: Florian Fainelli @ 2016-11-29 17:57 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju, Florian Fainelli
In-Reply-To: <20161129175718.20213-1-f.fainelli@gmail.com>

Broadcom PHYs expose a number of PHY error counters: receive errors,
false carrier sense, SerDes BER count, local and remote receive errors.
Add support code to allow retrieving these error counters. Since the
Broadcom PHY library code is used by several drivers, make it possible
for them to specify the storage for the software copy of the statistics.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/bcm-phy-lib.c | 70 +++++++++++++++++++++++++++++++++++++++++++
 drivers/net/phy/bcm-phy-lib.h |  5 ++++
 include/linux/brcmphy.h       |  3 ++
 3 files changed, 78 insertions(+)

diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
index 3156ce6d5861..ab9ad689617c 100644
--- a/drivers/net/phy/bcm-phy-lib.c
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -17,6 +17,7 @@
 #include <linux/mdio.h>
 #include <linux/module.h>
 #include <linux/phy.h>
+#include <linux/ethtool.h>
 
 #define MII_BCM_CHANNEL_WIDTH     0x2000
 #define BCM_CL45VEN_EEE_ADV       0x3c
@@ -317,6 +318,75 @@ int bcm_phy_downshift_set(struct phy_device *phydev, u8 count)
 }
 EXPORT_SYMBOL_GPL(bcm_phy_downshift_set);
 
+struct bcm_phy_hw_stat {
+	const char *string;
+	u8 reg;
+	u8 shift;
+	u8 bits;
+};
+
+/* Counters freeze at either 0xffff or 0xff, better than nothing */
+static const struct bcm_phy_hw_stat bcm_phy_hw_stats[] = {
+	{ "phy_receive_errors", MII_BRCM_CORE_BASE12, 0, 16 },
+	{ "phy_serdes_ber_errors", MII_BRCM_CORE_BASE13, 8, 8 },
+	{ "phy_false_carrier_sense_errors", MII_BRCM_CORE_BASE13, 0, 8 },
+	{ "phy_local_rcvr_nok", MII_BRCM_CORE_BASE14, 8, 8 },
+	{ "phy_remote_rcv_nok", MII_BRCM_CORE_BASE14, 0, 8 },
+};
+
+int bcm_phy_get_sset_count(struct phy_device *phydev)
+{
+	return ARRAY_SIZE(bcm_phy_hw_stats);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_get_sset_count);
+
+void bcm_phy_get_strings(struct phy_device *phydev, u8 *data)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(bcm_phy_hw_stats); i++)
+		memcpy(data + i * ETH_GSTRING_LEN,
+		       bcm_phy_hw_stats[i].string, ETH_GSTRING_LEN);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_get_strings);
+
+#ifndef UINT64_MAX
+#define UINT64_MAX              (u64)(~((u64)0))
+#endif
+
+/* Caller is supposed to provide appropriate storage for the library code to
+ * access the shadow copy
+ */
+static u64 bcm_phy_get_stat(struct phy_device *phydev, u64 *shadow,
+			    unsigned int i)
+{
+	struct bcm_phy_hw_stat stat = bcm_phy_hw_stats[i];
+	int val;
+	u64 ret;
+
+	val = phy_read(phydev, stat.reg);
+	if (val < 0) {
+		ret = UINT64_MAX;
+	} else {
+		val >>= stat.shift;
+		val = val & ((1 << stat.bits) - 1);
+		shadow[i] += val;
+		ret = shadow[i];
+	}
+
+	return ret;
+}
+
+void bcm_phy_get_stats(struct phy_device *phydev, u64 *shadow,
+		       struct ethtool_stats *stats, u64 *data)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(bcm_phy_hw_stats); i++)
+		data[i] = bcm_phy_get_stat(phydev, shadow, i);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_get_stats);
+
 MODULE_DESCRIPTION("Broadcom PHY Library");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Broadcom Corporation");
diff --git a/drivers/net/phy/bcm-phy-lib.h b/drivers/net/phy/bcm-phy-lib.h
index a117f657c6d7..7c73808cbbde 100644
--- a/drivers/net/phy/bcm-phy-lib.h
+++ b/drivers/net/phy/bcm-phy-lib.h
@@ -42,4 +42,9 @@ int bcm_phy_downshift_get(struct phy_device *phydev, u8 *count);
 
 int bcm_phy_downshift_set(struct phy_device *phydev, u8 count);
 
+int bcm_phy_get_sset_count(struct phy_device *phydev);
+void bcm_phy_get_strings(struct phy_device *phydev, u8 *data);
+void bcm_phy_get_stats(struct phy_device *phydev, u64 *shadow,
+		       struct ethtool_stats *stats, u64 *data);
+
 #endif /* _LINUX_BCM_PHY_LIB_H */
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index f9f8aaf9c943..4f7d8be9ddbf 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -244,6 +244,9 @@
 #define LPI_FEATURE_EN_DIG1000X		0x4000
 
 /* Core register definitions*/
+#define MII_BRCM_CORE_BASE12	0x12
+#define MII_BRCM_CORE_BASE13	0x13
+#define MII_BRCM_CORE_BASE14	0x14
 #define MII_BRCM_CORE_BASE1E	0x1E
 #define MII_BRCM_CORE_EXPB0	0xB0
 #define MII_BRCM_CORE_EXPB1	0xB1
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v3 0/2] net: phy: broadcom: Support for PHY counters
From: Florian Fainelli @ 2016-11-29 17:57 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju, Florian Fainelli

Hi all,

This patch series adds support for reading the Broadcom PHYs internal counters.

Changes in v3:

- fixed the allocation of priv->stats in bcm7xxx

Changes in v2:

- fixed modular build reported by kbuild

- constify array of stats

Florian Fainelli (2):
  net: phy: broadcom: Add support code for reading PHY counters
  net: phy: bcm7xxx: Plug in support for reading PHY error counters

 drivers/net/phy/bcm-phy-lib.c | 70 +++++++++++++++++++++++++++++++++++++++++++
 drivers/net/phy/bcm-phy-lib.h |  5 ++++
 drivers/net/phy/bcm7xxx.c     | 35 ++++++++++++++++++++++
 include/linux/brcmphy.h       |  3 ++
 4 files changed, 113 insertions(+)

-- 
2.9.3

^ permalink raw reply

* [PATCH] net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.
From: Lorenzo Colitti @ 2016-11-29 17:56 UTC (permalink / raw)
  To: netdev; +Cc: davem, erezsh, Lorenzo Colitti

Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP
protocols.") made __build_flow_key call sock_net(sk) to determine
the network namespace of the passed-in socket. This crashes if sk
is NULL.

Fix this by getting the network namespace from the skb instead.

Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d37fc6f..6402d74 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -531,13 +531,14 @@ static void __build_flow_key(const struct net *net, struct flowi4 *fl4,
 static void build_skb_flow_key(struct flowi4 *fl4, const struct sk_buff *skb,
 			       const struct sock *sk)
 {
+	const struct net *net = dev_net(skb->dev);
 	const struct iphdr *iph = ip_hdr(skb);
 	int oif = skb->dev->ifindex;
 	u8 tos = RT_TOS(iph->tos);
 	u8 prot = iph->protocol;
 	u32 mark = skb->mark;

-	__build_flow_key(sock_net(sk), fl4, sk, iph, oif, tos, prot, mark, 0);
+	__build_flow_key(net, fl4, sk, iph, oif, tos, prot, mark, 0);
 }

 static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox