Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 0/3] adding new glue driver dwmac-dwc-qos-eth
From: Joao Pinto @ 2017-01-04 13:46 UTC (permalink / raw)
  To: Niklas Cassel, Joao Pinto, davem; +Cc: larper, swarren, treding, netdev
In-Reply-To: <34837f5b-afb8-92df-f12e-2fa5514558be@axis.com>

Às 1:43 PM de 1/4/2017, Niklas Cassel escreveu:
> Let's see if patchwork is smart enough to add the tag to the whole series.
> 
> Tested-by: Niklas Cassel <niklas.cassel@axis.com>

I got this compile error due to dwmac-socfpga twist:

config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c: In function
'socfpga_dwmac_probe':
>> drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c:344:28: error: 'struct
stmmac_priv' has no member named 'stmmac_rst'
     dwmac->stmmac_rst = stpriv->stmmac_rst;

I am going to fix socfpga and submit v3 :). Hopefully everything' got ok!

Thanks

> 
> On 01/04/2017 12:48 PM, Joao Pinto wrote:
>> This patch set contains the porting of the synopsys/dwc_eth_qos.c driver
>> to the stmmac structure. This operation resulted in the creation of a new
>> platform glue driver called dwmac-dwc-qos-eth which was based in the
>> dwc_eth_qos as is.
>>
>> dwmac-dwc-qos-eth inherited dwc_eth_qos DT bindings, to assure that current
>> and old users can continue to use it as before. We can see this driver as
>> being deprecated, since all new development will be done in stmmac.
>>
>> Please check each patch for implementation details.
>>
>> Joao Pinto (3):
>>   stmmac: adding DT parameter for LPI tx clock gating
>>   stmmac: move stmmac_clk, pclk, clk_ptp_ref and stmmac_rst to platform
>>     structure
>>   stmmac: adding new glue driver dwmac-dwc-qos-eth
>>
>>  .../bindings/net/snps,dwc-qos-ethernet.txt         |   3 +
>>  Documentation/devicetree/bindings/net/stmmac.txt   |   2 +
>>  drivers/net/ethernet/stmicro/stmmac/Kconfig        |   9 +
>>  drivers/net/ethernet/stmicro/stmmac/Makefile       |   1 +
>>  drivers/net/ethernet/stmicro/stmmac/common.h       |   3 +-
>>  .../ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c    | 200 +++++++++++++++++++++
>>  .../net/ethernet/stmicro/stmmac/dwmac1000_core.c   |   5 +-
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4.h       |   1 +
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c  |   6 +-
>>  drivers/net/ethernet/stmicro/stmmac/stmmac.h       |   5 -
>>  .../net/ethernet/stmicro/stmmac/stmmac_ethtool.c   |   4 +-
>>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |  85 ++-------
>>  .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |  65 ++++++-
>>  include/linux/stmmac.h                             |   6 +
>>  14 files changed, 314 insertions(+), 81 deletions(-)
>>  create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
>>
> 

^ permalink raw reply

* Re: [RFC 3/4] net-next: dsa: Add support for multiple cpu ports.
From: Andrew Lunn @ 2017-01-04 14:01 UTC (permalink / raw)
  To: John Crispin; +Cc: David S. Miller, Florian Fainelli, Vivien Didelot, netdev
In-Reply-To: <1483515484-21793-4-git-send-email-john@phrozen.org>

On Wed, Jan 04, 2017 at 08:38:03AM +0100, John Crispin wrote:
> From: Andrew Lunn <andrew@lunn.ch>
> 
> Some boards have two CPU interfaces connected to the switch, e.g. WiFi
> access points, with 1 port labeled WAN, 4 ports labeled lan1-lan4, and
> two port connected to the SoC.
> 
> This patch extends DSA to allows both CPU ports to be used. The "cpu"
> node in the DSA tree can now have a phandle to the host interface it
> connects to. Each user port can have a phandle to a cpu port which
> should be used for traffic between the port and the CPU. Thus simple
> load sharing over the two CPU ports can be achieved.
> 
> Signed-off-by: John Crispin <john@phrozen.org>
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> ---
>  include/net/dsa.h  |   21 ++++++++++++++++++++-
>  net/dsa/dsa2.c     |   36 ++++++++++++++++++++++++++++++------
>  net/dsa/dsa_priv.h |    5 +++++
>  net/dsa/slave.c    |   27 ++++++++++++++++-----------
>  4 files changed, 71 insertions(+), 18 deletions(-)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index b122196..f68180b 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -60,6 +60,8 @@ struct dsa_chip_data {
>  	 */
>  	char		*port_names[DSA_MAX_PORTS];
>  	struct device_node *port_dn[DSA_MAX_PORTS];
> +	struct net_device *port_ethernet[DSA_MAX_PORTS];
> +	int		port_cpu[DSA_MAX_PORTS];

Hi John

My proof of concept patches have "bit rotted" a bit. When implementing
dsa2, i removed the use of dsa_chip_data, aka cd, from all the drivers
and the new binding does not use it at all. I don't want to add it
back again. When Florain removes the old binding in 6 months time, i
expect dsa_chip_data and dsa_platform_data will be removed.

I would be tempted to put these two members into the dsa_port
structure.

	Andrew

^ permalink raw reply

* Re: [PATCH] net-next: korina: Fix NAPI versus resources freeing
From: Florian Fainelli @ 2017-01-04 14:24 UTC (permalink / raw)
  To: John Crispin, David S. Miller; +Cc: netdev
In-Reply-To: <1483522278-36891-1-git-send-email-john@phrozen.org>

On 01/04/2017 01:31 AM, John Crispin wrote:
> From: Florian Fainelli <f.fainelli@gmail.com>
> 
> Commit beb0babfb77e ("korina: disable napi on close and restart")
> introduced calls to napi_disable() that were missing before,
> unfortunately this leaves a small window during which NAPI has a chance
> to run, yet we just freed resources since korina_free_ring() has been
> called:
> 
> Fix this by disabling NAPI first then freeing resource, and make sure
> that we also cancel the restart taks before doing the resource freeing.
> 
> Fixes: beb0babfb77e ("korina: disable napi on close and restart")

I submitted this patch to 'net' already, but thanks for doing it!
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 1/2] tools: psock_lib: tighten conditions checked in sock_setfilter
From: Willem de Bruijn @ 2017-01-04 14:27 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: linux-kselftest, Network Development, Daniel Borkmann,
	Willem de Bruijn, David Miller, shuah
In-Reply-To: <04d240e80be0455040f7b641e2c59ff86754edd1.1483482971.git.sowmini.varadhan@oracle.com>

On Tue, Jan 3, 2017 at 6:27 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> The bpf_prog used in sock_setfilter() only attempts to check for
> ip pktlen, and verifies that the contents of the 80'th packet in

80th byte.

> the ethernet frame is 'a' or 'b'. Offsets used for checking these
> conditions are incorrectly computed.

That's a bit strong. It's not incorrect, in that it just intended to
match the packets as generated by the test -- which it does.

> Thus many non-udp packets
> could incorrectly pass through this filter and cause the test to
> fail.

Absolutely. The test has many potential false positives. Thanks
for hardening it.

I run these kinds of tests in network namespaces to rule out
such flakiness from background traffic.

> This commit tightens the conditions checked by the filter so
> that only UDP/IPv4 packets with the matching length and test-character
> will be permitted by the filter. The filter has been cleaned up
> to explicitly use the BPF macros to make it more readable.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>

The few comments about the commit wording are no reason for a v2.
Thanks for improving the test.

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next 2/2] tools: psock_tpacket: verify that packet was received on lo before counting it
From: Willem de Bruijn @ 2017-01-04 14:30 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: linux-kselftest, Network Development, Daniel Borkmann,
	Willem de Bruijn, David Miller, shuah
In-Reply-To: <3451a2008d953f33d6576a35eaefecad883eaeb5.1483482971.git.sowmini.varadhan@oracle.com>

On Tue, Jan 3, 2017 at 6:27 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Packets from any/all interfaces may be queued up on the PF_PACKET socket
> before it is bound to the loopback interface by psock_tpacket, and
> when these are passed up by the kernel, they should not be counted
> toward the conditions needed to pass/fail the Rx tests.

The common and simpler solution to this problem is to open the socket
with protocol 0 to reject all packets, add the BPF filter and only then bind
with sll_ifindex set to lo. That way no false positives can arrive.

^ permalink raw reply

* Re: [PATCH v2] cpsw: ethtool: add support for getting/setting EEE registers
From: Florian Fainelli @ 2017-01-04 14:33 UTC (permalink / raw)
  To: Giuseppe CAVALLARO, Andrew Lunn, Yegor Yefremov
  Cc: netdev, linux-omap@vger.kernel.org, Grygorii Strashko,
	N, Mugunthan V, Rami Rosen, Fabrice GASNIER
In-Reply-To: <5809df57-997b-3531-838f-8c5a61605fe5@gmail.com>

On 12/02/2016 09:48 AM, Florian Fainelli wrote:
>>> Peppe, any thoughts on this?
>>
>> I share what you say.
>>
>> In sum, the EEE management inside the stmmac is:
>>
>> - the driver looks at own HW cap register if EEE is supported
>>
>>     (indeed the user could keep disable EEE if bugged on some HW
>>      + Alex, Fabrice: we had some patches for this to propose where we
>>              called the phy_ethtool_set_eee to disable feature at phy
>>              level
>>
>> - then the stmmac asks PHY layer to understand if transceiver and
>>   partners are EEE capable.
>>
>> - If all matches the EEE is actually initialized.
>>
>> the logic above should be respected when use ethtool, hmm, I will
>> check the stmmac_ethtool_op_set_eee asap.
>>
>> Hoping this is useful
> 
> This is definitively useful, the only part that I am struggling to
> understand in phy_init_eee() is this:
> 
>                 eee_adv = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
>                                                 MDIO_MMD_AN);
>                 if (eee_adv <= 0)
>                         goto eee_exit_err;
> 
> if we are not already advertising EEE in the PHY's MMIO_MMD_AN page, by
> the time we call phy_init_eee(), then we cannot complete the EEE
> configuration at the PHY level, and presumably we should abort the EEE
> configuration at the MAC level.
> 
> While this condition makes sense if e.g: you are re-negotiating the link
> with your partner for instance and if EEE was already advertised, the
> very first time this function is called, it seems to be like we should
> skip the check, because phy_init_eee() should actually tell us if, as a
> result of a successful check, we should be setting EEE as something we
> advertise?
> 
> Do you remember what was the logic behind this check when you added it?

Peppe, can you remember why phy_init_eee() was written in a way that you
need to have already locally advertised EEE for the function to
successfully return? Thank you!
-- 
Florian

^ permalink raw reply

* [PATCH] stmmac: Enable Clause 45 PHYs in GAMC4
From: Joao Pinto @ 2017-01-04 14:35 UTC (permalink / raw)
  To: davem; +Cc: hock.leong.kweh, netdev, Joao Pinto

The eQOS IP Core (best known in stmmac as gmac4) has a register that must be
set if using a Clause 45 PHY. If this register is not set, the PHY won't work.
This patch will have no impact in setups using Clause 22 PHYs.

Signed-off-by: Joao Pinto <jpinto@synopsys.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
index b0344c2..676ae3c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
@@ -41,6 +41,7 @@
 #define MII_GMAC4_GOC_SHIFT		2
 #define MII_GMAC4_WRITE			(1 << MII_GMAC4_GOC_SHIFT)
 #define MII_GMAC4_READ			(3 << MII_GMAC4_GOC_SHIFT)
+#define MII_CLAUSE45_PHY		(1 << 1)
 
 static int stmmac_mdio_busy_wait(void __iomem *ioaddr, unsigned int mii_addr)
 {
@@ -125,7 +126,7 @@ static int stmmac_mdio_write(struct mii_bus *bus, int phyaddr, int phyreg,
 	value |= (priv->clk_csr << priv->hw->mii.clk_csr_shift)
 		& priv->hw->mii.clk_csr_mask;
 	if (priv->plat->has_gmac4)
-		value |= MII_GMAC4_WRITE;
+		value |= MII_GMAC4_WRITE | MII_CLAUSE45_PHY;
 	else
 		value |= MII_WRITE;
 
-- 
2.9.3

^ permalink raw reply related

* Re: [RFC 4/4] net-next: dsa: qca8k: add support for multiple cpu ports
From: Andrew Lunn @ 2017-01-04 14:12 UTC (permalink / raw)
  To: John Crispin; +Cc: David S. Miller, Florian Fainelli, Vivien Didelot, netdev
In-Reply-To: <1483515484-21793-5-git-send-email-john@phrozen.org>

> +	/* Setup the cpu ports */
> +	for (i = 0; i < DSA_MAX_PORTS; i++) {
> +		struct net_device *netdev;
> +		int phy_mode = -1;
> +
> +		if (!dsa_is_cpu_port(ds, i))
> +			continue;
> +
> +		netdev = ds->dst->pd->chip->port_ethernet[i];
> +		if (!netdev) {
> +			pr_err("Can't find netdev for port%d\n", i);
> +			return -ENODEV;
> +		}
> +
> +		/* Initialize CPU port pad mode (xMII type, delays...) */
> +		phy_mode = of_get_phy_mode(netdev->dev.parent->of_node);
> +		if (phy_mode < 0) {
> +			pr_err("Can't find phy-mode for port:%d\n", i);
> +			return phy_mode;
> +		}

Hi John

We try to avoid having the switch drivers parse the DSA device
tree. There is code in the DSA core to do what you want here.

dsa.c: dsa_cpu_dsa_setup() will parse the phy-mode property, if you
have a fixed-phy node in the "cpu" or "dsa" node. It will then call
the drivers adjust_list_() function, which can then set the xMII type
etc.

The problem might be how to make use of this without breaking
backwards compatibility with older device tree blobs.

	  Andrew

^ permalink raw reply

* Re: [RFC 3/4] net-next: dsa: Add support for multiple cpu ports.
From: Florian Fainelli @ 2017-01-04 14:40 UTC (permalink / raw)
  To: Andrew Lunn, John Crispin; +Cc: David S. Miller, Vivien Didelot, netdev
In-Reply-To: <20170104140105.GL10768@lunn.ch>



On 01/04/2017 06:01 AM, Andrew Lunn wrote:
> On Wed, Jan 04, 2017 at 08:38:03AM +0100, John Crispin wrote:
>> From: Andrew Lunn <andrew@lunn.ch>
>>
>> Some boards have two CPU interfaces connected to the switch, e.g. WiFi
>> access points, with 1 port labeled WAN, 4 ports labeled lan1-lan4, and
>> two port connected to the SoC.
>>
>> This patch extends DSA to allows both CPU ports to be used. The "cpu"
>> node in the DSA tree can now have a phandle to the host interface it
>> connects to. Each user port can have a phandle to a cpu port which
>> should be used for traffic between the port and the CPU. Thus simple
>> load sharing over the two CPU ports can be achieved.
>>
>> Signed-off-by: John Crispin <john@phrozen.org>
>> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
>> ---
>>  include/net/dsa.h  |   21 ++++++++++++++++++++-
>>  net/dsa/dsa2.c     |   36 ++++++++++++++++++++++++++++++------
>>  net/dsa/dsa_priv.h |    5 +++++
>>  net/dsa/slave.c    |   27 ++++++++++++++++-----------
>>  4 files changed, 71 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/net/dsa.h b/include/net/dsa.h
>> index b122196..f68180b 100644
>> --- a/include/net/dsa.h
>> +++ b/include/net/dsa.h
>> @@ -60,6 +60,8 @@ struct dsa_chip_data {
>>  	 */
>>  	char		*port_names[DSA_MAX_PORTS];
>>  	struct device_node *port_dn[DSA_MAX_PORTS];
>> +	struct net_device *port_ethernet[DSA_MAX_PORTS];
>> +	int		port_cpu[DSA_MAX_PORTS];
> 
> Hi John
> 
> My proof of concept patches have "bit rotted" a bit. When implementing
> dsa2, i removed the use of dsa_chip_data, aka cd, from all the drivers
> and the new binding does not use it at all. I don't want to add it
> back again. When Florain removes the old binding in 6 months time, i
> expect dsa_chip_data and dsa_platform_data will be removed.

I am actually in the process of cleaning up my patches that add
platform_data support to net/dsa/dsa2.c this time with an user, and
hopefully a couple more after that, but that maintains the existing
struct dsa_chip_data as-is, no new additions are required, and this
would be purely for non-DT enabled platforms anyway.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 2/2] tools: psock_tpacket: verify that packet was received on lo before counting it
From: Sowmini Varadhan @ 2017-01-04 14:44 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: linux-kselftest, Network Development, Daniel Borkmann,
	Willem de Bruijn, David Miller, shuah
In-Reply-To: <CAF=yD-+d-cgE1dxCcrfNZfgVk4_wjYMHHoQ=BN7bnJ-txbM-+Q@mail.gmail.com>

On (01/04/17 09:30), Willem de Bruijn wrote:
> 
> The common and simpler solution to this problem is to open the socket
> with protocol 0 to reject all packets, add the BPF filter and only then bind
> with sll_ifindex set to lo. That way no false positives can arrive.

Yes, I thought of that too (and I've seen that done in one commercial
implementation), but given that tpacket nicely returns the incoming
interface, I figured, why not use the test prog to use this (thus
verifying it, and also showing how to use it)

--Sowmini

^ permalink raw reply

* Re: [PATCH net-next 2/2] tools: psock_tpacket: verify that packet was received on lo before counting it
From: Willem de Bruijn @ 2017-01-04 15:03 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: linux-kselftest, Network Development, Daniel Borkmann,
	Willem de Bruijn, David Miller, shuah
In-Reply-To: <20170104144424.GD9641@oracle.com>

On Wed, Jan 4, 2017 at 9:44 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/04/17 09:30), Willem de Bruijn wrote:
>>
>> The common and simpler solution to this problem is to open the socket
>> with protocol 0 to reject all packets, add the BPF filter and only then bind
>> with sll_ifindex set to lo. That way no false positives can arrive.
>
> Yes, I thought of that too (and I've seen that done in one commercial
> implementation), but given that tpacket nicely returns the incoming
> interface, I figured, why not use the test prog to use this (thus
> verifying it, and also showing how to use it)

This approach is less restrictive. It still allows incorrect packets
to be enqueued in the time between the socket call and attaching the
bpf filter. Also, if packets are restricted to a single packet, using
bind with sll_ifindex is simpler.

^ permalink raw reply

* [PATCH] phy state machine: failsafe leave invalid RUNNING state
From: Zefir Kurtisi @ 2017-01-04 15:04 UTC (permalink / raw)
  To: netdev; +Cc: f.fainelli, andrew

While in RUNNING state, phy_state_machine() checks for link changes by
comparing phydev->link before and after calling phy_read_status().
This works as long as it is guaranteed that phydev->link is never
changed outside the phy_state_machine().

If in some setups this happens, it causes the state machine to miss
a link loss and remain RUNNING despite phydev->link being 0.

This has been observed running a dsa setup with a process continuously
polling the link states over ethtool each second (SNMPD RFC-1213
agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
call phy_read_status() and with that modify the link status - and
with that bricking the phy state machine.

This patch adds a fail-safe check while in RUNNING, which causes to
move to CHANGELINK when the link is gone and we are still RUNNING.

Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
---
 drivers/net/phy/phy.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 28548af..0f9a61e 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -966,6 +966,15 @@ void phy_state_machine(struct work_struct *work)
 			if (old_link != phydev->link)
 				phydev->state = PHY_CHANGELINK;
 		}
+		/*
+		 * Failsafe: check that nobody set phydev->link=0 between two
+		 * poll cycles, otherwise we won't leave RUNNING state as long
+		 * as link remains down.
+		 */
+		if (!phydev->link && phydev->state == PHY_RUNNING) {
+			phydev->state = PHY_CHANGELINK;
+			dev_warn(&phydev->dev, "no link in PHY_RUNNING\n");
+		}
 		break;
 	case PHY_CHANGELINK:
 		err = phy_read_status(phydev);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH] tcp: provide tx timestamps for partial writes
From: Soheil Hassas Yeganeh @ 2017-01-04 15:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Willem de Bruijn, Yuchung Cheng,
	Eric Dumazet, Neal Cardwell, Martin KaFai Lau
In-Reply-To: <1483534533.4337.19.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Jan 4, 2017 at 7:55 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Tue, 2017-01-03 at 10:22 -0500, Soheil Hassas Yeganeh wrote:
> > On Mon, Jan 2, 2017 at 3:23 PM, Soheil Hassas Yeganeh <soheil@google.com> wrote:
> > > On Mon, Jan 2, 2017 at 3:20 PM, Soheil Hassas Yeganeh
> > > <soheil.kdev@gmail.com> wrote:
> > >> From: Soheil Hassas Yeganeh <soheil@google.com>
> > >>
> > >> For TCP sockets, tx timestamps are only captured when the user data
> > >> is successfully and fully written to the socket. In many cases,
> > >> however, TCP writes can be partial for which no timestamp is
> > >> collected.
> > >>
> > >> Collect timestamps when the user data is partially copied into
> > >> the socket.
> > >>
> > >> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> > >> Cc: Willem de Bruijn <willemb@google.com>
> > >> Cc: Yuchung Cheng <ycheng@google.com>
> > >> Cc: Eric Dumazet <edumazet@google.com>
> > >> Cc: Neal Cardwell <ncardwell@google.com>
> > >> Cc: Martin KaFai Lau <kafai@fb.com>
> > >> ---
> > >>  net/ipv4/tcp.c | 8 ++++++--
> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > >> index 2e3807d..c207b16 100644
> > >> --- a/net/ipv4/tcp.c
> > >> +++ b/net/ipv4/tcp.c
> > >> @@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> > >>         return copied;
> > >>
> > >>  do_error:
> > >> -       if (copied)
> > >> +       if (copied) {
> > >> +               tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
> > >>                 goto out;
> > >> +       }
> > >>  out_err:
> > >>         /* make sure we wake any epoll edge trigger waiter */
> > >>         if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
> > >> @@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> > >>         }
> > >>
> > >>  do_error:
> > >> -       if (copied + copied_syn)
> > >> +       if (copied + copied_syn) {
> > >> +               tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
> >
> > Thanks to Willem for noting that this should be sockc.tsflags and not
> > sk->sk_tsflags. I'll send V2 to fix.
>
> Also, why not factorizing a bit and have a single point calling
> tcp_tx_timestamp() ?
>
> This would ease code review quite a bit.

Thanks Eric! will do in V2.

> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4a044964da6670829e5c47fef52d2cd76360b59f..11357f3bd1f82fa29129dd3ecf4d270feb4a6b1d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -958,10 +958,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>                 copied += copy;
>                 offset += copy;
>                 size -= copy;
> -               if (!size) {
> -                       tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
> +               if (!size)
>                         goto out;
> -               }
>
>                 if (skb->len < size_goal || (flags & MSG_OOB))
>                         continue;
> @@ -987,8 +985,11 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>         }
>
>  out:
> -       if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> -               tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
> +       if (copied) {
> +               tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
> +               if (!(flags & MSG_SENDPAGE_NOTLAST))
> +                       tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
> +       }
>         return copied;
>
>  do_error:
> @@ -1281,7 +1282,6 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>
>                 copied += copy;
>                 if (!msg_data_left(msg)) {
> -                       tcp_tx_timestamp(sk, sockc.tsflags, skb);
>                         if (unlikely(flags & MSG_EOR))
>                                 TCP_SKB_CB(skb)->eor = 1;
>                         goto out;
> @@ -1312,8 +1312,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>         }
>
>  out:
> -       if (copied)
> +       if (copied) {
> +               tcp_tx_timestamp(sk, sockc.tsflags, tcp_write_queue_tail(sk));
>                 tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
> +       }
>  out_nopush:
>         release_sock(sk);
>         return copied + copied_syn;
>
>
>

^ permalink raw reply

* [PATCH net] sfc: don't report RX hash keys to ethtool when RSS wasn't enabled
From: Edward Cree @ 2017-01-04 15:10 UTC (permalink / raw)
  To: linux-net-drivers, davem; +Cc: bkenward, netdev

If we failed to set up RSS on EF10 (e.g. because firmware declared
 RX_RSS_LIMITED), ethtool --show-nfc $dev rx-flow-hash ... should report
 no fields, rather than confusingly reporting what fields we _would_ be
 hashing on if RSS was working.

Fixes: dcb4123cbec0 ("sfc: disable RSS when unsupported")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c       | 3 ++-
 drivers/net/ethernet/sfc/ethtool.c    | 2 ++
 drivers/net/ethernet/sfc/net_driver.h | 2 ++
 drivers/net/ethernet/sfc/siena.c      | 1 +
 4 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index de2947c..5eb0e68 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -1323,7 +1323,8 @@ static int efx_ef10_init_nic(struct efx_nic *efx)
 	}
 
 	/* don't fail init if RSS setup doesn't work */
-	efx->type->rx_push_rss_config(efx, false, efx->rx_indir_table);
+	rc = efx->type->rx_push_rss_config(efx, false, efx->rx_indir_table);
+	efx->rss_active = (rc == 0);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index 87bdc56..18ebaea 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -975,6 +975,8 @@ efx_ethtool_get_rxnfc(struct net_device *net_dev,
 
 	case ETHTOOL_GRXFH: {
 		info->data = 0;
+		if (!efx->rss_active) /* No RSS */
+			return 0;
 		switch (info->flow_type) {
 		case UDP_V4_FLOW:
 			if (efx->rx_hash_udp_4tuple)
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 1a635ce..1c62c1a 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -860,6 +860,7 @@ struct vfdi_status;
  * @rx_hash_key: Toeplitz hash key for RSS
  * @rx_indir_table: Indirection table for RSS
  * @rx_scatter: Scatter mode enabled for receives
+ * @rss_active: RSS enabled on hardware
  * @rx_hash_udp_4tuple: UDP 4-tuple hashing enabled
  * @int_error_count: Number of internal errors seen recently
  * @int_error_expire: Time at which error count will be expired
@@ -998,6 +999,7 @@ struct efx_nic {
 	u8 rx_hash_key[40];
 	u32 rx_indir_table[128];
 	bool rx_scatter;
+	bool rss_active;
 	bool rx_hash_udp_4tuple;
 
 	unsigned int_error_count;
diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
index a3901bc..4e54e5d 100644
--- a/drivers/net/ethernet/sfc/siena.c
+++ b/drivers/net/ethernet/sfc/siena.c
@@ -403,6 +403,7 @@ static int siena_init_nic(struct efx_nic *efx)
 	efx_writeo(efx, &temp, FR_AZ_RX_CFG);
 
 	siena_rx_push_rss_config(efx, false, efx->rx_indir_table);
+	efx->rss_active = true;
 
 	/* Enable event logging */
 	rc = efx_mcdi_log_ctrl(efx, true, false, 0);

^ permalink raw reply related

* Re: [PATCH] phy state machine: failsafe leave invalid RUNNING state
From: Florian Fainelli @ 2017-01-04 15:13 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew
In-Reply-To: <1483542298-9747-1-git-send-email-zefir.kurtisi@neratec.com>



On 01/04/2017 07:04 AM, Zefir Kurtisi wrote:
> While in RUNNING state, phy_state_machine() checks for link changes by
> comparing phydev->link before and after calling phy_read_status().
> This works as long as it is guaranteed that phydev->link is never
> changed outside the phy_state_machine().
> 
> If in some setups this happens, it causes the state machine to miss
> a link loss and remain RUNNING despite phydev->link being 0.
> 
> This has been observed running a dsa setup with a process continuously
> polling the link states over ethtool each second (SNMPD RFC-1213
> agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
> causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
> call phy_read_status() and with that modify the link status - and
> with that bricking the phy state machine.

That's the interesting part of the analysis, how does this brick the PHY
state machine? Is the PHY driver changing the link status in the
read_status callback that it implements?

> 
> This patch adds a fail-safe check while in RUNNING, which causes to
> move to CHANGELINK when the link is gone and we are still RUNNING.
> 
> Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
> ---
>  drivers/net/phy/phy.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 28548af..0f9a61e 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -966,6 +966,15 @@ void phy_state_machine(struct work_struct *work)
>  			if (old_link != phydev->link)
>  				phydev->state = PHY_CHANGELINK;
>  		}
> +		/*
> +		 * Failsafe: check that nobody set phydev->link=0 between two
> +		 * poll cycles, otherwise we won't leave RUNNING state as long
> +		 * as link remains down.
> +		 */
> +		if (!phydev->link && phydev->state == PHY_RUNNING) {
> +			phydev->state = PHY_CHANGELINK;
> +			dev_warn(&phydev->dev, "no link in PHY_RUNNING\n");
> +		}
>  		break;
>  	case PHY_CHANGELINK:
>  		err = phy_read_status(phydev);
> 

-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 2/2] tools: psock_tpacket: verify that packet was received on lo before counting it
From: Sowmini Varadhan @ 2017-01-04 15:13 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: linux-kselftest, Network Development, Daniel Borkmann,
	Willem de Bruijn, David Miller, shuah
In-Reply-To: <CAF=yD-KUw8M+RGnN2eXLVgv836F6mxd4+b0h=2CP6X7mv+ScWg@mail.gmail.com>

On (01/04/17 10:03), Willem de Bruijn wrote:
> 
> This approach is less restrictive. It still allows incorrect packets
> to be enqueued in the time between the socket call and attaching the
> bpf filter. Also, if packets are restricted to a single packet, using
> bind with sll_ifindex is simpler.

Do you want me to change this to first set up pfsocket() with
proto 0, then set up filter, and then bind_ring() to the desired
ifindex with ETH_P_ALL?

I can spin out v2 (and if I have to that, I can also fix the
comments) if you feel strongly about it.

--Sowmini

^ permalink raw reply

* Re: [RFC 3/4] net-next: dsa: Add support for multiple cpu ports.
From: Andrew Lunn @ 2017-01-04 15:22 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: John Crispin, David S. Miller, Vivien Didelot, netdev
In-Reply-To: <5a5c86c6-7a19-8720-7772-3d2d5b2633c5@gmail.com>

> I am actually in the process of cleaning up my patches that add
> platform_data support to net/dsa/dsa2.c this time with an user, and
> hopefully a couple more after that, but that maintains the existing
> struct dsa_chip_data as-is, no new additions are required, and this
> would be purely for non-DT enabled platforms anyway.

Hi Florian

I was hoping we could avoid this complexity. But if there is a real
need, the platform cannot be converted to device tree, we can add it.

	Andrew

^ permalink raw reply

* Re: [PATCH] phy state machine: failsafe leave invalid RUNNING state
From: Florian Fainelli @ 2017-01-04 15:30 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew
In-Reply-To: <82ffbb43-9345-c47d-596c-73c175ac7e7f@neratec.com>



On 01/04/2017 07:27 AM, Zefir Kurtisi wrote:
> On 01/04/2017 04:13 PM, Florian Fainelli wrote:
>>
>>
>> On 01/04/2017 07:04 AM, Zefir Kurtisi wrote:
>>> While in RUNNING state, phy_state_machine() checks for link changes by
>>> comparing phydev->link before and after calling phy_read_status().
>>> This works as long as it is guaranteed that phydev->link is never
>>> changed outside the phy_state_machine().
>>>
>>> If in some setups this happens, it causes the state machine to miss
>>> a link loss and remain RUNNING despite phydev->link being 0.
>>>
>>> This has been observed running a dsa setup with a process continuously
>>> polling the link states over ethtool each second (SNMPD RFC-1213
>>> agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
>>> causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
>>> call phy_read_status() and with that modify the link status - and
>>> with that bricking the phy state machine.
>>
>> That's the interesting part of the analysis, how does this brick the PHY
>> state machine? Is the PHY driver changing the link status in the
>> read_status callback that it implements?
>>
> phydev->read_status points to genphy_read_status(), where the first call goes to
> genphy_update_link() which updates the link status.
> 
> Thereafter phy_state_machine():RUNNING won't be able to detect the link loss
> anymore unless the link state changes again.
> 
> 
> I was trying to figure out if there is a rule that forbids changing phydev->link
> from outside the state machine, but found several places where it happens (either
> directly, or over genphy_read_status() or over genphy_update_link()).
> 
> Curious how this did not show up before, since within the dsa setup it is very
> easy to trigger:
> a) physically disconnect link
> b) within one second run ethtool ethX

You need to be more specific here about what "the dsa setup" is, drivers
involved, which ports of the switch you are seeing this with (user
facing, CPU port, DSA port?) etc.
-- 
Florian

^ permalink raw reply

* [PATCH nf-next 0/4] netfilter: skbuff: merge nfctinfo bits and nfct pointer
From: Florian Westphal @ 2017-01-04 15:35 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev

[ CC netdev, touches sk_buff in patch #4.
  Does anyone know if there are arches where alignof(u64) < 8?
  If so, this series would require more work ]

Whenever we fetch skb conntrack info, we need to access two
distinct cache lines in sk_buff, #2 (nfct pointer) and #3
(nfctinfo bits).  This series removes nfctinfo and joins it
with the data pointer in a single ulong.

We have 3 nfctinfo bits, the slab cache used for nf_conn objects
guarantees at least 8 byte alignment so there is no overlap.

For the conntrack templates the situaton isn't obvious to me,
these get allocated via kmalloc which guarantees ARCH_KMALLOC_MINALIGN
(alignof(unsigned long long) so that begs the question if that is >= 8
on all arches or not.  I added a BUILD_BUG_ON test to catch
ARCH_KMALLOC_MINALIGN < 8, just in case.

If that triggers we'd need to align by hand in nf_ct_tmpl_alloc()
and store the padding in the conntrack somewhere.

But as its ugly I did not do this.

A followup series to this one will resurrect an old patch from
Pablo that adds an 'untracked' ctinfo status, this then allows
to get rid of the conntrack template object (which in turn avoids
get/put atomic ops for untracked skbs).

 include/linux/skbuff.h                         |   30 +++++++++++--------
 include/net/ip_vs.h                            |   11 ++++---
 include/net/netfilter/nf_conntrack.h           |   10 ++++--
 include/net/netfilter/nf_conntrack_core.h      |    2 -
 include/net/netfilter/nf_conntrack_l4proto.h   |    2 -
 net/core/skbuff.c                              |    2 -
 net/ipv4/netfilter/ipt_SYNPROXY.c              |    7 +---
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |   16 +++++-----
 net/ipv4/netfilter/nf_defrag_ipv4.c            |    4 +-
 net/ipv4/netfilter/nf_dup_ipv4.c               |    9 +++--
 net/ipv6/netfilter/ip6t_SYNPROXY.c             |    7 +---
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |   22 +++++++-------
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c      |    4 +-
 net/ipv6/netfilter/nf_dup_ipv6.c               |   10 +++---
 net/netfilter/core.c                           |    2 -
 net/netfilter/nf_conntrack_core.c              |   38 ++++++++++++-------------
 net/netfilter/nf_conntrack_proto_dccp.c        |    1 
 net/netfilter/nf_conntrack_proto_tcp.c         |    1 
 net/netfilter/nf_conntrack_proto_udp.c         |    3 -
 net/netfilter/nf_conntrack_standalone.c        |    3 +
 net/netfilter/nf_nat_helper.c                  |    2 -
 net/netfilter/nft_ct.c                         |    3 -
 net/netfilter/xt_CT.c                          |   13 +++-----
 net/openvswitch/conntrack.c                    |   22 ++++++--------
 net/sched/cls_flow.c                           |    2 -
 25 files changed, 118 insertions(+), 108 deletions(-)

^ permalink raw reply

* [PATCH nf-next 1/4] netfilter: conntrack: no need to pass ctinfo to error handler
From: Florian Westphal @ 2017-01-04 15:35 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1483544150-10686-1-git-send-email-fw@strlen.de>

It is never accessed for reading and the only places that write to it
are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo).

The conntrack core specifically checks for attached skb->nfct after
->error() invocation and returns early in this case.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack_l4proto.h   |  2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   | 12 ++++++------
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 12 ++++++------
 net/netfilter/nf_conntrack_core.c              |  3 +--
 net/netfilter/nf_conntrack_proto_dccp.c        |  1 -
 net/netfilter/nf_conntrack_proto_tcp.c         |  1 -
 net/netfilter/nf_conntrack_proto_udp.c         |  3 +--
 7 files changed, 15 insertions(+), 19 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
index e7b836590f0b..85e993e278d5 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ b/include/net/netfilter/nf_conntrack_l4proto.h
@@ -55,7 +55,7 @@ struct nf_conntrack_l4proto {
 	void (*destroy)(struct nf_conn *ct);
 
 	int (*error)(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
-		     unsigned int dataoff, enum ip_conntrack_info *ctinfo,
+		     unsigned int dataoff,
 		     u_int8_t pf, unsigned int hooknum);
 
 	/* Print out the per-protocol part of the tuple. Return like seq_* */
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index d075b3cf2400..2e95ece3b257 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -128,13 +128,13 @@ static bool icmp_new(struct nf_conn *ct, const struct sk_buff *skb,
 /* Returns conntrack if it dealt with ICMP, and filled in skb fields */
 static int
 icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
-		 enum ip_conntrack_info *ctinfo,
 		 unsigned int hooknum)
 {
 	struct nf_conntrack_tuple innertuple, origtuple;
 	const struct nf_conntrack_l4proto *innerproto;
 	const struct nf_conntrack_tuple_hash *h;
 	const struct nf_conntrack_zone *zone;
+	enum ip_conntrack_info ctinfo;
 	struct nf_conntrack_zone tmp;
 
 	NF_CT_ASSERT(skb->nfct == NULL);
@@ -160,7 +160,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 		return -NF_ACCEPT;
 	}
 
-	*ctinfo = IP_CT_RELATED;
+	ctinfo = IP_CT_RELATED;
 
 	h = nf_conntrack_find_get(net, zone, &innertuple);
 	if (!h) {
@@ -169,11 +169,11 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 	}
 
 	if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY)
-		*ctinfo += IP_CT_IS_REPLY;
+		ctinfo += IP_CT_IS_REPLY;
 
 	/* Update skb to refer to this connection */
 	skb->nfct = &nf_ct_tuplehash_to_ctrack(h)->ct_general;
-	skb->nfctinfo = *ctinfo;
+	skb->nfctinfo = ctinfo;
 	return NF_ACCEPT;
 }
 
@@ -181,7 +181,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 static int
 icmp_error(struct net *net, struct nf_conn *tmpl,
 	   struct sk_buff *skb, unsigned int dataoff,
-	   enum ip_conntrack_info *ctinfo, u_int8_t pf, unsigned int hooknum)
+	   u_int8_t pf, unsigned int hooknum)
 {
 	const struct icmphdr *icmph;
 	struct icmphdr _ih;
@@ -225,7 +225,7 @@ icmp_error(struct net *net, struct nf_conn *tmpl,
 	    icmph->type != ICMP_REDIRECT)
 		return NF_ACCEPT;
 
-	return icmp_error_message(net, tmpl, skb, ctinfo, hooknum);
+	return icmp_error_message(net, tmpl, skb, hooknum);
 }
 
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index f5a61bc3ec2b..56f4eebb5efe 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -145,12 +145,12 @@ static int
 icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
 		     struct sk_buff *skb,
 		     unsigned int icmp6off,
-		     enum ip_conntrack_info *ctinfo,
 		     unsigned int hooknum)
 {
 	struct nf_conntrack_tuple intuple, origtuple;
 	const struct nf_conntrack_tuple_hash *h;
 	const struct nf_conntrack_l4proto *inproto;
+	enum ip_conntrack_info ctinfo;
 	struct nf_conntrack_zone tmp;
 
 	NF_CT_ASSERT(skb->nfct == NULL);
@@ -176,7 +176,7 @@ icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
 		return -NF_ACCEPT;
 	}
 
-	*ctinfo = IP_CT_RELATED;
+	ctinfo = IP_CT_RELATED;
 
 	h = nf_conntrack_find_get(net, nf_ct_zone_tmpl(tmpl, skb, &tmp),
 				  &intuple);
@@ -185,19 +185,19 @@ icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
 		return -NF_ACCEPT;
 	} else {
 		if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY)
-			*ctinfo += IP_CT_IS_REPLY;
+			ctinfo += IP_CT_IS_REPLY;
 	}
 
 	/* Update skb to refer to this connection */
 	skb->nfct = &nf_ct_tuplehash_to_ctrack(h)->ct_general;
-	skb->nfctinfo = *ctinfo;
+	skb->nfctinfo = ctinfo;
 	return NF_ACCEPT;
 }
 
 static int
 icmpv6_error(struct net *net, struct nf_conn *tmpl,
 	     struct sk_buff *skb, unsigned int dataoff,
-	     enum ip_conntrack_info *ctinfo, u_int8_t pf, unsigned int hooknum)
+	     u_int8_t pf, unsigned int hooknum)
 {
 	const struct icmp6hdr *icmp6h;
 	struct icmp6hdr _ih;
@@ -232,7 +232,7 @@ icmpv6_error(struct net *net, struct nf_conn *tmpl,
 	if (icmp6h->icmp6_type >= 128)
 		return NF_ACCEPT;
 
-	return icmpv6_error_message(net, tmpl, skb, dataoff, ctinfo, hooknum);
+	return icmpv6_error_message(net, tmpl, skb, dataoff, hooknum);
 }
 
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 3a073cd9fcf4..86186a2e2715 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1326,8 +1326,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 	 * inverse of the return code tells to the netfilter
 	 * core what to do with the packet. */
 	if (l4proto->error != NULL) {
-		ret = l4proto->error(net, tmpl, skb, dataoff, &ctinfo,
-				     pf, hooknum);
+		ret = l4proto->error(net, tmpl, skb, dataoff, pf, hooknum);
 		if (ret <= 0) {
 			NF_CT_STAT_INC_ATOMIC(net, error);
 			NF_CT_STAT_INC_ATOMIC(net, invalid);
diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c
index b68ce6ac13b3..93dd1c5b7bff 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -561,7 +561,6 @@ static int dccp_packet(struct nf_conn *ct, const struct sk_buff *skb,
 
 static int dccp_error(struct net *net, struct nf_conn *tmpl,
 		      struct sk_buff *skb, unsigned int dataoff,
-		      enum ip_conntrack_info *ctinfo,
 		      u_int8_t pf, unsigned int hooknum)
 {
 	struct dccp_hdr _dh, *dh;
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 69f687740c76..b122e9dacfed 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -750,7 +750,6 @@ static const u8 tcp_valid_flags[(TCPHDR_FIN|TCPHDR_SYN|TCPHDR_RST|TCPHDR_ACK|
 static int tcp_error(struct net *net, struct nf_conn *tmpl,
 		     struct sk_buff *skb,
 		     unsigned int dataoff,
-		     enum ip_conntrack_info *ctinfo,
 		     u_int8_t pf,
 		     unsigned int hooknum)
 {
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
index ae63944c9dc4..f6ebce6178ca 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -112,7 +112,6 @@ static bool udp_new(struct nf_conn *ct, const struct sk_buff *skb,
 static int udplite_error(struct net *net, struct nf_conn *tmpl,
 			 struct sk_buff *skb,
 			 unsigned int dataoff,
-			 enum ip_conntrack_info *ctinfo,
 			 u8 pf, unsigned int hooknum)
 {
 	unsigned int udplen = skb->len - dataoff;
@@ -162,7 +161,7 @@ static int udplite_error(struct net *net, struct nf_conn *tmpl,
 #endif
 
 static int udp_error(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
-		     unsigned int dataoff, enum ip_conntrack_info *ctinfo,
+		     unsigned int dataoff,
 		     u_int8_t pf,
 		     unsigned int hooknum)
 {
-- 
2.7.3

^ permalink raw reply related

* Re: [PATCH] phy state machine: failsafe leave invalid RUNNING state
From: Zefir Kurtisi @ 2017-01-04 15:27 UTC (permalink / raw)
  To: Florian Fainelli, netdev; +Cc: andrew
In-Reply-To: <6845805d-4dae-0a3a-c56c-6feb86f4b553@gmail.com>

On 01/04/2017 04:13 PM, Florian Fainelli wrote:
> 
> 
> On 01/04/2017 07:04 AM, Zefir Kurtisi wrote:
>> While in RUNNING state, phy_state_machine() checks for link changes by
>> comparing phydev->link before and after calling phy_read_status().
>> This works as long as it is guaranteed that phydev->link is never
>> changed outside the phy_state_machine().
>>
>> If in some setups this happens, it causes the state machine to miss
>> a link loss and remain RUNNING despite phydev->link being 0.
>>
>> This has been observed running a dsa setup with a process continuously
>> polling the link states over ethtool each second (SNMPD RFC-1213
>> agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
>> causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
>> call phy_read_status() and with that modify the link status - and
>> with that bricking the phy state machine.
> 
> That's the interesting part of the analysis, how does this brick the PHY
> state machine? Is the PHY driver changing the link status in the
> read_status callback that it implements?
> 
phydev->read_status points to genphy_read_status(), where the first call goes to
genphy_update_link() which updates the link status.

Thereafter phy_state_machine():RUNNING won't be able to detect the link loss
anymore unless the link state changes again.


I was trying to figure out if there is a rule that forbids changing phydev->link
from outside the state machine, but found several places where it happens (either
directly, or over genphy_read_status() or over genphy_update_link()).

Curious how this did not show up before, since within the dsa setup it is very
easy to trigger:
a) physically disconnect link
b) within one second run ethtool ethX


Cheers,
Zefir

^ permalink raw reply

* [PATCH nf-next 4/4] netfilter: merge ctinfo into nfct pointer storage area
From: Florian Westphal @ 2017-01-04 15:35 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1483544150-10686-1-git-send-email-fw@strlen.de>

Merge conntrack related status bits into skb->_nfct.
After this change conntrack operations (lookup, creation, matching from
ruleset) only accesses one instead of two sk_buff cache lines.

This works for normal conntracks because we use a slab cache that
guarantees hw cacheline or 8byte alignemnt (whatever is larger) so
the 3 bits needed for ctinfo won't overlap with nf_conn addresses.

For the conntrack templates we need an 8 byte kmalloc minalign, in case
we have arches where this isn't true a build-bug on test will catch this.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h                         | 30 +++++++++++++++-----------
 include/net/ip_vs.h                            |  4 ++--
 include/net/netfilter/nf_conntrack.h           | 10 ++++++---
 include/net/netfilter/nf_conntrack_core.h      |  2 +-
 net/core/skbuff.c                              |  2 +-
 net/ipv4/netfilter/ipt_SYNPROXY.c              |  7 +++---
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |  6 +++---
 net/ipv4/netfilter/nf_defrag_ipv4.c            |  4 ++--
 net/ipv4/netfilter/nf_dup_ipv4.c               |  7 +++---
 net/ipv6/netfilter/ip6t_SYNPROXY.c             |  7 +++---
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 12 +++++------
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c      |  4 ++--
 net/ipv6/netfilter/nf_dup_ipv6.c               |  8 ++++---
 net/netfilter/core.c                           |  2 +-
 net/netfilter/nf_conntrack_core.c              | 22 +++++++++----------
 net/netfilter/nf_conntrack_standalone.c        |  4 ++++
 net/netfilter/nf_nat_helper.c                  |  2 +-
 net/netfilter/nft_ct.c                         |  3 +--
 net/netfilter/xt_CT.c                          | 13 ++++++-----
 net/openvswitch/conntrack.c                    | 22 +++++++++----------
 net/sched/cls_flow.c                           |  2 +-
 21 files changed, 91 insertions(+), 82 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b53c0cfd417e..fee935fea6f9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -585,7 +585,6 @@ static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
  *	@cloned: Head may be cloned (check refcnt to be sure)
  *	@ip_summed: Driver fed us an IP checksum
  *	@nohdr: Payload reference only, must not modify header
- *	@nfctinfo: Relationship of this skb to the connection
  *	@pkt_type: Packet class
  *	@fclone: skbuff clone status
  *	@ipvs_property: skbuff is owned by ipvs
@@ -594,7 +593,7 @@ static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
  *	@nf_trace: netfilter packet trace flag
  *	@protocol: Packet protocol from driver
  *	@destructor: Destruct function
- *	@nfct: Associated connection, if any
+ *	@_nfct: Associated connection, if any (with nfctinfo bits)
  *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
  *	@skb_iif: ifindex of device we arrived on
  *	@tc_index: Traffic control index
@@ -668,7 +667,7 @@ struct sk_buff {
 	struct	sec_path	*sp;
 #endif
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-	struct nf_conntrack	*nfct;
+	unsigned long		 _nfct;
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	struct nf_bridge_info	*nf_bridge;
@@ -721,7 +720,6 @@ struct sk_buff {
 	__u8			pkt_type:3;
 	__u8			pfmemalloc:1;
 	__u8			ignore_df:1;
-	__u8			nfctinfo:3;
 
 	__u8			nf_trace:1;
 	__u8			ip_summed:2;
@@ -836,6 +834,7 @@ static inline bool skb_pfmemalloc(const struct sk_buff *skb)
 #define SKB_DST_NOREF	1UL
 #define SKB_DST_PTRMASK	~(SKB_DST_NOREF)
 
+#define SKB_NFCT_PTRMASK	~(7UL)
 /**
  * skb_dst - returns skb dst_entry
  * @skb: buffer
@@ -3555,6 +3554,15 @@ static inline void skb_remcsum_process(struct sk_buff *skb, void *ptr,
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 void nf_conntrack_destroy(struct nf_conntrack *nfct);
+static inline struct nf_conntrack *skb_nfct(const struct sk_buff *skb)
+{
+	struct nf_conntrack *nfct;
+
+	nfct = (void *) (skb->_nfct & SKB_NFCT_PTRMASK);
+
+	return nfct;
+}
+
 static inline void nf_conntrack_put(struct nf_conntrack *nfct)
 {
 	if (nfct && atomic_dec_and_test(&nfct->use))
@@ -3581,8 +3589,8 @@ static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
 static inline void nf_reset(struct sk_buff *skb)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-	nf_conntrack_put(skb->nfct);
-	skb->nfct = NULL;
+	nf_conntrack_put(skb_nfct(skb));
+	skb->_nfct = 0;
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	nf_bridge_put(skb->nf_bridge);
@@ -3602,10 +3610,8 @@ static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 			     bool copy)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-	dst->nfct = src->nfct;
-	nf_conntrack_get(src->nfct);
-	if (copy)
-		dst->nfctinfo = src->nfctinfo;
+	dst->_nfct = src->_nfct;
+	nf_conntrack_get(skb_nfct(src));
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	dst->nf_bridge  = src->nf_bridge;
@@ -3620,7 +3626,7 @@ static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 static inline void nf_copy(struct sk_buff *dst, const struct sk_buff *src)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-	nf_conntrack_put(dst->nfct);
+	nf_conntrack_put(skb_nfct(dst));
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	nf_bridge_put(dst->nf_bridge);
@@ -3653,7 +3659,7 @@ static inline bool skb_irq_freeable(const struct sk_buff *skb)
 		!skb->sp &&
 #endif
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-		!skb->nfct &&
+		!skb->_nfct &&
 #endif
 		!skb->_skb_refdst &&
 		!skb_has_frag_list(skb);
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 2a344ebd7ebe..cef639dd2d6e 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1559,8 +1559,8 @@ static inline void ip_vs_notrack(struct sk_buff *skb)
 		nf_conntrack_put(&ct->ct_general);
 		untracked = nf_ct_untracked_get();
 		nf_conntrack_get(&untracked->ct_general);
-		skb->nfct = &untracked->ct_general;
-		skb->nfctinfo = IP_CT_NEW;
+		skb->_nfct = (unsigned long)
+			&nf_ct_untracked_get()->ct_general | IP_CT_NEW;
 	}
 #endif
 }
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 5916aa9ab3f0..ccc18981f46b 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -75,7 +75,7 @@ struct nf_conn {
 	/* Usage count in here is 1 for hash table, 1 per skb,
 	 * plus 1 for any connection(s) we are `master' for
 	 *
-	 * Hint, SKB address this struct and refcnt via skb->nfct and
+	 * Hint, SKB address this struct and refcnt via skb->_nfct and
 	 * helpers nf_conntrack_get() and nf_conntrack_put().
 	 * Helper nf_ct_put() equals nf_conntrack_put() by dec refcnt,
 	 * beware nf_ct_get() is different and don't inc refcnt.
@@ -162,12 +162,16 @@ void nf_conntrack_alter_reply(struct nf_conn *ct,
 int nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 			     const struct nf_conn *ignored_conntrack);
 
+#define NFCT_INFOMASK	7UL
+#define NFCT_PTRMASK	~(NFCT_INFOMASK)
+
 /* Return conntrack_info and tuple hash for given skb. */
 static inline struct nf_conn *
 nf_ct_get(const struct sk_buff *skb, enum ip_conntrack_info *ctinfo)
 {
-	*ctinfo = skb->nfctinfo;
-	return (struct nf_conn *)skb->nfct;
+	*ctinfo = skb->_nfct & NFCT_INFOMASK;
+
+	return (struct nf_conn *)(skb->_nfct & NFCT_PTRMASK);
 }
 
 /* decrement reference count on a conntrack */
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 62e17d1319ff..10223ca4c96e 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -62,7 +62,7 @@ int __nf_conntrack_confirm(struct sk_buff *skb);
 /* Confirm a connection: returns NF_DROP if packet must be dropped. */
 static inline int nf_conntrack_confirm(struct sk_buff *skb)
 {
-	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
+	struct nf_conn *ct = (struct nf_conn *)(skb->_nfct & NFCT_PTRMASK);
 	int ret = NF_ACCEPT;
 
 	if (ct && !nf_ct_is_untracked(ct)) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5a03730fbc1a..cac3ebfb4b45 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -655,7 +655,7 @@ static void skb_release_head_state(struct sk_buff *skb)
 		skb->destructor(skb);
 	}
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-	nf_conntrack_put(skb->nfct);
+	nf_conntrack_put(skb_nfct(skb));
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	nf_bridge_put(skb->nf_bridge);
diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index 30c0de53e254..5decabf56538 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -57,8 +57,7 @@ synproxy_send_tcp(struct net *net,
 		goto free_nskb;
 
 	if (nfct) {
-		nskb->nfct = nfct;
-		nskb->nfctinfo = ctinfo;
+		nskb->_nfct = (unsigned long) nfct | ctinfo;
 		nf_conntrack_get(nfct);
 	}
 
@@ -107,7 +106,7 @@ synproxy_send_client_synack(struct net *net,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(net, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+	synproxy_send_tcp(net, skb, nskb, skb_nfct(skb), IP_CT_ESTABLISHED_REPLY,
 			  niph, nth, tcp_hdr_size);
 }
 
@@ -230,7 +229,7 @@ synproxy_send_client_ack(struct net *net,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(net, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+	synproxy_send_tcp(net, skb, nskb, skb_nfct(skb), IP_CT_ESTABLISHED_REPLY,
 			  niph, nth, tcp_hdr_size);
 }
 
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index 2e95ece3b257..e8aa492a13b0 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -137,7 +137,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 	enum ip_conntrack_info ctinfo;
 	struct nf_conntrack_zone tmp;
 
-	NF_CT_ASSERT(skb->nfct == NULL);
+	NF_CT_ASSERT(skb->_nfct == 0);
 	zone = nf_ct_zone_tmpl(tmpl, skb, &tmp);
 
 	/* Are they talking about one of our connections? */
@@ -172,8 +172,8 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 		ctinfo += IP_CT_IS_REPLY;
 
 	/* Update skb to refer to this connection */
-	skb->nfct = &nf_ct_tuplehash_to_ctrack(h)->ct_general;
-	skb->nfctinfo = ctinfo;
+	skb->_nfct = (unsigned long)
+		&nf_ct_tuplehash_to_ctrack(h)->ct_general | ctinfo;
 	return NF_ACCEPT;
 }
 
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 49bd6a54404f..441584b92a9e 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -45,7 +45,7 @@ static enum ip_defrag_users nf_ct_defrag_user(unsigned int hooknum,
 {
 	u16 zone_id = NF_CT_DEFAULT_ZONE_ID;
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-	if (skb->nfct) {
+	if (skb->_nfct) {
 		enum ip_conntrack_info ctinfo;
 		const struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
 
@@ -75,7 +75,7 @@ static unsigned int ipv4_conntrack_defrag(void *priv,
 #if !IS_ENABLED(CONFIG_NF_NAT)
 	/* Previously seen (loopback)?  Ignore.  Do this before
 	   fragment check. */
-	if (skb->nfct && !nf_ct_is_template((struct nf_conn *)skb->nfct))
+	if (skb->_nfct && !nf_ct_is_template((struct nf_conn *) skb_nfct(skb)));
 		return NF_ACCEPT;
 #endif
 #endif
diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index a981ef7151ca..71fd5c4e5376 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -53,6 +53,7 @@ static bool nf_dup_ipv4_route(struct net *net, struct sk_buff *skb,
 void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 		 const struct in_addr *gw, int oif)
 {
+	struct nf_conn *untracked;
 	struct iphdr *iph;
 
 	if (this_cpu_read(nf_skb_duplicated))
@@ -68,10 +69,10 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 	/* Avoid counting cloned packets towards the original connection. */
+	untracked = nf_ct_untracked_get();
 	nf_reset(skb);
-	skb->nfct     = &nf_ct_untracked_get()->ct_general;
-	skb->nfctinfo = IP_CT_NEW;
-	nf_conntrack_get(skb->nfct);
+	nf_conntrack_get(&untracked->ct_general);
+	skb->_nfct = (unsigned long) untracked | IP_CT_NEW;
 #endif
 	/*
 	 * If we are in PREROUTING/INPUT, decrease the TTL to mitigate potential
diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c b/net/ipv6/netfilter/ip6t_SYNPROXY.c
index 98c8dd38575a..7412be3f10ce 100644
--- a/net/ipv6/netfilter/ip6t_SYNPROXY.c
+++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c
@@ -71,8 +71,7 @@ synproxy_send_tcp(struct net *net,
 	skb_dst_set(nskb, dst);
 
 	if (nfct) {
-		nskb->nfct = nfct;
-		nskb->nfctinfo = ctinfo;
+		nskb->_nfct = (unsigned long) nfct | ctinfo;
 		nf_conntrack_get(nfct);
 	}
 
@@ -121,7 +120,7 @@ synproxy_send_client_synack(struct net *net,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(net, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+	synproxy_send_tcp(net, skb, nskb, skb_nfct(skb), IP_CT_ESTABLISHED_REPLY,
 			  niph, nth, tcp_hdr_size);
 }
 
@@ -244,7 +243,7 @@ synproxy_send_client_ack(struct net *net,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(net, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+	synproxy_send_tcp(net, skb, nskb, skb_nfct(skb), IP_CT_ESTABLISHED_REPLY,
 			  niph, nth, tcp_hdr_size);
 }
 
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 56f4eebb5efe..b6f97a4e6dbd 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -153,7 +153,7 @@ icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
 	enum ip_conntrack_info ctinfo;
 	struct nf_conntrack_zone tmp;
 
-	NF_CT_ASSERT(skb->nfct == NULL);
+	NF_CT_ASSERT(skb->_nfct == 0);
 
 	/* Are they talking about one of our connections? */
 	if (!nf_ct_get_tuplepr(skb,
@@ -189,8 +189,8 @@ icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
 	}
 
 	/* Update skb to refer to this connection */
-	skb->nfct = &nf_ct_tuplehash_to_ctrack(h)->ct_general;
-	skb->nfctinfo = ctinfo;
+	skb->_nfct = (unsigned long)
+		&nf_ct_tuplehash_to_ctrack(h)->ct_general | ctinfo;
 	return NF_ACCEPT;
 }
 
@@ -222,9 +222,9 @@ icmpv6_error(struct net *net, struct nf_conn *tmpl,
 	type = icmp6h->icmp6_type - 130;
 	if (type >= 0 && type < sizeof(noct_valid_new) &&
 	    noct_valid_new[type]) {
-		skb->nfct = &nf_ct_untracked_get()->ct_general;
-		skb->nfctinfo = IP_CT_NEW;
-		nf_conntrack_get(skb->nfct);
+		skb->_nfct = (unsigned long)
+			&nf_ct_untracked_get()->ct_general | IP_CT_NEW;
+		nf_conntrack_get(skb_nfct(skb));
 		return NF_ACCEPT;
 	}
 
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index 8e0bdd058787..e6c74e136f7a 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -37,7 +37,7 @@ static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum,
 {
 	u16 zone_id = NF_CT_DEFAULT_ZONE_ID;
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-	if (skb->nfct) {
+	if (skb->_nfct) {
 		enum ip_conntrack_info ctinfo;
 		const struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
 
@@ -61,7 +61,7 @@ static unsigned int ipv6_defrag(void *priv,
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 	/* Previously seen (loopback)?	*/
-	if (skb->nfct && !nf_ct_is_template((struct nf_conn *)skb->nfct))
+	if (skb->_nfct && !nf_ct_is_template((struct nf_conn *)(skb_nfct(skb))))
 		return NF_ACCEPT;
 #endif
 
diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c
index 5f52e5f90e7e..6d7a119d5eee 100644
--- a/net/ipv6/netfilter/nf_dup_ipv6.c
+++ b/net/ipv6/netfilter/nf_dup_ipv6.c
@@ -50,6 +50,8 @@ static bool nf_dup_ipv6_route(struct net *net, struct sk_buff *skb,
 void nf_dup_ipv6(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 		 const struct in6_addr *gw, int oif)
 {
+	struct nf_conn *untracked;
+
 	if (this_cpu_read(nf_skb_duplicated))
 		return;
 	skb = pskb_copy(skb, GFP_ATOMIC);
@@ -57,10 +59,10 @@ void nf_dup_ipv6(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 		return;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
+	untracked = nf_ct_untracked_get();
 	nf_reset(skb);
-	skb->nfct     = &nf_ct_untracked_get()->ct_general;
-	skb->nfctinfo = IP_CT_NEW;
-	nf_conntrack_get(skb->nfct);
+	nf_conntrack_get(&untracked->ct_general);
+	skb->_nfct = (unsigned long) untracked | IP_CT_NEW;
 #endif
 	if (hooknum == NF_INET_PRE_ROUTING ||
 	    hooknum == NF_INET_LOCAL_IN) {
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index ce6adfae521a..a87a6f8a74d8 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -375,7 +375,7 @@ void nf_ct_attach(struct sk_buff *new, const struct sk_buff *skb)
 {
 	void (*attach)(struct sk_buff *, const struct sk_buff *);
 
-	if (skb->nfct) {
+	if (skb->_nfct) {
 		rcu_read_lock();
 		attach = rcu_dereference(ip_ct_attach);
 		if (attach)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index adb7af3a4c4c..1fddcb728250 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -694,7 +694,7 @@ static int nf_ct_resolve_clash(struct net *net, struct sk_buff *skb,
 		/* Assign conntrack already in hashes to this skbuff. Don't
 		 * modify skb->nfctinfo to ensure consistent stateful filtering.
 		 */
-		skb->nfct = &ct->ct_general;
+		skb->_nfct = (unsigned long)ct | oldinfo;
 		return NF_ACCEPT;
 	}
 	NF_CT_STAT_INC(net, drop);
@@ -1223,7 +1223,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	return &ct->tuplehash[IP_CT_DIR_ORIGINAL];
 }
 
-/* On success, returns conntrack ptr, sets skb->nfct and ctinfo */
+/* On success, returns conntrack ptr, sets skb->_nfct | ctinfo */
 static inline struct nf_conn *
 resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
 		  struct sk_buff *skb,
@@ -1282,8 +1282,7 @@ resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
 		}
 		*set_reply = 0;
 	}
-	skb->nfct = &ct->ct_general;
-	skb->nfctinfo = *ctinfo;
+	skb->_nfct = (unsigned long) &ct->ct_general | *ctinfo;
 	return ct;
 }
 
@@ -1308,7 +1307,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 			NF_CT_STAT_INC_ATOMIC(net, ignore);
 			return NF_ACCEPT;
 		}
-		skb->nfct = NULL;
+		skb->_nfct = 0;
 	}
 
 	/* rcu_read_lock()ed by nf_hook_thresh */
@@ -1337,7 +1336,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 			goto out;
 		}
 		/* ICMP[v6] protocol trackers may assign one conntrack. */
-		if (skb->nfct)
+		if (skb->_nfct)
 			goto out;
 	}
 repeat:
@@ -1357,7 +1356,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 		goto out;
 	}
 
-	NF_CT_ASSERT(skb->nfct);
+	NF_CT_ASSERT(skb->_nfct);
 
 	/* Decide what timeout policy we want to apply to this flow. */
 	timeouts = nf_ct_timeout_lookup(net, ct, l4proto);
@@ -1368,7 +1367,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 		 * the netfilter core what to do */
 		pr_debug("nf_conntrack_in: Can't track with proto module\n");
 		nf_conntrack_put(&ct->ct_general);
-		skb->nfct = NULL;
+		skb->_nfct = 0;
 		NF_CT_STAT_INC_ATOMIC(net, invalid);
 		if (ret == -NF_DROP)
 			NF_CT_STAT_INC_ATOMIC(net, drop);
@@ -1526,9 +1525,8 @@ static void nf_conntrack_attach(struct sk_buff *nskb, const struct sk_buff *skb)
 		ctinfo = IP_CT_RELATED;
 
 	/* Attach to new skbuff, and increment count */
-	nskb->nfct = &ct->ct_general;
-	nskb->nfctinfo = ctinfo;
-	nf_conntrack_get(nskb->nfct);
+	nskb->_nfct = (unsigned long) &ct->ct_general | ctinfo;
+	nf_conntrack_get(skb_nfct(nskb));
 }
 
 /* Bring out ya dead! */
@@ -1864,7 +1862,7 @@ int nf_conntrack_init_start(void)
 	nf_conntrack_max = max_factor * nf_conntrack_htable_size;
 
 	nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
-						sizeof(struct nf_conn), 0,
+						sizeof(struct nf_conn), NFCT_INFOMASK + 1,
 						SLAB_DESTROY_BY_RCU | SLAB_HWCACHE_ALIGN, NULL);
 	if (!nf_conntrack_cachep)
 		goto err_cachep;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index d009ae663453..c1e5fea20fb2 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -642,6 +642,10 @@ static int __init nf_conntrack_standalone_init(void)
 	if (ret < 0)
 		goto out_start;
 
+	BUILD_BUG_ON(SKB_NFCT_PTRMASK != NFCT_PTRMASK);
+	BUILD_BUG_ON(NFCT_INFOMASK <= IP_CT_NUMBER);
+	BUILD_BUG_ON(NFCT_INFOMASK >= ARCH_KMALLOC_MINALIGN);
+
 #ifdef CONFIG_SYSCTL
 	nf_ct_netfilter_header =
 		register_net_sysctl(&init_net, "net", nf_ct_netfilter_table);
diff --git a/net/netfilter/nf_nat_helper.c b/net/netfilter/nf_nat_helper.c
index 2840abb5bb99..211661cb2c90 100644
--- a/net/netfilter/nf_nat_helper.c
+++ b/net/netfilter/nf_nat_helper.c
@@ -60,7 +60,7 @@ static void mangle_contents(struct sk_buff *skb,
 		__skb_trim(skb, skb->len + rep_len - match_len);
 	}
 
-	if (nf_ct_l3num((struct nf_conn *)skb->nfct) == NFPROTO_IPV4) {
+	if (nf_ct_l3num((struct nf_conn *)skb_nfct(skb)) == NFPROTO_IPV4) {
 		/* fix IP hdr checksum information */
 		ip_hdr(skb)->tot_len = htons(skb->len);
 		ip_send_check(ip_hdr(skb));
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index d774d7823688..e05865c98ae0 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -554,8 +554,7 @@ static void nft_notrack_eval(const struct nft_expr *expr,
 
 	ct = nf_ct_untracked_get();
 	atomic_inc(&ct->ct_general.use);
-	skb->nfct = &ct->ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->_nfct = (unsigned long)ct | IP_CT_NEW;
 }
 
 static struct nft_expr_type nft_notrack_type;
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 95c750358747..f18ceb47dbf0 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -23,15 +23,14 @@
 static inline int xt_ct_target(struct sk_buff *skb, struct nf_conn *ct)
 {
 	/* Previously seen (loopback)? Ignore. */
-	if (skb->nfct != NULL)
+	if (skb->_nfct != 0)
 		return XT_CONTINUE;
 
 	/* special case the untracked ct : we want the percpu object */
 	if (!ct)
 		ct = nf_ct_untracked_get();
 	atomic_inc(&ct->ct_general.use);
-	skb->nfct = &ct->ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->_nfct = (unsigned long) &ct->ct_general | IP_CT_NEW;
 
 	return XT_CONTINUE;
 }
@@ -407,12 +406,12 @@ static unsigned int
 notrack_tg(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	/* Previously seen (loopback)? Ignore. */
-	if (skb->nfct != NULL)
+	if (skb->_nfct != 0)
 		return XT_CONTINUE;
 
-	skb->nfct = &nf_ct_untracked_get()->ct_general;
-	skb->nfctinfo = IP_CT_NEW;
-	nf_conntrack_get(skb->nfct);
+	skb->_nfct = (unsigned long)
+			&nf_ct_untracked_get()->ct_general | IP_CT_NEW;
+	nf_conntrack_get(skb_nfct(skb));
 
 	return XT_CONTINUE;
 }
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 6b78bab27755..63c86d130008 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -157,7 +157,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
 	ovs_ct_get_labels(ct, &key->ct.labels);
 }
 
-/* Update 'key' based on skb->nfct.  If 'post_ct' is true, then OVS has
+/* Update 'key' based on skb->_nfct.  If 'post_ct' is true, then OVS has
  * previously sent the packet to conntrack via the ct action.  If
  * 'keep_nat_flags' is true, the existing NAT flags retained, else they are
  * initialized from the connection status.
@@ -421,11 +421,11 @@ ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h)
 
 /* Find an existing connection which this packet belongs to without
  * re-attributing statistics or modifying the connection state.  This allows an
- * skb->nfct lost due to an upcall to be recovered during actions execution.
+ * skb->_nfct lost due to an upcall to be recovered during actions execution.
  *
  * Must be called with rcu_read_lock.
  *
- * On success, populates skb->nfct and skb->nfctinfo, and returns the
+ * On success, populates skb->_nfct and returns the
  * connection.  Returns NULL if there is no existing entry.
  */
 static struct nf_conn *
@@ -460,12 +460,11 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone,
 
 	ct = nf_ct_tuplehash_to_ctrack(h);
 
-	skb->nfct = &ct->ct_general;
-	skb->nfctinfo = ovs_ct_get_info(h);
+	skb->_nfct = (unsigned long)ct | ovs_ct_get_info(h);
 	return ct;
 }
 
-/* Determine whether skb->nfct is equal to the result of conntrack lookup. */
+/* Determine whether skb->_nfct is equal to the result of conntrack lookup. */
 static bool skb_nfct_cached(struct net *net,
 			    const struct sw_flow_key *key,
 			    const struct ovs_conntrack_info *info,
@@ -476,7 +475,7 @@ static bool skb_nfct_cached(struct net *net,
 
 	ct = nf_ct_get(skb, &ctinfo);
 	/* If no ct, check if we have evidence that an existing conntrack entry
-	 * might be found for this skb.  This happens when we lose a skb->nfct
+	 * might be found for this skb.  This happens when we lose a skb->_nfct
 	 * due to an upcall.  If the connection was not confirmed, it is not
 	 * cached and needs to be run through conntrack again.
 	 */
@@ -721,11 +720,10 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
 
 		/* Associate skb with specified zone. */
 		if (tmpl) {
-			if (skb->nfct)
-				nf_conntrack_put(skb->nfct);
+			if (skb->_nfct)
+				nf_conntrack_put(skb_nfct(skb));
 			nf_conntrack_get(&tmpl->ct_general);
-			skb->nfct = &tmpl->ct_general;
-			skb->nfctinfo = IP_CT_NEW;
+			skb->_nfct = (unsigned long)tmpl | IP_CT_NEW;
 		}
 
 		err = nf_conntrack_in(net, info->family,
@@ -819,7 +817,7 @@ static int ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
 		if (err)
 			return err;
 
-		ct = (struct nf_conn *)skb->nfct;
+		ct = (struct nf_conn *)skb_nfct(skb);
 		if (ct)
 			nf_ct_deliver_cached_events(ct);
 	}
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 6575aba87630..3d6b9286c203 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -129,7 +129,7 @@ static u32 flow_get_mark(const struct sk_buff *skb)
 static u32 flow_get_nfct(const struct sk_buff *skb)
 {
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-	return addr_fold(skb->nfct);
+	return addr_fold(skb_nfct(skb));
 #else
 	return 0;
 #endif
-- 
2.7.3

^ permalink raw reply related

* [PATCH nf-next 3/4] netfilter: reduce direct skb->nfct usage
From: Florian Westphal @ 2017-01-04 15:35 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1483544150-10686-1-git-send-email-fw@strlen.de>

Next patch makes direct skb->nfct access illegal, reduce noise
in next patch by using accessors we already have.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/ip_vs.h               |  9 ++++++---
 net/netfilter/nf_conntrack_core.c | 15 +++++++++------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index cd6018a9ee24..2a344ebd7ebe 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1554,10 +1554,13 @@ static inline void ip_vs_notrack(struct sk_buff *skb)
 	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
 
 	if (!ct || !nf_ct_is_untracked(ct)) {
-		nf_conntrack_put(skb->nfct);
-		skb->nfct = &nf_ct_untracked_get()->ct_general;
+		struct nf_conn *untracked;
+
+		nf_conntrack_put(&ct->ct_general);
+		untracked = nf_ct_untracked_get();
+		nf_conntrack_get(&untracked->ct_general);
+		skb->nfct = &untracked->ct_general;
 		skb->nfctinfo = IP_CT_NEW;
-		nf_conntrack_get(skb->nfct);
 	}
 #endif
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 86186a2e2715..adb7af3a4c4c 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -686,8 +686,11 @@ static int nf_ct_resolve_clash(struct net *net, struct sk_buff *skb,
 	    !nfct_nat(ct) &&
 	    !nf_ct_is_dying(ct) &&
 	    atomic_inc_not_zero(&ct->ct_general.use)) {
-		nf_ct_acct_merge(ct, ctinfo, (struct nf_conn *)skb->nfct);
-		nf_conntrack_put(skb->nfct);
+		enum ip_conntrack_info oldinfo;
+		struct nf_conn *loser_ct = nf_ct_get(skb, &oldinfo);
+
+		nf_ct_acct_merge(ct, ctinfo, loser_ct);
+		nf_conntrack_put(&loser_ct->ct_general);
 		/* Assign conntrack already in hashes to this skbuff. Don't
 		 * modify skb->nfctinfo to ensure consistent stateful filtering.
 		 */
@@ -1288,7 +1291,7 @@ unsigned int
 nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 		struct sk_buff *skb)
 {
-	struct nf_conn *ct, *tmpl = NULL;
+	struct nf_conn *ct, *tmpl;
 	enum ip_conntrack_info ctinfo;
 	struct nf_conntrack_l3proto *l3proto;
 	struct nf_conntrack_l4proto *l4proto;
@@ -1298,9 +1301,9 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 	int set_reply = 0;
 	int ret;
 
-	if (skb->nfct) {
+	tmpl = nf_ct_get(skb, &ctinfo);
+	if (tmpl) {
 		/* Previously seen (loopback or untracked)?  Ignore. */
-		tmpl = (struct nf_conn *)skb->nfct;
 		if (!nf_ct_is_template(tmpl)) {
 			NF_CT_STAT_INC_ATOMIC(net, ignore);
 			return NF_ACCEPT;
@@ -1364,7 +1367,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 		/* Invalid: inverse of the return code tells
 		 * the netfilter core what to do */
 		pr_debug("nf_conntrack_in: Can't track with proto module\n");
-		nf_conntrack_put(skb->nfct);
+		nf_conntrack_put(&ct->ct_general);
 		skb->nfct = NULL;
 		NF_CT_STAT_INC_ATOMIC(net, invalid);
 		if (ret == -NF_DROP)
-- 
2.7.3

^ permalink raw reply related

* [PATCH nf-next 2/4] netfilter: reset netfilter state when duplicating packet
From: Florian Westphal @ 2017-01-04 15:35 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1483544150-10686-1-git-send-email-fw@strlen.de>

We should also toss nf_bridge_info, if any -- packet is leaving via
ip_local_out, also, this skb isn't bridged -- it is a locally generated
copy.  Also this avoids the need to touch this later when skb->nfct is
replaced with 'unsigned long _nfct' in followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/netfilter/nf_dup_ipv4.c | 2 +-
 net/ipv6/netfilter/nf_dup_ipv6.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index cf986e1c7bbd..a981ef7151ca 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -68,7 +68,7 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 	/* Avoid counting cloned packets towards the original connection. */
-	nf_conntrack_put(skb->nfct);
+	nf_reset(skb);
 	skb->nfct     = &nf_ct_untracked_get()->ct_general;
 	skb->nfctinfo = IP_CT_NEW;
 	nf_conntrack_get(skb->nfct);
diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c
index 4a84b5ad9ecb..5f52e5f90e7e 100644
--- a/net/ipv6/netfilter/nf_dup_ipv6.c
+++ b/net/ipv6/netfilter/nf_dup_ipv6.c
@@ -57,7 +57,7 @@ void nf_dup_ipv6(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 		return;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-	nf_conntrack_put(skb->nfct);
+	nf_reset(skb);
 	skb->nfct     = &nf_ct_untracked_get()->ct_general;
 	skb->nfctinfo = IP_CT_NEW;
 	nf_conntrack_get(skb->nfct);
-- 
2.7.3

^ permalink raw reply related

* RE: [PATCH 2/2] PCI: lock each enable/disable num_vfs operation in sysfs
From: Tantilov, Emil S @ 2017-01-04 16:00 UTC (permalink / raw)
  To: Gavin Shan
  Cc: linux-pci@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	Duyck, Alexander H, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170104021531.GA567@gwshan>

>-----Original Message-----
>From: Gavin Shan [mailto:gwshan@linux.vnet.ibm.com]
>Sent: Tuesday, January 03, 2017 6:16 PM
>To: Tantilov, Emil S <emil.s.tantilov@intel.com>
>Cc: linux-pci@vger.kernel.org; intel-wired-lan@lists.osuosl.org; Duyck,
>Alexander H <alexander.h.duyck@intel.com>; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: [PATCH 2/2] PCI: lock each enable/disable num_vfs operation in
>sysfs
>
>On Tue, Jan 03, 2017 at 04:48:31PM -0800, Emil Tantilov wrote:
>>Enabling/disabling SRIOV via sysfs by echo-ing multiple values
>>simultaneously:
>>
>>echo 63 > /sys/class/net/ethX/device/sriov_numvfs&
>>echo 63 > /sys/class/net/ethX/device/sriov_numvfs
>>
>>sleep 5
>>
>>echo 0 > /sys/class/net/ethX/device/sriov_numvfs&
>>echo 0 > /sys/class/net/ethX/device/sriov_numvfs
>>
>>Results in the following bug:
>>
>>kernel BUG at drivers/pci/iov.c:495!
>>invalid opcode: 0000 [#1] SMP
>>CPU: 1 PID: 8050 Comm: bash Tainted: G   W   4.9.0-rc7-net-next #2092
>>RIP: 0010:[<ffffffff813b1647>]
>>	  [<ffffffff813b1647>] pci_iov_release+0x57/0x60
>>
>>Call Trace:
>> [<ffffffff81391726>] pci_release_dev+0x26/0x70
>> [<ffffffff8155be6e>] device_release+0x3e/0xb0
>> [<ffffffff81365ee7>] kobject_cleanup+0x67/0x180
>> [<ffffffff81365d9d>] kobject_put+0x2d/0x60
>> [<ffffffff8155bc27>] put_device+0x17/0x20
>> [<ffffffff8139c08a>] pci_dev_put+0x1a/0x20
>> [<ffffffff8139cb6b>] pci_get_dev_by_id+0x5b/0x90
>> [<ffffffff8139cca5>] pci_get_subsys+0x35/0x40
>> [<ffffffff8139ccc8>] pci_get_device+0x18/0x20
>> [<ffffffff8139ccfb>] pci_get_domain_bus_and_slot+0x2b/0x60
>> [<ffffffff813b09e7>] pci_iov_remove_virtfn+0x57/0x180
>> [<ffffffff813b0b95>] pci_disable_sriov+0x65/0x140
>> [<ffffffffa00a1af7>] ixgbe_disable_sriov+0xc7/0x1d0 [ixgbe]
>> [<ffffffffa00a1e9d>] ixgbe_pci_sriov_configure+0x3d/0x170 [ixgbe]
>> [<ffffffff8139d28c>] sriov_numvfs_store+0xdc/0x130
>>...
>>RIP  [<ffffffff813b1647>] pci_iov_release+0x57/0x60
>>
>>Use the existing mutex lock to protect each enable/disable operation.
>>
>>CC: Alexander Duyck <alexander.h.duyck@intel.com>
>>Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
>
>Emil, It's going to change semantics of pci_enable_sriov() and pci_disable_sriov().
>They can be invoked when writing to the sysfs entry, or loading PF's
>driver. With the change applied, the lock (pf->sriov->lock) isn't acquired and released
>in the PF's driver loading path.

The enablement of SRIOV on driver load is done via deprecated module parameter.
Perhaps we can just remove it, although there are probably still people that use it
and may not be happy if we get rid of it. 

>I think the reasonable way would be adding a flag in "struct sriov", to
>indicate someone is accessing the IOV capability through sysfs file. With this, the
>code returns with "-EBUSY" immediately for contenders. With it, nothing is going
>to be changed in PF's driver loading path.

Flag is what I initially had in mind, but did not want to add extra locking if we 
can make use of the existing.

>Also, there are some minor comments as below and I guess most of them won't
>be applied if you take my suggestion eventually. However, I'm trying to make
>the comments complete.

Thanks a lot for reviewing!

>
>>---
>> drivers/pci/pci-sysfs.c |   24 +++++++++++++++++-------
>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>
>>diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>index 0666287..5b54cf5 100644
>>--- a/drivers/pci/pci-sysfs.c
>>+++ b/drivers/pci/pci-sysfs.c
>>@@ -472,7 +472,9 @@ static ssize_t sriov_numvfs_store(struct device *dev,
>> 				  const char *buf, size_t count)
>> {
>> 	struct pci_dev *pdev = to_pci_dev(dev);
>>+	struct pci_sriov *iov = pdev->sriov;
>> 	int ret;
>>+
>
>Unnecessary change.
>
>> 	u16 num_vfs;
>>
>> 	ret = kstrtou16(buf, 0, &num_vfs);
>>@@ -482,38 +484,46 @@ static ssize_t sriov_numvfs_store(struct device
>*dev,
>> 	if (num_vfs > pci_sriov_get_totalvfs(pdev))
>> 		return -ERANGE;
>>
>>+	mutex_lock(&iov->dev->sriov->lock);
>>+
>> 	if (num_vfs == pdev->sriov->num_VFs)
>>-		return count;		/* no change */
>>+		goto exit;
>>
>> 	/* is PF driver loaded w/callback */
>> 	if (!pdev->driver || !pdev->driver->sriov_configure) {
>> 		dev_info(&pdev->dev, "Driver doesn't support SRIOV
>configuration via sysfs\n");
>>-		return -ENOSYS;
>>+		ret = -EINVAL;
>>+		goto exit;
>
>Why we need change the error code here?

checkpatch was complaining about the use of the ENOSYS error code being specific
and even though it was not my patch introducing it I had to change it to shut it up.

>> 	}
>>
>> 	if (num_vfs == 0) {
>> 		/* disable VFs */
>> 		ret = pdev->driver->sriov_configure(pdev, 0);
>>-		if (ret < 0)
>>-			return ret;
>>-		return count;
>>+		goto exit;
>> 	}
>>
>> 	/* enable VFs */
>> 	if (pdev->sriov->num_VFs) {
>> 		dev_warn(&pdev->dev, "%d VFs already enabled. Disable before enabling %d VFs\n",
>> 			 pdev->sriov->num_VFs, num_vfs);
>>-		return -EBUSY;
>>+		ret = -EBUSY;
>>+		goto exit;
>> 	}
>>
>> 	ret = pdev->driver->sriov_configure(pdev, num_vfs);
>> 	if (ret < 0)
>>-		return ret;
>>+		goto exit;
>>
>> 	if (ret != num_vfs)
>> 		dev_warn(&pdev->dev, "%d VFs requested; only %d enabled\n",
>> 			 num_vfs, ret);
>>
>>+exit:
>>+	mutex_unlock(&iov->dev->sriov->lock);
>>+
>>+	if (ret < 0)
>>+		return ret;
>>+
>> 	return count;
>
>The code might be clearer if @ret is returned here. In that case, We need
>set it properly in error paths.

I played with different ways to handle this and this seemed the least intrusive.

Thanks,
Emil

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox