* Re: [PATCH net-next] bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
From: Jakub Kicinski @ 2016-11-30 22:50 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: davem, alexei.starovoitov, saeedm, Yuval.Mintz, netdev
In-Reply-To: <b70c47e0284823e6a5db600ae75c01eac4cf7922.1480538565.git.daniel@iogearbox.net>
On Wed, 30 Nov 2016 22:16:06 +0100, Daniel Borkmann wrote:
> After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
> the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
> need to hold rcu_read_lock() already to make sure BPF program doesn't
> get released in the background.
>
> Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
> Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
> in XDP supported drivers and to keep the typecheck on the context intact.
> For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
> just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
> out of the helper. When the driver gets atomic replace support, this will
> move to call-sites eventually.
>
> mlx5 needs actual fixing as it has the same issue as described already in
> 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
> that is, we're under RCU bh at this time, BPF programs are released via
> call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
> read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
> reset.
>
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
FWIW:
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Thanks!
^ permalink raw reply
* [PATCH 2/2] net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
From: Lino Sanfilippo @ 2016-11-30 22:48 UTC (permalink / raw)
To: vbridger; +Cc: nios2-dev, linux-kernel, netdev, Lino Sanfilippo
In-Reply-To: <1480546112-3099-1-git-send-email-LinoSanfilippo@gmx.de>
The driver already uses its private lock for synchronization between xmit
and xmit completion handler making the additional use of the xmit_lock
unnecessary.
Furthermore the driver does not set NETIF_F_LLTX resulting in xmit to be
called with the xmit_lock held and then taking the private lock while xmit
completion handler does the reverse, first take the private lock, then the
xmit_lock.
Fix these issues by not taking the xmit_lock in the tx completion handler.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
---
drivers/net/ethernet/altera/altera_tse_main.c | 2 --
1 file changed, 2 deletions(-)
Please note that this is only compile tested since I do not have the
concerning hardware.
diff --git a/drivers/net/ethernet/altera/altera_tse_main.c b/drivers/net/ethernet/altera/altera_tse_main.c
index 16c4163..cddc532 100644
--- a/drivers/net/ethernet/altera/altera_tse_main.c
+++ b/drivers/net/ethernet/altera/altera_tse_main.c
@@ -463,7 +463,6 @@ static int tse_tx_complete(struct altera_tse_private *priv)
if (unlikely(netif_queue_stopped(priv->dev) &&
tse_tx_avail(priv) > TSE_TX_THRESH(priv))) {
- netif_tx_lock(priv->dev);
if (netif_queue_stopped(priv->dev) &&
tse_tx_avail(priv) > TSE_TX_THRESH(priv)) {
if (netif_msg_tx_done(priv))
@@ -471,7 +470,6 @@ static int tse_tx_complete(struct altera_tse_private *priv)
__func__);
netif_wake_queue(priv->dev);
}
- netif_tx_unlock(priv->dev);
}
spin_unlock(&priv->tx_lock);
--
2.7.4
^ permalink raw reply related
* [PATCH 1/2] net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
From: Lino Sanfilippo @ 2016-11-30 22:48 UTC (permalink / raw)
To: vbridger; +Cc: nios2-dev, linux-kernel, netdev, Lino Sanfilippo
An explicit dma sync for device directly after mapping as well as an
explicit dma sync for cpu directly before unmapping is unnecessary and
costly on the hotpath. So remove these calls.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
---
drivers/net/ethernet/altera/altera_tse_main.c | 10 ----------
1 file changed, 10 deletions(-)
Please note that this is only compile tested since I do not have the
concerning hardware.
diff --git a/drivers/net/ethernet/altera/altera_tse_main.c b/drivers/net/ethernet/altera/altera_tse_main.c
index bda31f3..16c4163 100644
--- a/drivers/net/ethernet/altera/altera_tse_main.c
+++ b/drivers/net/ethernet/altera/altera_tse_main.c
@@ -400,12 +400,6 @@ static int tse_rx(struct altera_tse_private *priv, int limit)
skb_put(skb, pktlength);
- /* make cache consistent with receive packet buffer */
- dma_sync_single_for_cpu(priv->device,
- priv->rx_ring[entry].dma_addr,
- priv->rx_ring[entry].len,
- DMA_FROM_DEVICE);
-
dma_unmap_single(priv->device, priv->rx_ring[entry].dma_addr,
priv->rx_ring[entry].len, DMA_FROM_DEVICE);
@@ -592,10 +586,6 @@ static int tse_start_xmit(struct sk_buff *skb, struct net_device *dev)
buffer->dma_addr = dma_addr;
buffer->len = nopaged_len;
- /* Push data out of the cache hierarchy into main memory */
- dma_sync_single_for_device(priv->device, buffer->dma_addr,
- buffer->len, DMA_TO_DEVICE);
-
priv->dmaops->tx_buffer(priv, buffer);
skb_tx_timestamp(skb);
--
2.7.4
^ permalink raw reply related
* Re: [PATCH v2 1/7] net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
From: Martin Blumenstingl @ 2016-11-30 22:33 UTC (permalink / raw)
To: Rob Herring
Cc: linux-amlogic, devicetree, netdev, davem, khilman, mark.rutland,
linux-arm-kernel, alexandre.torgue, peppe.cavallaro, will.deacon,
catalin.marinas, carlo, f.fainelli
In-Reply-To: <20161130214429.albylffpoumbeugi@rob-hp-laptop>
On Wed, Nov 30, 2016 at 10:44 PM, Rob Herring <robh@kernel.org> wrote:
> On Fri, Nov 25, 2016 at 02:01:50PM +0100, Martin Blumenstingl wrote:
>> This allows configuring the RGMII TX clock delay. The RGMII clock is
>> generated by underlying hardware of the the Meson 8b / GXBB DWMAC glue.
>> The configuration depends on the actual hardware (no delay may be
>> needed due to the design of the actual circuit, the PHY might add this
>> delay, etc.).
>>
>> Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
>> ---
>> Documentation/devicetree/bindings/net/meson-dwmac.txt | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt b/Documentation/devicetree/bindings/net/meson-dwmac.txt
>> index 89e62dd..f8bc540 100644
>> --- a/Documentation/devicetree/bindings/net/meson-dwmac.txt
>> +++ b/Documentation/devicetree/bindings/net/meson-dwmac.txt
>> @@ -25,6 +25,20 @@ Required properties on Meson8b and newer:
>> - "clkin0" - first parent clock of the internal mux
>> - "clkin1" - second parent clock of the internal mux
>>
>> +Optional properties on Meson8b and newer:
>> +- amlogic,tx-delay-ns: The internal RGMII TX clock delay (provided
>> + by this driver) in nanoseconds. Allowed values
>> + are: 0ns, 2ns, 4ns, 6ns.
>> + This must be configured when the phy-mode is
>> + "rgmii" (typically a value of 2ns is used in
>> + this case).
>> + When phy-mode is set to "rgmii-id" or
>> + "rgmii-txid" the TX clock delay is already
>> + provided by the PHY. In that case this
>> + property should be set to 0ns (which disables
>> + the TX clock delay in the MAC to prevent the
>> + clock from going off because both PHY and MAC
>> + are adding a delay).
>
> What's the default? Why can't no property mean 0ns delay?
This value (2ns) was previously hardcoded. Having a default of 0ns
means that patch 7 ("ARM64: dts: amlogic: add the ethernet TX delay
configuration") becomes mandatory (otherwise we'll have broken Gbit
ethernet again because the TX delay is now disabled in the PHY when
phy-mode is "rgmii" and additionally we don't apply the 2ns TX delay
on the MAC side when the property is missing).
I'm fine with either way though - just let me know so I can adjust the
code accordingly.
^ permalink raw reply
* [PATCH] tcp_bbr: fix Kconfig to be able to make BBR the default congestion algorithm
From: Bernhard Held @ 2016-11-30 22:31 UTC (permalink / raw)
To: David S. Miller, linux-kernel, netdev; +Cc: Neal Cardwell
Add missing line to be able to make BBR the default
congestion algorithm.
Signed-off-by: Bernhard Held <berny156@gmx.de>
---
net/ipv4/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 300b068..b54b3ca 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -715,6 +715,7 @@ config DEFAULT_TCP_CONG
default "reno" if DEFAULT_RENO
default "dctcp" if DEFAULT_DCTCP
default "cdg" if DEFAULT_CDG
+ default "bbr" if DEFAULT_BBR
default "cubic"
config TCP_MD5SIG
^ permalink raw reply related
* Re: [WIP] net+mlx4: auto doorbell
From: Jesper Dangaard Brouer @ 2016-11-30 22:30 UTC (permalink / raw)
To: Eric Dumazet
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat,
brouer
In-Reply-To: <1480534200.18162.203.camel@edumazet-glaptop3.roam.corp.google.com>
On Wed, 30 Nov 2016 11:30:00 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
>
> > Don't take is as critique Eric. I was hoping your patch would have
> > solved this issue of being sensitive to TX completion adjustments. You
> > usually have good solutions for difficult issues. I basically rejected
> > Achiad's approach/patch because it was too sensitive to these kind of
> > adjustments.
>
> Well, this patch can hurt latencies, because a doorbell can be delayed,
> and softirqs can be delayed by many hundred of usec in some cases.
>
> I would not enable this behavior by default.
What about another scheme, where dev_hard_start_xmit() can return an
indication that driver choose not to flush (based on TX queue depth),
and there by requesting stack to call flush at a later point.
Would that introduce less latency issues?
Patch muckup (not even compile tested):
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4ffcd874cc20..d7d15e4e6766 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -109,6 +109,7 @@ enum netdev_tx {
__NETDEV_TX_MIN = INT_MIN, /* make sure enum is signed */
NETDEV_TX_OK = 0x00, /* driver took care of packet */
NETDEV_TX_BUSY = 0x10, /* driver tx path was busy*/
+ NETDEV_TX_FLUSHME= 0x04, /* driver request doorbell/flush later */
};
typedef enum netdev_tx netdev_tx_t;
@@ -536,6 +537,8 @@ enum netdev_queue_state_t {
__QUEUE_STATE_DRV_XOFF,
__QUEUE_STATE_STACK_XOFF,
__QUEUE_STATE_FROZEN,
+ // __QUEUE_STATE_NEED_FLUSH
+ // is is better to store in txq state?
};
#define QUEUE_STATE_DRV_XOFF (1 << __QUEUE_STATE_DRV_XOFF)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e6aa0a249672..7480e44c5a50 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -75,6 +75,7 @@ struct Qdisc {
void *u32_node;
struct netdev_queue *dev_queue;
+ struct netdev_queue *flush_dev_queue; // store txq to flush here?
struct gnet_stats_rate_est64 rate_est;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
@@ -98,6 +99,20 @@ struct Qdisc {
spinlock_t busylock ____cacheline_aligned_in_smp;
};
+static inline void qdisc_request_txq_flush(struct Qdisc *qdisc,
+ struct netdev_queue *txq)
+{
+ struct net_device dev;
+
+ if (qdisc->flush_dev_queue) {
+ if (likely(qdisc->flush_dev_queue == txq))
+ return;
+ /* Flush existing txq before reassignment */
+ dev_flush_xmit(qdisc_dev(q), txq);
+ }
+ qdisc->flush_dev_queue = txq;
+}
+
static inline bool qdisc_is_running(const struct Qdisc *qdisc)
{
return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
@@ -117,6 +132,19 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
+ /* flush device txq here, if needed */
+ if (qdisc->flush_dev_queue) {
+ struct netdev_queue *txq = qdisc->flush_dev_queue;
+ struct net_device *dev = qdisc_dev(q);
+
+ qdisc->flush_dev_queue = NULL;
+ dev_flush_xmit(dev, txq);
+ /*
+ * DISCUSS: it is too soon to flush here? What about
+ * rescheduling a NAPI poll cycle for this device,
+ * before calling flush.
+ */
+ }
write_seqcount_end(&qdisc->running);
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 048b46b7c92a..70339c267f33 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2880,6 +2880,15 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
}
EXPORT_SYMBOL(netif_skb_features);
+static int dev_flush_xmit(struct net_device *dev,
+ struct netdev_queue *txq)
+{
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ ops->ndo_flush_xmit(dev, txq);
+ // Oh oh, do we need to take HARD_TX_LOCK ??
+}
+
static int xmit_one(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq, bool more)
{
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6cfb6e9038c2..55c01b6f6311 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -190,6 +190,13 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
if (dev_xmit_complete(ret)) {
/* Driver sent out skb successfully or skb was consumed */
+ if (ret == NETDEV_TX_FLUSHME) {
+ /* Driver choose no-TX-doorbell MMIO write.
+ * This made taking qdisc root_lock less expensive.
+ */
+ qdisc_request_txq_flush(q, txq);
+ // Flush happens later in qdisc_run_end()
+ }
ret = qdisc_qlen(q);
} else {
/* Driver returned NETDEV_TX_BUSY - requeue skb */
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply related
* Re: [PATCH 4/6] net: ethernet: ti: cpts: add ptp pps support
From: Richard Cochran @ 2016-11-30 22:17 UTC (permalink / raw)
To: Grygorii Strashko
Cc: Murali Karicheri, Wingman Kwok, David S. Miller, netdev,
Mugunthan V N, Sekhar Nori, linux-kernel, linux-omap, Rob Herring,
devicetree
In-Reply-To: <875d4cc2-8a47-b06d-fb46-0cacc28dbaee@ti.com>
On Wed, Nov 30, 2016 at 02:43:57PM -0600, Grygorii Strashko wrote:
> > In order to produce the PPS edge correctly, you would have to adjust
> > the comparison value whenever cc.mult changes,
>
> yes. And that is done in cpts_ptp_adjfreq()
> if (cpts->ts_comp_enabled)
> cpts->ts_comp_one_sec_cycs = cpts_cc_ns2cyc(cpts, NSEC_PER_SEC);
> ^^^ re-calculate reload value for
>
> cpts_ts_comp_settime(cpts, ns);
> ^^^ adjust the ts_comp
And it races with the pulse itself. You forgot about this part:
> @@ -172,14 +232,31 @@ static int cpts_ptp_adjfreq(struct ptp_clock_info *ptp, s32 ppb)
> adj *= ppb;
> diff = div_u64(adj, 1000000000ULL);
>
> + mutex_lock(&cpts->ptp_clk_mutex);
> +
> spin_lock_irqsave(&cpts->lock, flags);
> + if (cpts->ts_comp_enabled) {
> + cpts_ts_comp_disable(cpts);
Sorry, but this is a train wreck.
> > but of course this is unworkable.
> >
>
> Sry, but this is questionable - code for pps comes from TI internal
> branches (SDK releases) where it survived for a pretty long time.
That doesn't mean the code is any good. If you adjust at the right
moment, then no pulse occurs at all!
> I'm, of course, agree that without HW support for freq adjustment
> this PPS feature is not super precise and has some limitation,
> but that is what we agree to live with.
I do NOT agree to live with this. I am one who is going to have to
explain to the world why their beagle bone PPS sucks.
Thanks,
Richard
^ permalink raw reply
* [PATCH] net: atarilance: use %8ph for printing hex string
From: Rasmus Villemoes @ 2016-11-30 22:02 UTC (permalink / raw)
To: Felipe Balbi; +Cc: Rasmus Villemoes, netdev, linux-kernel
This is already using the %pM printf extension; might as well also use
%ph to make the code smaller.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
drivers/net/ethernet/amd/atarilance.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/amd/atarilance.c b/drivers/net/ethernet/amd/atarilance.c
index d2bc8e5dcd23..d208a0532811 100644
--- a/drivers/net/ethernet/amd/atarilance.c
+++ b/drivers/net/ethernet/amd/atarilance.c
@@ -1013,13 +1013,9 @@ static int lance_rx( struct net_device *dev )
u_char *data = PKTBUF_ADDR(head);
printk(KERN_DEBUG "%s: RX pkt type 0x%04x from %pM to %pM "
- "data %02x %02x %02x %02x %02x %02x %02x %02x "
- "len %d\n",
+ "data %8ph len %d\n",
dev->name, ((u_short *)data)[6],
- &data[6], data,
- data[15], data[16], data[17], data[18],
- data[19], data[20], data[21], data[22],
- pkt_len);
+ &data[6], data, &data[15], pkt_len);
}
skb_reserve( skb, 2 ); /* 16 byte align */
--
2.1.4
^ permalink raw reply related
* Re: [PATCH net] cdc_ether: Fix handling connection notification
From: Kristian Evensen @ 2016-11-30 21:59 UTC (permalink / raw)
To: Bjørn Mork
Cc: Oliver Neukum, linux-usb, Network Development, linux-kernel,
Henning Schild
In-Reply-To: <87shq83acr.fsf@miraculix.mork.no>
On Wed, Nov 30, 2016 at 10:51 PM, Bjørn Mork <bjorn@mork.no> wrote:
> Kristian Evensen <kristian.evensen@gmail.com> writes:
>
>> +void usbnet_cdc_zte_status(struct usbnet *dev, struct urb *urb)
>> +{
>> + struct usb_cdc_notification *event;
>> +
>> + if (urb->actual_length < sizeof(*event))
>> + return;
>> +
>> + event = urb->transfer_buffer;
>> +
>> + if (event->bNotificationType != USB_CDC_NOTIFY_NETWORK_CONNECTION) {
>> + usbnet_cdc_status(dev, urb);
>> + return;
>> + }
>> +
>> + netif_dbg(dev, timer, dev->net, "CDC: carrier %s\n",
>> + event->wValue ? "on" : "off");
>> +
>> + if (event->wValue &&
>> + !test_bit(__LINK_STATE_NOCARRIER, &dev->net->state))
>> + usbnet_link_change(dev, 0, 0);
>> +
>> + usbnet_link_change(dev, !!event->wValue, 0);
>> +}
>
> As Henning said: Use netif_carrier_ok instead of open coding it.
>
> But I also think you need to replace the first usbnet_link_change() with
> a plain netif_carrier_off(dev->net). Calling usbnet_link_change() twice
> here is only going to set off the "kevent XX may have been dropped"
> message since you call schedule_work() twice without giving the work
> queue a chance to be processed. No need to do that.
Thanks for the feedback and agree. Will submit a v2 tomorrow.
-Kristian
^ permalink raw reply
* [PATCH 09/11] netfilter: nat: fix crash when conntrack entry is re-used
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
Stas Nichiporovich reports oops in nf_nat_bysource_cmp(), trying to
access nf_conn struct at address 0xffffffffffffff50.
This is the result of fetching a null rhash list (struct embedded at
offset 176; 0 - 176 gets us ...fff50).
The problem is that conntrack entries are allocated from a
SLAB_DESTROY_BY_RCU cache, i.e. entries can be free'd and reused
on another cpu while nf nat bysource hash access the same conntrack entry.
Freeing is fine (we hold rcu read lock); zeroing rhlist_head isn't.
-> Move the rhlist struct outside of the memset()-inited area.
Fixes: 7c9664351980aaa6a ("netfilter: move nat hlist_head to nf_conn")
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_conntrack.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index dc143ada9762..d9d52c020a70 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -100,6 +100,9 @@ struct nf_conn {
possible_net_t ct_net;
+#if IS_ENABLED(CONFIG_NF_NAT)
+ struct rhlist_head nat_bysource;
+#endif
/* all members below initialized via memset */
u8 __nfct_init_offset[0];
@@ -117,9 +120,6 @@ struct nf_conn {
/* Extensions */
struct nf_ct_ext *ext;
-#if IS_ENABLED(CONFIG_NF_NAT)
- struct rhlist_head nat_bysource;
-#endif
/* Storage reserved for other modules, must be the last member */
union nf_conntrack_proto proto;
};
--
2.1.4
^ permalink raw reply related
* [PATCH 10/11] netfilter: ipv6: nf_defrag: drop mangled skb on ream error
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
Dmitry Vyukov reported GPF in network stack that Andrey traced down to
negative nh offset in nf_ct_frag6_queue().
Problem is that all network headers before fragment header are pulled.
Normal ipv6 reassembly will drop the skb when errors occur further down
the line.
netfilter doesn't do this, and instead passed the original fragment
along. That was also fine back when netfilter ipv6 defrag worked with
cloned fragments, as the original, pristine fragment was passed on.
So we either have to undo the pull op, or discard such fragments.
Since they're malformed after all (e.g. overlapping fragment) it seems
preferrable to just drop them.
Same for temporary errors -- it doesn't make sense to accept (and
perhaps forward!) only some fragments of same datagram.
Fixes: 029f7f3b8701cc7ac ("netfilter: ipv6: nf_defrag: avoid/free clone operations")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Debugged-by: Andrey Konovalov <andreyknvl@google.com>
Diagnosed-by: Eric Dumazet <Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++--
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index e4347aeb2e65..9948b5ce52da 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -576,11 +576,11 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
/* Jumbo payload inhibits frag. header */
if (ipv6_hdr(skb)->payload_len == 0) {
pr_debug("payload len = 0\n");
- return -EINVAL;
+ return 0;
}
if (find_prev_fhdr(skb, &prevhdr, &nhoff, &fhoff) < 0)
- return -EINVAL;
+ return 0;
if (!pskb_may_pull(skb, fhoff + sizeof(*fhdr)))
return -ENOMEM;
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index f7aab5ab93a5..f06b0471f39f 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -69,7 +69,7 @@ static unsigned int ipv6_defrag(void *priv,
if (err == -EINPROGRESS)
return NF_STOLEN;
- return NF_ACCEPT;
+ return err == 0 ? NF_ACCEPT : NF_DROP;
}
static struct nf_hook_ops ipv6_defrag_ops[] = {
--
2.1.4
^ permalink raw reply related
* [PATCH 11/11] netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: Hongxu Jia <hongxu.jia@windriver.com>
Since 09d9686047db ("netfilter: x_tables: do compat validation via
translate_table"), it used compatr structure to assign newinfo
structure. In translate_compat_table of ip_tables.c and ip6_tables.c,
it used compatr->hook_entry to replace info->hook_entry and
compatr->underflow to replace info->underflow, but not do the same
replacement in arp_tables.c.
It caused invoking 32-bit "arptbale -P INPUT ACCEPT" failed in 64bit
kernel.
--------------------------------------
root@qemux86-64:~# arptables -P INPUT ACCEPT
root@qemux86-64:~# arptables -P INPUT ACCEPT
ERROR: Policy for `INPUT' offset 448 != underflow 0
arptables: Incompatible with this kernel
--------------------------------------
Fixes: 09d9686047db ("netfilter: x_tables: do compat validation via translate_table")
Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/ipv4/netfilter/arp_tables.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index b31df597fd37..697538464e6e 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -1201,8 +1201,8 @@ static int translate_compat_table(struct xt_table_info **pinfo,
newinfo->number = compatr->num_entries;
for (i = 0; i < NF_ARP_NUMHOOKS; i++) {
- newinfo->hook_entry[i] = info->hook_entry[i];
- newinfo->underflow[i] = info->underflow[i];
+ newinfo->hook_entry[i] = compatr->hook_entry[i];
+ newinfo->underflow[i] = compatr->underflow[i];
}
entry1 = newinfo->entries;
pos = entry1;
--
2.1.4
^ permalink raw reply related
* [PATCH 08/11] netfilter: nft_range: add the missing NULL pointer check
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: Liping Zhang <zlpnobody@gmail.com>
Otherwise, kernel panic will happen if the user does not specify
the related attributes.
Fixes: 0f3cd9b36977 ("netfilter: nf_tables: add range expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nft_range.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/netfilter/nft_range.c b/net/netfilter/nft_range.c
index fbc88009ca2e..8f0aaaea1376 100644
--- a/net/netfilter/nft_range.c
+++ b/net/netfilter/nft_range.c
@@ -59,6 +59,12 @@ static int nft_range_init(const struct nft_ctx *ctx, const struct nft_expr *expr
int err;
u32 op;
+ if (!tb[NFTA_RANGE_SREG] ||
+ !tb[NFTA_RANGE_OP] ||
+ !tb[NFTA_RANGE_FROM_DATA] ||
+ !tb[NFTA_RANGE_TO_DATA])
+ return -EINVAL;
+
err = nft_data_init(NULL, &priv->data_from, sizeof(priv->data_from),
&desc_from, tb[NFTA_RANGE_FROM_DATA]);
if (err < 0)
--
2.1.4
^ permalink raw reply related
* [PATCH 04/11] netfilter: nft_hash: validate maximum value of u32 netlink hash attribute
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: Laura Garcia Liebana <nevola@gmail.com>
Use the function nft_parse_u32_check() to fetch the value and validate
the u32 attribute into the hash len u8 field.
This patch revisits 4da449ae1df9 ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").
Fixes: cb1b69b0b15b ("netfilter: nf_tables: add hash expression")
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nft_hash.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index baf694de3935..d5447a22275c 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -53,6 +53,7 @@ static int nft_hash_init(const struct nft_ctx *ctx,
{
struct nft_hash *priv = nft_expr_priv(expr);
u32 len;
+ int err;
if (!tb[NFTA_HASH_SREG] ||
!tb[NFTA_HASH_DREG] ||
@@ -67,8 +68,10 @@ static int nft_hash_init(const struct nft_ctx *ctx,
priv->sreg = nft_parse_register(tb[NFTA_HASH_SREG]);
priv->dreg = nft_parse_register(tb[NFTA_HASH_DREG]);
- len = ntohl(nla_get_be32(tb[NFTA_HASH_LEN]));
- if (len == 0 || len > U8_MAX)
+ err = nft_parse_u32_check(tb[NFTA_HASH_LEN], U8_MAX, &len);
+ if (err < 0)
+ return err;
+ if (len == 0)
return -ERANGE;
priv->len = len;
--
2.1.4
^ permalink raw reply related
* [PATCH 02/11] netfilter: Update nf_send_reset6 to consider L3 domain
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: David Ahern <dsa@cumulusnetworks.com>
nf_send_reset6 is not considering the L3 domain and lookups are sent
to the wrong table. For example consider the following output rule:
ip6tables -A OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset
using perf to analyze lookups via the fib6_table_lookup tracepoint shows:
swapper 0 [001] 248.787816: fib6:fib6_table_lookup: table 255 oif 0 iif 1 src 2100:1::3 dst 2100:1:
ffffffff81439cdc perf_trace_fib6_table_lookup ([kernel.kallsyms])
ffffffff814c1ce3 trace_fib6_table_lookup ([kernel.kallsyms])
ffffffff814c3e89 ip6_pol_route ([kernel.kallsyms])
ffffffff814c40d5 ip6_pol_route_output ([kernel.kallsyms])
ffffffff814e7b6f fib6_rule_action ([kernel.kallsyms])
ffffffff81437f60 fib_rules_lookup ([kernel.kallsyms])
ffffffff814e7c79 fib6_rule_lookup ([kernel.kallsyms])
ffffffff814c2541 ip6_route_output_flags ([kernel.kallsyms])
528 nf_send_reset6 ([nf_reject_ipv6])
The lookup is directed to table 255 rather than the table associated with
the device via the L3 domain. Update nf_send_reset6 to pull the L3 domain
from the dst currently attached to the skb.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/ipv6/netfilter/nf_reject_ipv6.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index a5400223fd74..10090400c72f 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -156,6 +156,7 @@ void nf_send_reset6(struct net *net, struct sk_buff *oldskb, int hook)
fl6.daddr = oip6h->saddr;
fl6.fl6_sport = otcph->dest;
fl6.fl6_dport = otcph->source;
+ fl6.flowi6_oif = l3mdev_master_ifindex(skb_dst(oldskb)->dev);
security_skb_classify_flow(oldskb, flowi6_to_flowi(&fl6));
dst = ip6_route_output(net, NULL, &fl6);
if (dst->error) {
--
2.1.4
^ permalink raw reply related
* [PATCH 07/11] netfilter: nf_tables: fix inconsistent element expiration calculation
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: "Anders K. Pedersen" <akp@cohaesio.com>
As Liping Zhang reports, after commit a8b1e36d0d1d ("netfilter: nft_dynset:
fix element timeout for HZ != 1000"), priv->timeout was stored in jiffies,
while set->timeout was stored in milliseconds. This is inconsistent and
incorrect.
Firstly, we already call msecs_to_jiffies in nft_set_elem_init, so
priv->timeout will be converted to jiffies twice.
Secondly, if the user did not specify the NFTA_DYNSET_TIMEOUT attr,
set->timeout will be used, but we forget to call msecs_to_jiffies
when do update elements.
Fix this by using jiffies internally for traditional sets and doing the
conversions to/from msec when interacting with userspace - as dynset
already does.
This is preferable to doing the conversions, when elements are inserted or
updated, because this can happen very frequently on busy dynsets.
Fixes: a8b1e36d0d1d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Reported-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Acked-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_tables.h | 2 +-
net/netfilter/nf_tables_api.c | 14 +++++++++-----
2 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index d79d1e9b9546..b02af0bf5777 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -313,7 +313,7 @@ void nft_unregister_set(struct nft_set_ops *ops);
* @size: maximum set size
* @nelems: number of elements
* @ndeact: number of deactivated elements queued for removal
- * @timeout: default timeout value in msecs
+ * @timeout: default timeout value in jiffies
* @gc_int: garbage collection interval in msecs
* @policy: set parameterization (see enum nft_set_policies)
* @udlen: user data length
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 026581b04ea8..e5194f6f906c 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2570,7 +2570,8 @@ static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
}
if (set->timeout &&
- nla_put_be64(skb, NFTA_SET_TIMEOUT, cpu_to_be64(set->timeout),
+ nla_put_be64(skb, NFTA_SET_TIMEOUT,
+ cpu_to_be64(jiffies_to_msecs(set->timeout)),
NFTA_SET_PAD))
goto nla_put_failure;
if (set->gc_int &&
@@ -2859,7 +2860,8 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk,
if (nla[NFTA_SET_TIMEOUT] != NULL) {
if (!(flags & NFT_SET_TIMEOUT))
return -EINVAL;
- timeout = be64_to_cpu(nla_get_be64(nla[NFTA_SET_TIMEOUT]));
+ timeout = msecs_to_jiffies(be64_to_cpu(nla_get_be64(
+ nla[NFTA_SET_TIMEOUT])));
}
gc_int = 0;
if (nla[NFTA_SET_GC_INTERVAL] != NULL) {
@@ -3178,7 +3180,8 @@ static int nf_tables_fill_setelem(struct sk_buff *skb,
if (nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT) &&
nla_put_be64(skb, NFTA_SET_ELEM_TIMEOUT,
- cpu_to_be64(*nft_set_ext_timeout(ext)),
+ cpu_to_be64(jiffies_to_msecs(
+ *nft_set_ext_timeout(ext))),
NFTA_SET_ELEM_PAD))
goto nla_put_failure;
@@ -3447,7 +3450,7 @@ void *nft_set_elem_init(const struct nft_set *set,
memcpy(nft_set_ext_data(ext), data, set->dlen);
if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPIRATION))
*nft_set_ext_expiration(ext) =
- jiffies + msecs_to_jiffies(timeout);
+ jiffies + timeout;
if (nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT))
*nft_set_ext_timeout(ext) = timeout;
@@ -3535,7 +3538,8 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
if (nla[NFTA_SET_ELEM_TIMEOUT] != NULL) {
if (!(set->flags & NFT_SET_TIMEOUT))
return -EINVAL;
- timeout = be64_to_cpu(nla_get_be64(nla[NFTA_SET_ELEM_TIMEOUT]));
+ timeout = msecs_to_jiffies(be64_to_cpu(nla_get_be64(
+ nla[NFTA_SET_ELEM_TIMEOUT])));
} else if (set->flags & NFT_SET_TIMEOUT) {
timeout = set->timeout;
}
--
2.1.4
^ permalink raw reply related
* [PATCH 05/11] netfilter: nat: fix cmp return value
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
The comparator works like memcmp, i.e. 0 means objects are equal.
In other words, when objects are distinct they are treated as identical,
when they are distinct they are allegedly the same.
The first case is rare (distinct objects are unlikely to get hashed to
same bucket).
The second case results in unneeded port conflict resolutions attempts.
Fixes: 870190a9ec907 ("netfilter: nat: convert nat bysrc hash to rhashtable")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_nat_core.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index bbb8f3df79f7..c632429706eb 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -193,9 +193,12 @@ static int nf_nat_bysource_cmp(struct rhashtable_compare_arg *arg,
const struct nf_nat_conn_key *key = arg->key;
const struct nf_conn *ct = obj;
- return same_src(ct, key->tuple) &&
- net_eq(nf_ct_net(ct), key->net) &&
- nf_ct_zone_equal(ct, key->zone, IP_CT_DIR_ORIGINAL);
+ if (!same_src(ct, key->tuple) ||
+ !net_eq(nf_ct_net(ct), key->net) ||
+ !nf_ct_zone_equal(ct, key->zone, IP_CT_DIR_ORIGINAL))
+ return 1;
+
+ return 0;
}
static struct rhashtable_params nf_nat_bysource_params = {
--
2.1.4
^ permalink raw reply related
* [PATCH 01/11] netfilter: Update ip_route_me_harder to consider L3 domain
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: David Ahern <dsa@cumulusnetworks.com>
ip_route_me_harder is not considering the L3 domain and sending lookups
to the wrong table. For example consider the following output rule:
iptables -I OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset
using perf to analyze lookups via the fib_table_lookup tracepoint shows:
vrf-test 1187 [001] 46887.295927: fib:fib_table_lookup: table 255 oif 0 iif 0 src 0.0.0.0 dst 10.100.1.254 tos 0 scope 0 flags 0
ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
ffffffff8148dda3 __inet_dev_addr_type ([kernel.kallsyms])
ffffffff8148ddf6 inet_addr_type ([kernel.kallsyms])
ffffffff8149e344 ip_route_me_harder ([kernel.kallsyms])
and
vrf-test 1187 [001] 46887.295933: fib:fib_table_lookup: table 255 oif 0 iif 1 src 10.100.1.254 dst 10.100.1.2 tos 0 scope 0 flags
ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
ffffffff814998ff fib4_rule_action ([kernel.kallsyms])
ffffffff81437f35 fib_rules_lookup ([kernel.kallsyms])
ffffffff81499758 __fib_lookup ([kernel.kallsyms])
ffffffff8144f010 fib_lookup.constprop.34 ([kernel.kallsyms])
ffffffff8144f759 __ip_route_output_key_hash ([kernel.kallsyms])
ffffffff8144fc6a ip_route_output_flow ([kernel.kallsyms])
ffffffff8149e39b ip_route_me_harder ([kernel.kallsyms])
In both cases the lookups are directed to table 255 rather than the
table associated with the device via the L3 domain. Update both
lookups to pull the L3 domain from the dst currently attached to the
skb.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/ipv4/netfilter.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index c3776ff6749f..b3cc1335adbc 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -24,10 +24,11 @@ int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
struct flowi4 fl4 = {};
__be32 saddr = iph->saddr;
__u8 flags = skb->sk ? inet_sk_flowi_flags(skb->sk) : 0;
+ struct net_device *dev = skb_dst(skb)->dev;
unsigned int hh_len;
if (addr_type == RTN_UNSPEC)
- addr_type = inet_addr_type(net, saddr);
+ addr_type = inet_addr_type_dev_table(net, dev, saddr);
if (addr_type == RTN_LOCAL || addr_type == RTN_UNICAST)
flags |= FLOWI_FLAG_ANYSRC;
else
@@ -40,6 +41,8 @@ int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
fl4.saddr = saddr;
fl4.flowi4_tos = RT_TOS(iph->tos);
fl4.flowi4_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0;
+ if (!fl4.flowi4_oif)
+ fl4.flowi4_oif = l3mdev_master_ifindex(dev);
fl4.flowi4_mark = skb->mark;
fl4.flowi4_flags = flags;
rt = ip_route_output_key(net, &fl4);
--
2.1.4
^ permalink raw reply related
* [PATCH 06/11] netfilter: nat: switch to new rhlist interface
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
I got offlist bug report about failing connections and high cpu usage.
This happens because we hit 'elasticity' checks in rhashtable that
refuses bucket list exceeding 16 entries.
The nat bysrc hash unfortunately needs to insert distinct objects that
share same key and are identical (have same source tuple), this cannot
be avoided.
Switch to the rhlist interface which is designed for this.
The nulls_base is removed here, I don't think its needed:
A (unlikely) false positive results in unneeded port clash resolution,
a false negative results in packet drop during conntrack confirmation,
when we try to insert the duplicate into main conntrack hash table.
Tested by adding multiple ip addresses to host, then adding
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
... and then creating multiple connections, from same source port but
different addresses:
for i in $(seq 2000 2032);do nc -p 1234 192.168.7.1 $i > /dev/null & done
(all of these then get hashed to same bysource slot)
Then, to test that nat conflict resultion is working:
nc -s 10.0.0.1 -p 1234 192.168.7.1 2000
nc -s 10.0.0.2 -p 1234 192.168.7.1 2000
tcp .. src=10.0.0.1 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1024 [ASSURED]
tcp .. src=10.0.0.2 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1025 [ASSURED]
tcp .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1234 [ASSURED]
tcp .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2001 src=192.168.7.1 dst=192.168.7.10 sport=2001 dport=1234 [ASSURED]
[..]
-> nat altered source ports to 1024 and 1025, respectively.
This can also be confirmed on destination host which shows
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1024
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1025
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1234
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Fixes: 870190a9ec907 ("netfilter: nat: convert nat bysrc hash to rhashtable")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_conntrack.h | 2 +-
net/netfilter/nf_nat_core.c | 40 +++++++++++++++++++++---------------
2 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 50418052a520..dc143ada9762 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -118,7 +118,7 @@ struct nf_conn {
struct nf_ct_ext *ext;
#if IS_ENABLED(CONFIG_NF_NAT)
- struct rhash_head nat_bysource;
+ struct rhlist_head nat_bysource;
#endif
/* Storage reserved for other modules, must be the last member */
union nf_conntrack_proto proto;
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index c632429706eb..5b9c884a452e 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -42,7 +42,7 @@ struct nf_nat_conn_key {
const struct nf_conntrack_zone *zone;
};
-static struct rhashtable nf_nat_bysource_table;
+static struct rhltable nf_nat_bysource_table;
inline const struct nf_nat_l3proto *
__nf_nat_l3proto_find(u8 family)
@@ -207,7 +207,6 @@ static struct rhashtable_params nf_nat_bysource_params = {
.obj_cmpfn = nf_nat_bysource_cmp,
.nelem_hint = 256,
.min_size = 1024,
- .nulls_base = (1U << RHT_BASE_SHIFT),
};
/* Only called for SRC manip */
@@ -226,12 +225,15 @@ find_appropriate_src(struct net *net,
.tuple = tuple,
.zone = zone
};
+ struct rhlist_head *hl;
- ct = rhashtable_lookup_fast(&nf_nat_bysource_table, &key,
- nf_nat_bysource_params);
- if (!ct)
+ hl = rhltable_lookup(&nf_nat_bysource_table, &key,
+ nf_nat_bysource_params);
+ if (!hl)
return 0;
+ ct = container_of(hl, typeof(*ct), nat_bysource);
+
nf_ct_invert_tuplepr(result,
&ct->tuplehash[IP_CT_DIR_REPLY].tuple);
result->dst = tuple->dst;
@@ -449,11 +451,17 @@ nf_nat_setup_info(struct nf_conn *ct,
}
if (maniptype == NF_NAT_MANIP_SRC) {
+ struct nf_nat_conn_key key = {
+ .net = nf_ct_net(ct),
+ .tuple = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+ .zone = nf_ct_zone(ct),
+ };
int err;
- err = rhashtable_insert_fast(&nf_nat_bysource_table,
- &ct->nat_bysource,
- nf_nat_bysource_params);
+ err = rhltable_insert_key(&nf_nat_bysource_table,
+ &key,
+ &ct->nat_bysource,
+ nf_nat_bysource_params);
if (err)
return NF_DROP;
}
@@ -570,8 +578,8 @@ static int nf_nat_proto_clean(struct nf_conn *ct, void *data)
* will delete entry from already-freed table.
*/
ct->status &= ~IPS_NAT_DONE_MASK;
- rhashtable_remove_fast(&nf_nat_bysource_table, &ct->nat_bysource,
- nf_nat_bysource_params);
+ rhltable_remove(&nf_nat_bysource_table, &ct->nat_bysource,
+ nf_nat_bysource_params);
/* don't delete conntrack. Although that would make things a lot
* simpler, we'd end up flushing all conntracks on nat rmmod.
@@ -701,8 +709,8 @@ static void nf_nat_cleanup_conntrack(struct nf_conn *ct)
if (!nat)
return;
- rhashtable_remove_fast(&nf_nat_bysource_table, &ct->nat_bysource,
- nf_nat_bysource_params);
+ rhltable_remove(&nf_nat_bysource_table, &ct->nat_bysource,
+ nf_nat_bysource_params);
}
static struct nf_ct_ext_type nat_extend __read_mostly = {
@@ -837,13 +845,13 @@ static int __init nf_nat_init(void)
{
int ret;
- ret = rhashtable_init(&nf_nat_bysource_table, &nf_nat_bysource_params);
+ ret = rhltable_init(&nf_nat_bysource_table, &nf_nat_bysource_params);
if (ret)
return ret;
ret = nf_ct_extend_register(&nat_extend);
if (ret < 0) {
- rhashtable_destroy(&nf_nat_bysource_table);
+ rhltable_destroy(&nf_nat_bysource_table);
printk(KERN_ERR "nf_nat_core: Unable to register extension\n");
return ret;
}
@@ -867,7 +875,7 @@ static int __init nf_nat_init(void)
return 0;
cleanup_extend:
- rhashtable_destroy(&nf_nat_bysource_table);
+ rhltable_destroy(&nf_nat_bysource_table);
nf_ct_extend_unregister(&nat_extend);
return ret;
}
@@ -886,7 +894,7 @@ static void __exit nf_nat_cleanup(void)
for (i = 0; i < NFPROTO_NUMPROTO; i++)
kfree(nf_nat_l4protos[i]);
- rhashtable_destroy(&nf_nat_bysource_table);
+ rhltable_destroy(&nf_nat_bysource_table);
}
MODULE_LICENSE("GPL");
--
2.1.4
^ permalink raw reply related
* [PATCH 03/11] netfilter: fix nf_conntrack_helper documentation
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
Since kernel 4.7 this defaults to off.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
Documentation/networking/nf_conntrack-sysctl.txt | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt
index 399e4e866a9c..433b6724797a 100644
--- a/Documentation/networking/nf_conntrack-sysctl.txt
+++ b/Documentation/networking/nf_conntrack-sysctl.txt
@@ -62,10 +62,13 @@ nf_conntrack_generic_timeout - INTEGER (seconds)
protocols.
nf_conntrack_helper - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ 0 - disabled (default)
+ not 0 - enabled
Enable automatic conntrack helper assignment.
+ If disabled it is required to set up iptables rules to assign
+ helpers to connections. See the CT target description in the
+ iptables-extensions(8) man page for further information.
nf_conntrack_icmp_timeout - INTEGER (seconds)
default 30
--
2.1.4
^ permalink raw reply related
* [PATCH 00/11] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2016-11-30 21:57 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
Hi David,
This is a large batch of Netfilter fixes for net, they are:
1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist
structure that allows to have several objects with the same key.
Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is
expecting a return value similar to memcmp(). Change location of
the nat_bysource field in the nf_conn structure to avoid zeroing
this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us
to crashes. From Florian Westphal.
2) Don't allow malformed fragments go through in IPv6, drop them,
otherwise we hit GPF, patch from Florian Westphal.
3) Fix crash if attributes are missing in nft_range, from Liping Zhang.
4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia.
5) Two patches from David Ahern to fix netfilter interaction with vrf.
From David Ahern.
6) Fix element timeout calculation in nf_tables, we take milliseconds
from userspace, but we use jiffies from kernelspace. Patch from
Anders K. Pedersen.
7) Missing validation length netlink attribute for nft_hash, from
Laura Garcia.
8) Fix nf_conntrack_helper documentation, we don't default to off
anymore for a bit of time so let's get this in sync with the code.
I know is late but I think these are important, specifically the NAT
bits, as they are mostly addressing fallout from recent changes. I also
read there are chances to have -rc8, if that is the case, that would
also give us a bit more time to test this.
You can pull these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git
Thanks!
----------------------------------------------------------------
The following changes since commit b6e01232e25629907df9db19f25da7d4e8f5b589:
net/mlx4_en: Free netdev resources under state lock (2016-11-23 20:18:36 -0500)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD
for you to fetch changes up to 17a49cd549d9dc8707dc9262210166455c612dde:
netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel (2016-11-30 20:50:23 +0100)
----------------------------------------------------------------
Anders K. Pedersen (1):
netfilter: nf_tables: fix inconsistent element expiration calculation
David Ahern (2):
netfilter: Update ip_route_me_harder to consider L3 domain
netfilter: Update nf_send_reset6 to consider L3 domain
Florian Westphal (5):
netfilter: fix nf_conntrack_helper documentation
netfilter: nat: fix cmp return value
netfilter: nat: switch to new rhlist interface
netfilter: nat: fix crash when conntrack entry is re-used
netfilter: ipv6: nf_defrag: drop mangled skb on ream error
Hongxu Jia (1):
netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel
Laura Garcia Liebana (1):
netfilter: nft_hash: validate maximum value of u32 netlink hash attribute
Liping Zhang (1):
netfilter: nft_range: add the missing NULL pointer check
Documentation/networking/nf_conntrack-sysctl.txt | 7 +++-
include/net/netfilter/nf_conntrack.h | 6 +--
include/net/netfilter/nf_tables.h | 2 +-
net/ipv4/netfilter.c | 5 ++-
net/ipv4/netfilter/arp_tables.c | 4 +-
net/ipv6/netfilter/nf_conntrack_reasm.c | 4 +-
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 2 +-
net/ipv6/netfilter/nf_reject_ipv6.c | 1 +
net/netfilter/nf_nat_core.c | 49 +++++++++++++++---------
net/netfilter/nf_tables_api.c | 14 ++++---
net/netfilter/nft_hash.c | 7 +++-
net/netfilter/nft_range.c | 6 +++
12 files changed, 69 insertions(+), 38 deletions(-)
^ permalink raw reply
* Re: [PATCH net] cdc_ether: Fix handling connection notification
From: Bjørn Mork @ 2016-11-30 21:51 UTC (permalink / raw)
To: Kristian Evensen; +Cc: oliver, linux-usb, netdev, linux-kernel, henning.schild
In-Reply-To: <20161130184216.4224-1-kristian.evensen@gmail.com>
Kristian Evensen <kristian.evensen@gmail.com> writes:
> +void usbnet_cdc_zte_status(struct usbnet *dev, struct urb *urb)
> +{
> + struct usb_cdc_notification *event;
> +
> + if (urb->actual_length < sizeof(*event))
> + return;
> +
> + event = urb->transfer_buffer;
> +
> + if (event->bNotificationType != USB_CDC_NOTIFY_NETWORK_CONNECTION) {
> + usbnet_cdc_status(dev, urb);
> + return;
> + }
> +
> + netif_dbg(dev, timer, dev->net, "CDC: carrier %s\n",
> + event->wValue ? "on" : "off");
> +
> + if (event->wValue &&
> + !test_bit(__LINK_STATE_NOCARRIER, &dev->net->state))
> + usbnet_link_change(dev, 0, 0);
> +
> + usbnet_link_change(dev, !!event->wValue, 0);
> +}
As Henning said: Use netif_carrier_ok instead of open coding it.
But I also think you need to replace the first usbnet_link_change() with
a plain netif_carrier_off(dev->net). Calling usbnet_link_change() twice
here is only going to set off the "kevent XX may have been dropped"
message since you call schedule_work() twice without giving the work
queue a chance to be processed. No need to do that.
Bjørn
^ permalink raw reply
* Re: [PATCH v2 1/7] net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
From: Rob Herring @ 2016-11-30 21:44 UTC (permalink / raw)
To: Martin Blumenstingl
Cc: linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
devicetree-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
davem-fT/PcQaiUtIeIZ0/mPfg9Q, khilman-rdvid1DuHRBWk0Htik3J/w,
mark.rutland-5wv7dgnIgG8,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
alexandre.torgue-qxv4g6HH51o, peppe.cavallaro-qxv4g6HH51o,
will.deacon-5wv7dgnIgG8, catalin.marinas-5wv7dgnIgG8,
carlo-KA+7E9HrN00dnm+yROfE0A, f.fainelli-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20161125130156.17879-2-martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>
On Fri, Nov 25, 2016 at 02:01:50PM +0100, Martin Blumenstingl wrote:
> This allows configuring the RGMII TX clock delay. The RGMII clock is
> generated by underlying hardware of the the Meson 8b / GXBB DWMAC glue.
> The configuration depends on the actual hardware (no delay may be
> needed due to the design of the actual circuit, the PHY might add this
> delay, etc.).
>
> Signed-off-by: Martin Blumenstingl <martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>
> ---
> Documentation/devicetree/bindings/net/meson-dwmac.txt | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt b/Documentation/devicetree/bindings/net/meson-dwmac.txt
> index 89e62dd..f8bc540 100644
> --- a/Documentation/devicetree/bindings/net/meson-dwmac.txt
> +++ b/Documentation/devicetree/bindings/net/meson-dwmac.txt
> @@ -25,6 +25,20 @@ Required properties on Meson8b and newer:
> - "clkin0" - first parent clock of the internal mux
> - "clkin1" - second parent clock of the internal mux
>
> +Optional properties on Meson8b and newer:
> +- amlogic,tx-delay-ns: The internal RGMII TX clock delay (provided
> + by this driver) in nanoseconds. Allowed values
> + are: 0ns, 2ns, 4ns, 6ns.
> + This must be configured when the phy-mode is
> + "rgmii" (typically a value of 2ns is used in
> + this case).
> + When phy-mode is set to "rgmii-id" or
> + "rgmii-txid" the TX clock delay is already
> + provided by the PHY. In that case this
> + property should be set to 0ns (which disables
> + the TX clock delay in the MAC to prevent the
> + clock from going off because both PHY and MAC
> + are adding a delay).
What's the default? Why can't no property mean 0ns delay?
>
> Example for Meson6:
>
> --
> 2.10.2
>
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH v5 net-next 6/7] net: mvneta: Add network support for Armada 3700 SoC
From: Gregory CLEMENT @ 2016-11-30 21:42 UTC (permalink / raw)
To: David S. Miller, linux-kernel, netdev
Cc: Jisheng Zhang, Arnd Bergmann, Jason Cooper, Andrew Lunn,
Sebastian Hesselbarth, Gregory CLEMENT, Thomas Petazzoni,
linux-arm-kernel, Nadav Haklai, Marcin Wojtas, Dmitri Epshtein,
Yelena Krivosheev
In-Reply-To: <cover.0270f6d2413a709521fe2c8c17fbebea6f2e78d1.1480542157.git-series.gregory.clement@free-electrons.com>
From: Marcin Wojtas <mw@semihalf.com>
Armada 3700 is a new ARMv8 SoC from Marvell using same network controller
as older Armada 370/38x/XP. There are however some differences that
needed taking into account when adding support for it:
* open default MBUS window to 4GB of DRAM - Armada 3700 SoC's Mbus
configuration for network controller has to be done on two levels:
global and per-port. The first one is inherited from the
bootloader. The latter can be opened in a default way, leaving
arbitration to the bus controller. Hence filled mbus_dram_target_info
structure is not needed
* make per-CPU operation optional - Recent patches adding RSS and XPS
support for Armada 38x/XP enabled per-CPU operation of the controller
by default. Contrary to older SoC's Armada 3700 SoC's network
controller is not capable of per-CPU processing due to interrupt lines'
connectivity. This patch restores non-per-CPU operation, which is now
optional and depends on neta_armada3700 flag value in mvneta_port
structure. In order not to complicate the code, separate interrupt
subroutine is implemented.
For now, on the Armada 3700, RSS is disabled as the current
implementation depend on the per cpu interrupts.
[gregory.clement@free-electrons.com: extract from a larger patch, replace
some ifdef and port to net-next for v4.10]
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt | 7 +-
drivers/net/ethernet/marvell/Kconfig | 7 +-
drivers/net/ethernet/marvell/mvneta.c | 287 +++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
3 files changed, 214 insertions(+), 87 deletions(-)
diff --git a/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt b/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
index 73be8970815e..7aa840c8768d 100644
--- a/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
+++ b/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
@@ -1,7 +1,10 @@
-* Marvell Armada 370 / Armada XP Ethernet Controller (NETA)
+* Marvell Armada 370 / Armada XP / Armada 3700 Ethernet Controller (NETA)
Required properties:
-- compatible: "marvell,armada-370-neta" or "marvell,armada-xp-neta".
+- compatible: could be one of the followings
+ "marvell,armada-370-neta"
+ "marvell,armada-xp-neta"
+ "marvell,armada-3700-neta"
- reg: address and length of the register set for the device.
- interrupts: interrupt for the device
- phy: See ethernet.txt file in the same directory.
diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
index 2ccea9dd9248..3b8f11fe5e13 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -56,14 +56,15 @@ config MVNETA_BM_ENABLE
buffer management.
config MVNETA
- tristate "Marvell Armada 370/38x/XP network interface support"
- depends on PLAT_ORION || COMPILE_TEST
+ tristate "Marvell Armada 370/38x/XP/37xx network interface support"
+ depends on ARCH_MVEBU || COMPILE_TEST
depends on HAS_DMA
select MVMDIO
select FIXED_PHY
---help---
This driver supports the network interface units in the
- Marvell ARMADA XP, ARMADA 370 and ARMADA 38x SoC family.
+ Marvell ARMADA XP, ARMADA 370, ARMADA 38x and
+ ARMADA 37xx SoC family.
Note that this driver is distinct from the mv643xx_eth
driver, which should be used for the older Marvell SoCs
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 8ef03fb69bcd..ffc0c65068ea 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -397,6 +397,9 @@ struct mvneta_port {
spinlock_t lock;
bool is_stopped;
+ u32 cause_rx_tx;
+ struct napi_struct napi;
+
/* Core clock */
struct clk *clk;
/* AXI clock */
@@ -422,6 +425,9 @@ struct mvneta_port {
u64 ethtool_stats[ARRAY_SIZE(mvneta_statistics)];
u32 indir[MVNETA_RSS_LU_TABLE_SIZE];
+
+ /* Flags for special SoC configurations */
+ bool neta_armada3700;
u16 rx_offset_correction;
};
@@ -965,14 +971,9 @@ static int mvneta_mbus_io_win_set(struct mvneta_port *pp, u32 base, u32 wsize,
return 0;
}
-/* Assign and initialize pools for port. In case of fail
- * buffer manager will remain disabled for current port.
- */
-static int mvneta_bm_port_init(struct platform_device *pdev,
- struct mvneta_port *pp)
+static int mvneta_bm_port_mbus_init(struct mvneta_port *pp)
{
- struct device_node *dn = pdev->dev.of_node;
- u32 long_pool_id, short_pool_id, wsize;
+ u32 wsize;
u8 target, attr;
int err;
@@ -991,6 +992,25 @@ static int mvneta_bm_port_init(struct platform_device *pdev,
netdev_info(pp->dev, "fail to configure mbus window to BM\n");
return err;
}
+ return 0;
+}
+
+/* Assign and initialize pools for port. In case of fail
+ * buffer manager will remain disabled for current port.
+ */
+static int mvneta_bm_port_init(struct platform_device *pdev,
+ struct mvneta_port *pp)
+{
+ struct device_node *dn = pdev->dev.of_node;
+ u32 long_pool_id, short_pool_id;
+
+ if (!pp->neta_armada3700) {
+ int ret;
+
+ ret = mvneta_bm_port_mbus_init(pp);
+ if (ret)
+ return ret;
+ }
if (of_property_read_u32(dn, "bm,pool-long", &long_pool_id)) {
netdev_info(pp->dev, "missing long pool id\n");
@@ -1359,22 +1379,27 @@ static void mvneta_defaults_set(struct mvneta_port *pp)
for_each_present_cpu(cpu) {
int rxq_map = 0, txq_map = 0;
int rxq, txq;
+ if (!pp->neta_armada3700) {
+ for (rxq = 0; rxq < rxq_number; rxq++)
+ if ((rxq % max_cpu) == cpu)
+ rxq_map |= MVNETA_CPU_RXQ_ACCESS(rxq);
+
+ for (txq = 0; txq < txq_number; txq++)
+ if ((txq % max_cpu) == cpu)
+ txq_map |= MVNETA_CPU_TXQ_ACCESS(txq);
+
+ /* With only one TX queue we configure a special case
+ * which will allow to get all the irq on a single
+ * CPU
+ */
+ if (txq_number == 1)
+ txq_map = (cpu == pp->rxq_def) ?
+ MVNETA_CPU_TXQ_ACCESS(1) : 0;
- for (rxq = 0; rxq < rxq_number; rxq++)
- if ((rxq % max_cpu) == cpu)
- rxq_map |= MVNETA_CPU_RXQ_ACCESS(rxq);
-
- for (txq = 0; txq < txq_number; txq++)
- if ((txq % max_cpu) == cpu)
- txq_map |= MVNETA_CPU_TXQ_ACCESS(txq);
-
- /* With only one TX queue we configure a special case
- * which will allow to get all the irq on a single
- * CPU
- */
- if (txq_number == 1)
- txq_map = (cpu == pp->rxq_def) ?
- MVNETA_CPU_TXQ_ACCESS(1) : 0;
+ } else {
+ txq_map = MVNETA_CPU_TXQ_ACCESS_ALL_MASK;
+ rxq_map = MVNETA_CPU_RXQ_ACCESS_ALL_MASK;
+ }
mvreg_write(pp, MVNETA_CPU_MAP(cpu), rxq_map | txq_map);
}
@@ -2627,6 +2652,17 @@ static void mvneta_set_rx_mode(struct net_device *dev)
/* Interrupt handling - the callback for request_irq() */
static irqreturn_t mvneta_isr(int irq, void *dev_id)
{
+ struct mvneta_port *pp = (struct mvneta_port *)dev_id;
+
+ mvreg_write(pp, MVNETA_INTR_NEW_MASK, 0);
+ napi_schedule(&pp->napi);
+
+ return IRQ_HANDLED;
+}
+
+/* Interrupt handling - the callback for request_percpu_irq() */
+static irqreturn_t mvneta_percpu_isr(int irq, void *dev_id)
+{
struct mvneta_pcpu_port *port = (struct mvneta_pcpu_port *)dev_id;
disable_percpu_irq(port->pp->dev->irq);
@@ -2674,7 +2710,7 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
struct mvneta_pcpu_port *port = this_cpu_ptr(pp->ports);
if (!netif_running(pp->dev)) {
- napi_complete(&port->napi);
+ napi_complete(napi);
return rx_done;
}
@@ -2703,7 +2739,8 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
*/
rx_queue = fls(((cause_rx_tx >> 8) & 0xff));
- cause_rx_tx |= port->cause_rx_tx;
+ cause_rx_tx |= pp->neta_armada3700 ? pp->cause_rx_tx :
+ port->cause_rx_tx;
if (rx_queue) {
rx_queue = rx_queue - 1;
@@ -2717,11 +2754,27 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
if (budget > 0) {
cause_rx_tx = 0;
- napi_complete(&port->napi);
- enable_percpu_irq(pp->dev->irq, 0);
+ napi_complete(napi);
+
+ if (pp->neta_armada3700) {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ mvreg_write(pp, MVNETA_INTR_NEW_MASK,
+ MVNETA_RX_INTR_MASK(rxq_number) |
+ MVNETA_TX_INTR_MASK(txq_number) |
+ MVNETA_MISCINTR_INTR_MASK);
+ local_irq_restore(flags);
+ } else {
+ enable_percpu_irq(pp->dev->irq, 0);
+ }
}
- port->cause_rx_tx = cause_rx_tx;
+ if (pp->neta_armada3700)
+ pp->cause_rx_tx = cause_rx_tx;
+ else
+ port->cause_rx_tx = cause_rx_tx;
+
return rx_done;
}
@@ -2991,11 +3044,16 @@ static void mvneta_start_dev(struct mvneta_port *pp)
/* start the Rx/Tx activity */
mvneta_port_enable(pp);
- /* Enable polling on the port */
- for_each_online_cpu(cpu) {
- struct mvneta_pcpu_port *port = per_cpu_ptr(pp->ports, cpu);
+ if (!pp->neta_armada3700) {
+ /* Enable polling on the port */
+ for_each_online_cpu(cpu) {
+ struct mvneta_pcpu_port *port =
+ per_cpu_ptr(pp->ports, cpu);
- napi_enable(&port->napi);
+ napi_enable(&port->napi);
+ }
+ } else {
+ napi_enable(&pp->napi);
}
/* Unmask interrupts. It has to be done from each CPU */
@@ -3017,10 +3075,15 @@ static void mvneta_stop_dev(struct mvneta_port *pp)
phy_stop(ndev->phydev);
- for_each_online_cpu(cpu) {
- struct mvneta_pcpu_port *port = per_cpu_ptr(pp->ports, cpu);
+ if (!pp->neta_armada3700) {
+ for_each_online_cpu(cpu) {
+ struct mvneta_pcpu_port *port =
+ per_cpu_ptr(pp->ports, cpu);
- napi_disable(&port->napi);
+ napi_disable(&port->napi);
+ }
+ } else {
+ napi_disable(&pp->napi);
}
netif_carrier_off(pp->dev);
@@ -3430,31 +3493,37 @@ static int mvneta_open(struct net_device *dev)
goto err_cleanup_rxqs;
/* Connect to port interrupt line */
- ret = request_percpu_irq(pp->dev->irq, mvneta_isr,
- MVNETA_DRIVER_NAME, pp->ports);
+ if (pp->neta_armada3700)
+ ret = request_irq(pp->dev->irq, mvneta_isr, 0,
+ dev->name, pp);
+ else
+ ret = request_percpu_irq(pp->dev->irq, mvneta_percpu_isr,
+ dev->name, pp->ports);
if (ret) {
netdev_err(pp->dev, "cannot request irq %d\n", pp->dev->irq);
goto err_cleanup_txqs;
}
- /* Enable per-CPU interrupt on all the CPU to handle our RX
- * queue interrupts
- */
- on_each_cpu(mvneta_percpu_enable, pp, true);
+ if (!pp->neta_armada3700) {
+ /* Enable per-CPU interrupt on all the CPU to handle our RX
+ * queue interrupts
+ */
+ on_each_cpu(mvneta_percpu_enable, pp, true);
- pp->is_stopped = false;
- /* Register a CPU notifier to handle the case where our CPU
- * might be taken offline.
- */
- ret = cpuhp_state_add_instance_nocalls(online_hpstate,
- &pp->node_online);
- if (ret)
- goto err_free_irq;
+ pp->is_stopped = false;
+ /* Register a CPU notifier to handle the case where our CPU
+ * might be taken offline.
+ */
+ ret = cpuhp_state_add_instance_nocalls(online_hpstate,
+ &pp->node_online);
+ if (ret)
+ goto err_free_irq;
- ret = cpuhp_state_add_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
- &pp->node_dead);
- if (ret)
- goto err_free_online_hp;
+ ret = cpuhp_state_add_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
+ &pp->node_dead);
+ if (ret)
+ goto err_free_online_hp;
+ }
/* In default link is down */
netif_carrier_off(pp->dev);
@@ -3470,13 +3539,20 @@ static int mvneta_open(struct net_device *dev)
return 0;
err_free_dead_hp:
- cpuhp_state_remove_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
- &pp->node_dead);
+ if (!pp->neta_armada3700)
+ cpuhp_state_remove_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
+ &pp->node_dead);
err_free_online_hp:
- cpuhp_state_remove_instance_nocalls(online_hpstate, &pp->node_online);
+ if (!pp->neta_armada3700)
+ cpuhp_state_remove_instance_nocalls(online_hpstate,
+ &pp->node_online);
err_free_irq:
- on_each_cpu(mvneta_percpu_disable, pp, true);
- free_percpu_irq(pp->dev->irq, pp->ports);
+ if (pp->neta_armada3700) {
+ free_irq(pp->dev->irq, pp);
+ } else {
+ on_each_cpu(mvneta_percpu_disable, pp, true);
+ free_percpu_irq(pp->dev->irq, pp->ports);
+ }
err_cleanup_txqs:
mvneta_cleanup_txqs(pp);
err_cleanup_rxqs:
@@ -3489,23 +3565,30 @@ static int mvneta_stop(struct net_device *dev)
{
struct mvneta_port *pp = netdev_priv(dev);
- /* Inform that we are stopping so we don't want to setup the
- * driver for new CPUs in the notifiers. The code of the
- * notifier for CPU online is protected by the same spinlock,
- * so when we get the lock, the notifer work is done.
- */
- spin_lock(&pp->lock);
- pp->is_stopped = true;
- spin_unlock(&pp->lock);
+ if (!pp->neta_armada3700) {
+ /* Inform that we are stopping so we don't want to setup the
+ * driver for new CPUs in the notifiers. The code of the
+ * notifier for CPU online is protected by the same spinlock,
+ * so when we get the lock, the notifer work is done.
+ */
+ spin_lock(&pp->lock);
+ pp->is_stopped = true;
+ spin_unlock(&pp->lock);
- mvneta_stop_dev(pp);
- mvneta_mdio_remove(pp);
+ mvneta_stop_dev(pp);
+ mvneta_mdio_remove(pp);
cpuhp_state_remove_instance_nocalls(online_hpstate, &pp->node_online);
cpuhp_state_remove_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
&pp->node_dead);
- on_each_cpu(mvneta_percpu_disable, pp, true);
- free_percpu_irq(dev->irq, pp->ports);
+ on_each_cpu(mvneta_percpu_disable, pp, true);
+ free_percpu_irq(dev->irq, pp->ports);
+ } else {
+ mvneta_stop_dev(pp);
+ mvneta_mdio_remove(pp);
+ free_irq(dev->irq, pp);
+ }
+
mvneta_cleanup_rxqs(pp);
mvneta_cleanup_txqs(pp);
@@ -3784,6 +3867,11 @@ static int mvneta_ethtool_set_rxfh(struct net_device *dev, const u32 *indir,
const u8 *key, const u8 hfunc)
{
struct mvneta_port *pp = netdev_priv(dev);
+
+ /* Current code for Armada 3700 doesn't support RSS features yet */
+ if (pp->neta_armada3700)
+ return -EOPNOTSUPP;
+
/* We require at least one supported parameter to be changed
* and no change in any of the unsupported parameters
*/
@@ -3804,6 +3892,10 @@ static int mvneta_ethtool_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
{
struct mvneta_port *pp = netdev_priv(dev);
+ /* Current code for Armada 3700 doesn't support RSS features yet */
+ if (pp->neta_armada3700)
+ return -EOPNOTSUPP;
+
if (hfunc)
*hfunc = ETH_RSS_HASH_TOP;
@@ -3911,16 +4003,29 @@ static void mvneta_conf_mbus_windows(struct mvneta_port *pp,
win_enable = 0x3f;
win_protect = 0;
- for (i = 0; i < dram->num_cs; i++) {
- const struct mbus_dram_window *cs = dram->cs + i;
- mvreg_write(pp, MVNETA_WIN_BASE(i), (cs->base & 0xffff0000) |
- (cs->mbus_attr << 8) | dram->mbus_dram_target_id);
+ if (dram) {
+ for (i = 0; i < dram->num_cs; i++) {
+ const struct mbus_dram_window *cs = dram->cs + i;
+
+ mvreg_write(pp, MVNETA_WIN_BASE(i),
+ (cs->base & 0xffff0000) |
+ (cs->mbus_attr << 8) |
+ dram->mbus_dram_target_id);
- mvreg_write(pp, MVNETA_WIN_SIZE(i),
- (cs->size - 1) & 0xffff0000);
+ mvreg_write(pp, MVNETA_WIN_SIZE(i),
+ (cs->size - 1) & 0xffff0000);
- win_enable &= ~(1 << i);
- win_protect |= 3 << (2 * i);
+ win_enable &= ~(1 << i);
+ win_protect |= 3 << (2 * i);
+ }
+ } else {
+ /* For Armada3700 open default 4GB Mbus window, leaving
+ * arbitration of target/attribute to a different layer
+ * of configuration.
+ */
+ mvreg_write(pp, MVNETA_WIN_SIZE(0), 0xffff0000);
+ win_enable &= ~BIT(0);
+ win_protect = 3;
}
mvreg_write(pp, MVNETA_BASE_ADDR_ENABLE, win_enable);
@@ -4050,6 +4155,10 @@ static int mvneta_probe(struct platform_device *pdev)
pp->indir[0] = rxq_def;
+ /* Get special SoC configurations */
+ if (of_device_is_compatible(dn, "marvell,armada-3700-neta"))
+ pp->neta_armada3700 = true;
+
pp->clk = devm_clk_get(&pdev->dev, "core");
if (IS_ERR(pp->clk))
pp->clk = devm_clk_get(&pdev->dev, NULL);
@@ -4117,7 +4226,11 @@ static int mvneta_probe(struct platform_device *pdev)
pp->tx_csum_limit = tx_csum_limit;
dram_target_info = mv_mbus_dram_info();
- if (dram_target_info)
+ /* Armada3700 requires setting default configuration of Mbus
+ * windows, however without using filled mbus_dram_target_info
+ * structure.
+ */
+ if (dram_target_info || pp->neta_armada3700)
mvneta_conf_mbus_windows(pp, dram_target_info);
pp->tx_ring_size = MVNETA_MAX_TXD;
@@ -4150,11 +4263,20 @@ static int mvneta_probe(struct platform_device *pdev)
goto err_netdev;
}
- for_each_present_cpu(cpu) {
- struct mvneta_pcpu_port *port = per_cpu_ptr(pp->ports, cpu);
+ /* Armada3700 network controller does not support per-cpu
+ * operation, so only single NAPI should be initialized.
+ */
+ if (pp->neta_armada3700) {
+ netif_napi_add(dev, &pp->napi, mvneta_poll, NAPI_POLL_WEIGHT);
+ } else {
+ for_each_present_cpu(cpu) {
+ struct mvneta_pcpu_port *port =
+ per_cpu_ptr(pp->ports, cpu);
- netif_napi_add(dev, &port->napi, mvneta_poll, NAPI_POLL_WEIGHT);
- port->pp = pp;
+ netif_napi_add(dev, &port->napi, mvneta_poll,
+ NAPI_POLL_WEIGHT);
+ port->pp = pp;
+ }
}
dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
@@ -4239,6 +4361,7 @@ static int mvneta_remove(struct platform_device *pdev)
static const struct of_device_id mvneta_match[] = {
{ .compatible = "marvell,armada-370-neta" },
{ .compatible = "marvell,armada-xp-neta" },
+ { .compatible = "marvell,armada-3700-neta" },
{ }
};
MODULE_DEVICE_TABLE(of, mvneta_match);
--
git-series 0.8.10
^ permalink raw reply related
* [PATCH v5 net-next 3/7] net: mvneta: Use cacheable memory to store the rx buffer virtual address
From: Gregory CLEMENT @ 2016-11-30 21:42 UTC (permalink / raw)
To: David S. Miller, linux-kernel, netdev
Cc: Jisheng Zhang, Arnd Bergmann, Jason Cooper, Andrew Lunn,
Sebastian Hesselbarth, Gregory CLEMENT, Thomas Petazzoni,
linux-arm-kernel, Nadav Haklai, Marcin Wojtas, Dmitri Epshtein,
Yelena Krivosheev
In-Reply-To: <cover.0270f6d2413a709521fe2c8c17fbebea6f2e78d1.1480542157.git-series.gregory.clement@free-electrons.com>
Until now the virtual address of the received buffer were stored in the
cookie field of the rx descriptor. However, this field is 32-bits only
which prevents to use the driver on a 64-bits architecture.
With this patch the virtual address is stored in an array not shared with
the hardware (no more need to use the DMA API). Thanks to this, it is
possible to use cache contrary to the access of the rx descriptor member.
The change is done in the swbm path only because the hwbm uses the cookie
field, this also means that currently the hwbm is not usable in 64-bits.
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
drivers/net/ethernet/marvell/mvneta.c | 34 +++++++++++++++++++---------
1 file changed, 24 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index f5319c50f8d9..92b9af14c352 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -561,6 +561,9 @@ struct mvneta_rx_queue {
u32 pkts_coal;
u32 time_coal;
+ /* Virtual address of the RX buffer */
+ void **buf_virt_addr;
+
/* Virtual address of the RX DMA descriptors array */
struct mvneta_rx_desc *descs;
@@ -1573,10 +1576,14 @@ static void mvneta_tx_done_pkts_coal_set(struct mvneta_port *pp,
/* Handle rx descriptor fill by setting buf_cookie and buf_phys_addr */
static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
- u32 phys_addr, u32 cookie)
+ u32 phys_addr, void *virt_addr,
+ struct mvneta_rx_queue *rxq)
{
- rx_desc->buf_cookie = cookie;
+ int i;
+
rx_desc->buf_phys_addr = phys_addr;
+ i = rx_desc - rxq->descs;
+ rxq->buf_virt_addr[i] = virt_addr;
}
/* Decrement sent descriptors counter */
@@ -1781,7 +1788,8 @@ EXPORT_SYMBOL_GPL(mvneta_frag_free);
/* Refill processing for SW buffer management */
static int mvneta_rx_refill(struct mvneta_port *pp,
- struct mvneta_rx_desc *rx_desc)
+ struct mvneta_rx_desc *rx_desc,
+ struct mvneta_rx_queue *rxq)
{
dma_addr_t phys_addr;
@@ -1799,7 +1807,7 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
return -ENOMEM;
}
- mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
+ mvneta_rx_desc_fill(rx_desc, phys_addr, data, rxq);
return 0;
}
@@ -1861,7 +1869,7 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
for (i = 0; i < rxq->size; i++) {
struct mvneta_rx_desc *rx_desc = rxq->descs + i;
- void *data = (void *)rx_desc->buf_cookie;
+ void *data = rxq->buf_virt_addr[i];
dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
@@ -1894,12 +1902,13 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
unsigned char *data;
dma_addr_t phys_addr;
u32 rx_status, frag_size;
- int rx_bytes, err;
+ int rx_bytes, err, index;
rx_done++;
rx_status = rx_desc->status;
rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
- data = (unsigned char *)rx_desc->buf_cookie;
+ index = rx_desc - rxq->descs;
+ data = rxq->buf_virt_addr[index];
phys_addr = rx_desc->buf_phys_addr;
if (!mvneta_rxq_desc_is_first_last(rx_status) ||
@@ -1938,7 +1947,7 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
}
/* Refill processing */
- err = mvneta_rx_refill(pp, rx_desc);
+ err = mvneta_rx_refill(pp, rx_desc, rxq);
if (err) {
netdev_err(dev, "Linux processing - Can't refill\n");
rxq->missed++;
@@ -2020,7 +2029,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
rx_done++;
rx_status = rx_desc->status;
rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
- data = (unsigned char *)rx_desc->buf_cookie;
+ data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
phys_addr = rx_desc->buf_phys_addr;
pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
bm_pool = &pp->bm_priv->bm_pools[pool_id];
@@ -2716,7 +2725,7 @@ static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
for (i = 0; i < num; i++) {
memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
- if (mvneta_rx_refill(pp, rxq->descs + i) != 0) {
+ if (mvneta_rx_refill(pp, rxq->descs + i, rxq) != 0) {
netdev_err(pp->dev, "%s:rxq %d, %d of %d buffs filled\n",
__func__, rxq->id, i, num);
break;
@@ -3865,6 +3874,11 @@ static int mvneta_init(struct device *dev, struct mvneta_port *pp)
rxq->size = pp->rx_ring_size;
rxq->pkts_coal = MVNETA_RX_COAL_PKTS;
rxq->time_coal = MVNETA_RX_COAL_USEC;
+ rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
+ rxq->size * sizeof(void *),
+ GFP_KERNEL);
+ if (!rxq->buf_virt_addr)
+ return -ENOMEM;
}
return 0;
--
git-series 0.8.10
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox