* Re: [v2 Patch 2/2] mlx4: add dynamic LRO disable support
From: Stanislaw Gruszka @ 2010-06-15 10:14 UTC (permalink / raw)
To: Cong Wang
Cc: David Miller, netdev, nhorman, herbert.xu, bhutchings,
Ramkrishna.Vepa
In-Reply-To: <4C173B57.70601@redhat.com>
On Tue, 15 Jun 2010 16:35:35 +0800
Cong Wang <amwang@redhat.com> wrote:
> > BTW: seems default ethtool_op_set_flags introduce a bug on many
> > devices regarding ETH_FLAG_RXHASH. I think default should
> > be EOPNOTSUPP, and these few devices that actually support RXHASH
> > should have custom ethtool_ops->set_flags
>
> Hmm, you mean this?
>
> if (data & ETH_FLAG_RXHASH)
> + if (!ops->set_flags)
> + return -EOPNOTSUPP;
> ....
Not really, but I do not have good idea how patch with fix should
looks.
I dislike fact that we setup ->feature that are not in real supported by
particular device instead of returning EOPNOTSUPP. This actually include
both flags NETIF_F_LRO and NETIF_F_RXHASH.
Perhaps ethtool_op_set_flags should be removed and drivers should use
only custom version. In particular seems e1000e and sfc use this
function improperly and should have NULL as .set_flags.
I will think more about that and maybe cook some patches.
Stanislaw
^ permalink raw reply
* [PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
From: Eric Dumazet @ 2010-06-15 10:14 UTC (permalink / raw)
To: David Miller; +Cc: netdev, bhutchings, Nick Piggin
In-Reply-To: <20100614.231412.39191304.davem@davemloft.net>
Le lundi 14 juin 2010 à 23:14 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 14 Jun 2010 17:59:22 +0200
>
> > Uses a seqcount_t to synchronize stat producer and consumer, for packets
> > and bytes counter, now u64 types.
> >
> > (dropped counter being rarely used, stay a native "unsigned long" type)
> >
> > No noticeable performance impact on x86, as it only adds two increments
> > per frame. It might be more expensive on arches where smp_wmb() is not
> > free.
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Applied, but I suspect we might end up eventually needing to
> abstract this kind of technique in a common place so other
> spots can use it.
Here is the followup patch to abstract things a bit, before upcoming
conversions.
Thanks !
[PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
To properly implement 64bits network statistics on 32bit or 64bit hosts,
we provide one new type and four methods, to ease conversions.
Stats producer should use following template granted it already got an
exclusive access to counters (a previous lock is taken, or per cpu
data [used in a non preemptable context])
Let me repeat : stats producers must be serialized by other means before
using this template. Preemption must be disabled too.
u64_stats_update_begin(&stats->syncp);
stats->bytes += len;
stats->packets++;
u64_stats_update_end(&stats->syncp);
While a consumer should use following template to get consistent
snapshot :
u64 tbytes, tpackets;
unsigned int start;
do {
start = u64_stats_fetch_begin(&stats->syncp);
tbytes = stats->bytes;
tpackets = stats->packets;
} while (u64_stats_fetch_retry(&stats->lock, syncp));
This patch uses this infrastructure in net loopback driver, instead of
specific one added in commit 6b10de38f0ef (loopback: Implement 64bit
stats on 32bit arches)
Suggested by David Miller
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Nick Piggin <npiggin@suse.de>
---
drivers/net/loopback.c | 61 ++++++++----------------------------
include/linux/netdevice.h | 50 +++++++++++++++++++++++++++++
2 files changed, 65 insertions(+), 46 deletions(-)
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 09334f8..f20b156 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -60,51 +60,12 @@
#include <net/net_namespace.h>
struct pcpu_lstats {
- u64 packets;
- u64 bytes;
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
- seqcount_t seq;
-#endif
- unsigned long drops;
+ u64 packets;
+ u64 bytes;
+ struct u64_stats_sync syncp;
+ unsigned long drops;
};
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
-static void inline lstats_update_begin(struct pcpu_lstats *lstats)
-{
- write_seqcount_begin(&lstats->seq);
-}
-static void inline lstats_update_end(struct pcpu_lstats *lstats)
-{
- write_seqcount_end(&lstats->seq);
-}
-static void inline lstats_fetch_and_add(u64 *packets, u64 *bytes, const struct pcpu_lstats *lstats)
-{
- u64 tpackets, tbytes;
- unsigned int seq;
-
- do {
- seq = read_seqcount_begin(&lstats->seq);
- tpackets = lstats->packets;
- tbytes = lstats->bytes;
- } while (read_seqcount_retry(&lstats->seq, seq));
-
- *packets += tpackets;
- *bytes += tbytes;
-}
-#else
-static void inline lstats_update_begin(struct pcpu_lstats *lstats)
-{
-}
-static void inline lstats_update_end(struct pcpu_lstats *lstats)
-{
-}
-static void inline lstats_fetch_and_add(u64 *packets, u64 *bytes, const struct pcpu_lstats *lstats)
-{
- *packets += lstats->packets;
- *bytes += lstats->bytes;
-}
-#endif
-
/*
* The higher levels take care of making this non-reentrant (it's
* called with bh's disabled).
@@ -126,10 +87,10 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
len = skb->len;
if (likely(netif_rx(skb) == NET_RX_SUCCESS)) {
- lstats_update_begin(lb_stats);
+ u64_stats_update_begin(&lb_stats->syncp);
lb_stats->bytes += len;
lb_stats->packets++;
- lstats_update_end(lb_stats);
+ u64_stats_update_end(&lb_stats->syncp);
} else
lb_stats->drops++;
@@ -148,10 +109,18 @@ static struct rtnl_link_stats64 *loopback_get_stats64(struct net_device *dev)
pcpu_lstats = (void __percpu __force *)dev->ml_priv;
for_each_possible_cpu(i) {
const struct pcpu_lstats *lb_stats;
+ u64 tbytes, tpackets;
+ unsigned int start;
lb_stats = per_cpu_ptr(pcpu_lstats, i);
- lstats_fetch_and_add(&packets, &bytes, lb_stats);
+ do {
+ start = u64_stats_fetch_begin(&lb_stats->syncp);
+ tbytes = lb_stats->bytes;
+ tpackets = lb_stats->packets;
+ } while (u64_stats_fetch_retry(&lb_stats->syncp, start));
drops += lb_stats->drops;
+ bytes += tbytes;
+ packets += tpackets;
}
stats->rx_packets = packets;
stats->tx_packets = packets;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4fbccc5..dd1d93d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -174,6 +174,56 @@ static inline bool dev_xmit_complete(int rc)
#define NET_DEVICE_STATS_DEFINE(name) unsigned long pad_ ## name, name
#endif
+#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+struct u64_stats_sync {
+ seqcount_t seq;
+};
+
+static void inline u64_stats_update_begin(struct u64_stats_sync *syncp)
+{
+ write_seqcount_begin(&syncp->seq);
+}
+
+static void inline u64_stats_update_end(struct u64_stats_sync *syncp)
+{
+ write_seqcount_end(&syncp->seq);
+}
+
+static unsigned int inline u64_stats_fetch_begin(const struct u64_stats_sync *syncp)
+{
+ return read_seqcount_begin(&syncp->seq);
+}
+
+static bool inline u64_stats_fetch_retry(const struct u64_stats_sync *syncp,
+ unsigned int start)
+{
+ return read_seqcount_retry(&syncp->seq, start);
+}
+
+#else
+struct u64_stats_sync {
+};
+
+static void inline u64_stats_update_begin(struct u64_stats_sync *syncp)
+{
+}
+
+static void inline u64_stats_update_end(struct u64_stats_sync *syncp)
+{
+}
+
+static unsigned int inline u64_stats_fetch_begin(const struct u64_stats_sync *syncp)
+{
+ return 0;
+}
+
+static bool inline u64_stats_fetch_retry(const struct u64_stats_sync *syncp,
+ unsigned int start)
+{
+ return false;
+}
+#endif
+
struct net_device_stats {
NET_DEVICE_STATS_DEFINE(rx_packets);
NET_DEVICE_STATS_DEFINE(tx_packets);
^ permalink raw reply related
* Re: [PATCH 8/8] bridge: Fix netpoll support
From: Cong Wang @ 2010-06-15 10:28 UTC (permalink / raw)
To: Herbert Xu
Cc: Michael S. Tsirkin, Qianfeng Zhang, David S. Miller, netdev,
Stephen Hemminger, Matt Mackall, Paul E. McKenney
In-Reply-To: <E1OMtjm-0006PK-D6@gondolin.me.apana.org.au>
On 06/11/10 10:12, Herbert Xu wrote:
> bridge: Fix netpoll support
>
> There are multiple problems with the newly added netpoll support:
>
> 1) Use-after-free on each netpoll packet.
> 2) Invoking unsafe code on netpoll/IRQ path.
> 3) Breaks when netpoll is enabled on the underlying device.
>
> This patch fixes all of these problems. In particular, we now
> allocate proper netpoll structures for each underlying device.
>
> We only allow netpoll to be enabled on the bridge when all the
> devices underneath it support netpoll. Once it is enabled, we
> do not allow non-netpoll devices to join the bridge (until netpoll
> is disabled again).
>
This is a good idea!
> This allows us to do away with the npinfo juggling that caused
> problem number 1.
>
> Incidentally this patch fixes number 2 by bypassing unsafe code
> such as multicast snooping and netfilter.
Not sure if I understand problem 2) and 3), this patch is not easy
to review. So, what's the point of adding ->np to struct net_bridge_port?
since we already have p->dev->npinfo->netpoll?
Thanks!
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
From: Nick Piggin @ 2010-06-15 10:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, bhutchings
In-Reply-To: <1276596856.2541.84.camel@edumazet-laptop>
On Tue, Jun 15, 2010 at 12:14:16PM +0200, Eric Dumazet wrote:
> Here is the followup patch to abstract things a bit, before upcoming
> conversions.
>
> Thanks !
>
> [PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
>
> To properly implement 64bits network statistics on 32bit or 64bit hosts,
> we provide one new type and four methods, to ease conversions.
>
> Stats producer should use following template granted it already got an
> exclusive access to counters (a previous lock is taken, or per cpu
> data [used in a non preemptable context])
>
> Let me repeat : stats producers must be serialized by other means before
> using this template. Preemption must be disabled too.
>
> u64_stats_update_begin(&stats->syncp);
> stats->bytes += len;
> stats->packets++;
> u64_stats_update_end(&stats->syncp);
>
> While a consumer should use following template to get consistent
> snapshot :
>
> u64 tbytes, tpackets;
> unsigned int start;
>
> do {
> start = u64_stats_fetch_begin(&stats->syncp);
> tbytes = stats->bytes;
> tpackets = stats->packets;
> } while (u64_stats_fetch_retry(&stats->lock, syncp));
>
> This patch uses this infrastructure in net loopback driver, instead of
> specific one added in commit 6b10de38f0ef (loopback: Implement 64bit
> stats on 32bit arches)
>
> Suggested by David Miller
Cool, I don't mind this, but perhaps could you add some comments
because it _will_ either be misused or copied and misused elsewhere :)
Callers must:
- write side must ensure mutual exclusion (even if it was previously OK
to have lost updates on the writer side, the seqlock will explodde if
it is taken concurrently for write)
- write side must not sleep
- readside and writeside must have local-CPU exclusion from one another;
preempt, irq, bh as appropriate
- will only protect 64-bit sizes from tearing -- eg updating 2 different
stats under the same write side will not ensure they are both seen in
the same read side
But I do like the minimal design.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Nick Piggin <npiggin@suse.de>
> ---
> drivers/net/loopback.c | 61 ++++++++----------------------------
> include/linux/netdevice.h | 50 +++++++++++++++++++++++++++++
> 2 files changed, 65 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index 09334f8..f20b156 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -60,51 +60,12 @@
> #include <net/net_namespace.h>
>
> struct pcpu_lstats {
> - u64 packets;
> - u64 bytes;
> -#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
> - seqcount_t seq;
> -#endif
> - unsigned long drops;
> + u64 packets;
> + u64 bytes;
> + struct u64_stats_sync syncp;
> + unsigned long drops;
> };
>
> -#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
> -static void inline lstats_update_begin(struct pcpu_lstats *lstats)
> -{
> - write_seqcount_begin(&lstats->seq);
> -}
> -static void inline lstats_update_end(struct pcpu_lstats *lstats)
> -{
> - write_seqcount_end(&lstats->seq);
> -}
> -static void inline lstats_fetch_and_add(u64 *packets, u64 *bytes, const struct pcpu_lstats *lstats)
> -{
> - u64 tpackets, tbytes;
> - unsigned int seq;
> -
> - do {
> - seq = read_seqcount_begin(&lstats->seq);
> - tpackets = lstats->packets;
> - tbytes = lstats->bytes;
> - } while (read_seqcount_retry(&lstats->seq, seq));
> -
> - *packets += tpackets;
> - *bytes += tbytes;
> -}
> -#else
> -static void inline lstats_update_begin(struct pcpu_lstats *lstats)
> -{
> -}
> -static void inline lstats_update_end(struct pcpu_lstats *lstats)
> -{
> -}
> -static void inline lstats_fetch_and_add(u64 *packets, u64 *bytes, const struct pcpu_lstats *lstats)
> -{
> - *packets += lstats->packets;
> - *bytes += lstats->bytes;
> -}
> -#endif
> -
> /*
> * The higher levels take care of making this non-reentrant (it's
> * called with bh's disabled).
> @@ -126,10 +87,10 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
>
> len = skb->len;
> if (likely(netif_rx(skb) == NET_RX_SUCCESS)) {
> - lstats_update_begin(lb_stats);
> + u64_stats_update_begin(&lb_stats->syncp);
> lb_stats->bytes += len;
> lb_stats->packets++;
> - lstats_update_end(lb_stats);
> + u64_stats_update_end(&lb_stats->syncp);
> } else
> lb_stats->drops++;
>
> @@ -148,10 +109,18 @@ static struct rtnl_link_stats64 *loopback_get_stats64(struct net_device *dev)
> pcpu_lstats = (void __percpu __force *)dev->ml_priv;
> for_each_possible_cpu(i) {
> const struct pcpu_lstats *lb_stats;
> + u64 tbytes, tpackets;
> + unsigned int start;
>
> lb_stats = per_cpu_ptr(pcpu_lstats, i);
> - lstats_fetch_and_add(&packets, &bytes, lb_stats);
> + do {
> + start = u64_stats_fetch_begin(&lb_stats->syncp);
> + tbytes = lb_stats->bytes;
> + tpackets = lb_stats->packets;
> + } while (u64_stats_fetch_retry(&lb_stats->syncp, start));
> drops += lb_stats->drops;
> + bytes += tbytes;
> + packets += tpackets;
> }
> stats->rx_packets = packets;
> stats->tx_packets = packets;
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4fbccc5..dd1d93d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -174,6 +174,56 @@ static inline bool dev_xmit_complete(int rc)
> #define NET_DEVICE_STATS_DEFINE(name) unsigned long pad_ ## name, name
> #endif
>
> +#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
> +struct u64_stats_sync {
> + seqcount_t seq;
> +};
> +
> +static void inline u64_stats_update_begin(struct u64_stats_sync *syncp)
> +{
> + write_seqcount_begin(&syncp->seq);
> +}
> +
> +static void inline u64_stats_update_end(struct u64_stats_sync *syncp)
> +{
> + write_seqcount_end(&syncp->seq);
> +}
> +
> +static unsigned int inline u64_stats_fetch_begin(const struct u64_stats_sync *syncp)
> +{
> + return read_seqcount_begin(&syncp->seq);
> +}
> +
> +static bool inline u64_stats_fetch_retry(const struct u64_stats_sync *syncp,
> + unsigned int start)
> +{
> + return read_seqcount_retry(&syncp->seq, start);
> +}
> +
> +#else
> +struct u64_stats_sync {
> +};
> +
> +static void inline u64_stats_update_begin(struct u64_stats_sync *syncp)
> +{
> +}
> +
> +static void inline u64_stats_update_end(struct u64_stats_sync *syncp)
> +{
> +}
> +
> +static unsigned int inline u64_stats_fetch_begin(const struct u64_stats_sync *syncp)
> +{
> + return 0;
> +}
> +
> +static bool inline u64_stats_fetch_retry(const struct u64_stats_sync *syncp,
> + unsigned int start)
> +{
> + return false;
> +}
> +#endif
> +
> struct net_device_stats {
> NET_DEVICE_STATS_DEFINE(rx_packets);
> NET_DEVICE_STATS_DEFINE(tx_packets);
>
^ permalink raw reply
* [PATCH net-next-2.6] bridge: 64bit rx/tx counters
From: Eric Dumazet @ 2010-06-15 10:39 UTC (permalink / raw)
To: David Miller, Stephen Hemminger; +Cc: netdev, bhutchings, Nick Piggin
In-Reply-To: <1276596856.2541.84.camel@edumazet-laptop>
Note : should be applied after "net: Introduce
u64_stats_sync infrastructure", if accepted.
Thanks
[PATCH net-next-2.6] bridge: 64bit rx/tx counters
Use u64_stats_sync infrastructure to provide 64bit rx/tx
counters even on 32bit hosts.
It is safe to use a single u64_stats_sync for rx and tx,
because BH is disabled on both, and we use per_cpu data.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/bridge/br_device.c | 24 +++++++++++++++---------
net/bridge/br_input.c | 2 ++
net/bridge/br_private.h | 9 +++++----
3 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index b898364..d7bfe31 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -38,8 +38,10 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
}
#endif
+ u64_stats_update_begin(&brstats->syncp);
brstats->tx_packets++;
brstats->tx_bytes += skb->len;
+ u64_stats_update_end(&brstats->syncp);
BR_INPUT_SKB_CB(skb)->brdev = dev;
@@ -92,21 +94,25 @@ static int br_dev_stop(struct net_device *dev)
return 0;
}
-static struct net_device_stats *br_get_stats(struct net_device *dev)
+static struct rtnl_link_stats64 *br_get_stats64(struct net_device *dev)
{
struct net_bridge *br = netdev_priv(dev);
- struct net_device_stats *stats = &dev->stats;
- struct br_cpu_netstats sum = { 0 };
+ struct rtnl_link_stats64 *stats = &dev->stats64;
+ struct br_cpu_netstats tmp, sum = { 0 };
unsigned int cpu;
for_each_possible_cpu(cpu) {
+ unsigned int start;
const struct br_cpu_netstats *bstats
= per_cpu_ptr(br->stats, cpu);
-
- sum.tx_bytes += bstats->tx_bytes;
- sum.tx_packets += bstats->tx_packets;
- sum.rx_bytes += bstats->rx_bytes;
- sum.rx_packets += bstats->rx_packets;
+ do {
+ start = u64_stats_fetch_begin(&bstats->syncp);
+ memcpy(&tmp, bstats, sizeof(tmp));
+ } while (u64_stats_fetch_retry(&bstats->syncp, start));
+ sum.tx_bytes += tmp.tx_bytes;
+ sum.tx_packets += tmp.tx_packets;
+ sum.rx_bytes += tmp.rx_bytes;
+ sum.rx_packets += tmp.rx_packets;
}
stats->tx_bytes = sum.tx_bytes;
@@ -288,7 +294,7 @@ static const struct net_device_ops br_netdev_ops = {
.ndo_open = br_dev_open,
.ndo_stop = br_dev_stop,
.ndo_start_xmit = br_dev_xmit,
- .ndo_get_stats = br_get_stats,
+ .ndo_get_stats64 = br_get_stats64,
.ndo_set_mac_address = br_set_mac_address,
.ndo_set_multicast_list = br_dev_set_multicast_list,
.ndo_change_mtu = br_change_mtu,
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 99647d8..86d357b 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -27,8 +27,10 @@ static int br_pass_frame_up(struct sk_buff *skb)
struct net_bridge *br = netdev_priv(brdev);
struct br_cpu_netstats *brstats = this_cpu_ptr(br->stats);
+ u64_stats_update_begin(&brstats->syncp);
brstats->rx_packets++;
brstats->rx_bytes += skb->len;
+ u64_stats_update_end(&brstats->syncp);
indev = skb->dev;
skb->dev = brdev;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index c83519b..f078c54 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -146,10 +146,11 @@ struct net_bridge_port
};
struct br_cpu_netstats {
- unsigned long rx_packets;
- unsigned long rx_bytes;
- unsigned long tx_packets;
- unsigned long tx_bytes;
+ u64 rx_packets;
+ u64 rx_bytes;
+ u64 tx_packets;
+ u64 tx_bytes;
+ struct u64_stats_sync syncp;
};
struct net_bridge
^ permalink raw reply related
* Re: [PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
From: Eric Dumazet @ 2010-06-15 10:43 UTC (permalink / raw)
To: Nick Piggin; +Cc: David Miller, netdev, bhutchings
In-Reply-To: <20100615102541.GH6138@laptop>
Le mardi 15 juin 2010 à 20:25 +1000, Nick Piggin a écrit :
> On Tue, Jun 15, 2010 at 12:14:16PM +0200, Eric Dumazet wrote:
> > Here is the followup patch to abstract things a bit, before upcoming
> > conversions.
> >
> > Thanks !
> >
> > [PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
> >
> > To properly implement 64bits network statistics on 32bit or 64bit hosts,
> > we provide one new type and four methods, to ease conversions.
> >
> > Stats producer should use following template granted it already got an
> > exclusive access to counters (a previous lock is taken, or per cpu
> > data [used in a non preemptable context])
> >
> > Let me repeat : stats producers must be serialized by other means before
> > using this template. Preemption must be disabled too.
> >
> > u64_stats_update_begin(&stats->syncp);
> > stats->bytes += len;
> > stats->packets++;
> > u64_stats_update_end(&stats->syncp);
> >
> > While a consumer should use following template to get consistent
> > snapshot :
> >
> > u64 tbytes, tpackets;
> > unsigned int start;
> >
> > do {
> > start = u64_stats_fetch_begin(&stats->syncp);
> > tbytes = stats->bytes;
> > tpackets = stats->packets;
> > } while (u64_stats_fetch_retry(&stats->lock, syncp));
> >
> > This patch uses this infrastructure in net loopback driver, instead of
> > specific one added in commit 6b10de38f0ef (loopback: Implement 64bit
> > stats on 32bit arches)
> >
> > Suggested by David Miller
>
> Cool, I don't mind this, but perhaps could you add some comments
> because it _will_ either be misused or copied and misused elsewhere :)
>
> Callers must:
> - write side must ensure mutual exclusion (even if it was previously OK
> to have lost updates on the writer side, the seqlock will explodde if
> it is taken concurrently for write)
> - write side must not sleep
> - readside and writeside must have local-CPU exclusion from one another;
> preempt, irq, bh as appropriate
> - will only protect 64-bit sizes from tearing -- eg updating 2 different
> stats under the same write side will not ensure they are both seen in
> the same read side
Hmm, I am not sure I got this one, could you please give me a buggy
example ?
>
> But I do like the minimal design.
Thanks !
I'll submit a v2 patch after my lunch to add all your comments, because
all clarifications are indeed very very welcomed !
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
From: Nick Piggin @ 2010-06-15 11:04 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, bhutchings
In-Reply-To: <1276598605.2541.96.camel@edumazet-laptop>
On Tue, Jun 15, 2010 at 12:43:25PM +0200, Eric Dumazet wrote:
> Le mardi 15 juin 2010 à 20:25 +1000, Nick Piggin a écrit :
> > On Tue, Jun 15, 2010 at 12:14:16PM +0200, Eric Dumazet wrote:
> > > Here is the followup patch to abstract things a bit, before upcoming
> > > conversions.
> > >
> > > Thanks !
> > >
> > > [PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
> > >
> > > To properly implement 64bits network statistics on 32bit or 64bit hosts,
> > > we provide one new type and four methods, to ease conversions.
> > >
> > > Stats producer should use following template granted it already got an
> > > exclusive access to counters (a previous lock is taken, or per cpu
> > > data [used in a non preemptable context])
> > >
> > > Let me repeat : stats producers must be serialized by other means before
> > > using this template. Preemption must be disabled too.
> > >
> > > u64_stats_update_begin(&stats->syncp);
> > > stats->bytes += len;
> > > stats->packets++;
> > > u64_stats_update_end(&stats->syncp);
> > >
> > > While a consumer should use following template to get consistent
> > > snapshot :
> > >
> > > u64 tbytes, tpackets;
> > > unsigned int start;
> > >
> > > do {
> > > start = u64_stats_fetch_begin(&stats->syncp);
> > > tbytes = stats->bytes;
> > > tpackets = stats->packets;
> > > } while (u64_stats_fetch_retry(&stats->lock, syncp));
> > >
> > > This patch uses this infrastructure in net loopback driver, instead of
> > > specific one added in commit 6b10de38f0ef (loopback: Implement 64bit
> > > stats on 32bit arches)
> > >
> > > Suggested by David Miller
> >
> > Cool, I don't mind this, but perhaps could you add some comments
> > because it _will_ either be misused or copied and misused elsewhere :)
> >
> > Callers must:
> > - write side must ensure mutual exclusion (even if it was previously OK
> > to have lost updates on the writer side, the seqlock will explodde if
> > it is taken concurrently for write)
> > - write side must not sleep
> > - readside and writeside must have local-CPU exclusion from one another;
> > preempt, irq, bh as appropriate
> > - will only protect 64-bit sizes from tearing -- eg updating 2 different
> > stats under the same write side will not ensure they are both seen in
> > the same read side
>
> Hmm, I am not sure I got this one, could you please give me a buggy
> example ?
So if you have a regular seqlock, the sequence:
write_seqcount_begin
stat1++
stat2--
write_seqcount_end
do read_seqcount_begin
sum = stat1+stat2;
while (read_seqcount_retry)
BUG_ON(sum != 0);
This code is OK. But if it is using your stat sync, then it is buggy.
This is obvious to you of course, but someone who doesn't consider
the implementation might get caught out.
I guess all my other points are properties of seqcount code itself,
but they are not documented really well with the seqlock API
unfortunately.
> > But I do like the minimal design.
>
> Thanks !
>
> I'll submit a v2 patch after my lunch to add all your comments, because
> all clarifications are indeed very very welcomed !
Thanks!
^ permalink raw reply
* [PATCH 0/2] net,man - IP_NODEFRAG option for IPv4 socket
From: Jiri Olsa @ 2010-06-15 11:07 UTC (permalink / raw)
To: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
jengelh-nopoi9nDyk+ELgA04lAiVw, kaber-dcUjhNyLwpNeoWH0uzbU5w
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
netfilter-devel-u79uwXL29TY76Z2rM5mHXA,
linux-man-u79uwXL29TY76Z2rM5mHXA
hi,
this series contains patch for IP_NODEFRAG option for RAW sockets,
and changes for man pages.
I'm sending this together, hope it's ok.
1/2 - net - IP_NODEFRAG option for IPv4 socket
2/2 - man - IP_NODEFRAG option for IPv4 socket
wbr,
jirka
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH 1/2] net - IP_NODEFRAG option for IPv4 socket
From: Jiri Olsa @ 2010-06-15 11:07 UTC (permalink / raw)
To: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
jengelh-nopoi9nDyk+ELgA04lAiVw, kaber-dcUjhNyLwpNeoWH0uzbU5w
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
netfilter-devel-u79uwXL29TY76Z2rM5mHXA,
linux-man-u79uwXL29TY76Z2rM5mHXA, Jiri Olsa
In-Reply-To: <1276600052-16499-1-git-send-email-jolsa-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
hi,
this patch is implementing IP_NODEFRAG option for IPv4 socket.
The reason is, there's no other way to send out the packet with user
customized header of the reassembly part.
wbr,
jirka
Signed-off-by: Jiri Olsa <jolsa-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
include/linux/in.h | 1 +
include/net/inet_sock.h | 3 ++-
net/ipv4/af_inet.c | 2 ++
net/ipv4/ip_sockglue.c | 9 ++++++++-
net/ipv4/netfilter/nf_defrag_ipv4.c | 5 +++++
5 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/in.h b/include/linux/in.h
index 583c76f..41d88a4 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -85,6 +85,7 @@ struct in_addr {
#define IP_RECVORIGDSTADDR IP_ORIGDSTADDR
#define IP_MINTTL 21
+#define IP_NODEFRAG 22
/* IP_MTU_DISCOVER values */
#define IP_PMTUDISC_DONT 0 /* Never send DF frames */
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 1653de5..1989cfd 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -137,7 +137,8 @@ struct inet_sock {
hdrincl:1,
mc_loop:1,
transparent:1,
- mc_all:1;
+ mc_all:1,
+ nodefrag:1;
int mc_index;
__be32 mc_addr;
struct ip_mc_socklist *mc_list;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 551ce56..84d2c8e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -355,6 +355,8 @@ lookup_protocol:
inet = inet_sk(sk);
inet->is_icsk = (INET_PROTOSW_ICSK & answer_flags) != 0;
+ inet->nodefrag = 0;
+
if (SOCK_RAW == sock->type) {
inet->inet_num = protocol;
if (IPPROTO_RAW == protocol)
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index ce23178..d8196e1 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -449,7 +449,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
(1<<IP_MTU_DISCOVER) | (1<<IP_RECVERR) |
(1<<IP_ROUTER_ALERT) | (1<<IP_FREEBIND) |
(1<<IP_PASSSEC) | (1<<IP_TRANSPARENT) |
- (1<<IP_MINTTL))) ||
+ (1<<IP_MINTTL) | (1<<IP_NODEFRAG))) ||
optname == IP_MULTICAST_TTL ||
optname == IP_MULTICAST_ALL ||
optname == IP_MULTICAST_LOOP ||
@@ -572,6 +572,13 @@ static int do_ip_setsockopt(struct sock *sk, int level,
}
inet->hdrincl = val ? 1 : 0;
break;
+ case IP_NODEFRAG:
+ if (sk->sk_type != SOCK_RAW) {
+ err = -ENOPROTOOPT;
+ break;
+ }
+ inet->nodefrag = val ? 1 : 0;
+ break;
case IP_MTU_DISCOVER:
if (val < IP_PMTUDISC_DONT || val > IP_PMTUDISC_PROBE)
goto e_inval;
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index cb763ae..eab8de3 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -66,6 +66,11 @@ static unsigned int ipv4_conntrack_defrag(unsigned int hooknum,
const struct net_device *out,
int (*okfn)(struct sk_buff *))
{
+ struct inet_sock *inet = inet_sk(skb->sk);
+
+ if (inet && inet->nodefrag)
+ return NF_ACCEPT;
+
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
#if !defined(CONFIG_NF_NAT) && !defined(CONFIG_NF_NAT_MODULE)
/* Previously seen (loopback)? Ignore. Do this before
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 2/2] man - IP_NODEFRAG option for IPv4 socket
From: Jiri Olsa @ 2010-06-15 11:07 UTC (permalink / raw)
To: eric.dumazet, jengelh, kaber
Cc: netdev, netfilter-devel, linux-man, Jiri Olsa
In-Reply-To: <1276600052-16499-1-git-send-email-jolsa@redhat.com>
hi,
this patch adds description for IP_NODEFRAG option for IPv4 socket.
wbr,
jirka
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
---
man7/ip.7 | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/man7/ip.7 b/man7/ip.7
index 34f7e80..12c0d5b 100644
--- a/man7/ip.7
+++ b/man7/ip.7
@@ -705,6 +705,12 @@ socket option (see
.BR IP_TTL " (since Linux 1.0)"
Set or retrieve the current time-to-live field that is used in every packet
sent from this socket.
+.TP
+.BR IP_NODEFRAG " (since Linux 2.6)"
+If enabled, the reassembly of outgoing packets is disabled in the netfilter layer.
+Only valid for
+.B SOCK_RAW
+sockets.
.\" FIXME Document IP_XFRM_POLICY
.\" Since Linux 2.5.48
.\" Needs CAP_NET_ADMIN
^ permalink raw reply related
* Re: ftp on udp and tcp .
From: Ben Hutchings @ 2010-06-15 11:50 UTC (permalink / raw)
To: ratheesh k; +Cc: netdev, linux-net
In-Reply-To: <AANLkTimz7A03_UwOcjgrixqGH9MW8i1dAH6ULfR-sS5z@mail.gmail.com>
On Tue, 2010-06-15 at 12:50 +0530, ratheesh k wrote:
> Hi ,
>
> what is real advantage of UDP over TCP / TCP over UDP . If i
> put the question in a different way - if run ftp application on UDP
> and TCP , what is the real difference / advantage ??
TCP handles segmentation and bandwidth/congestion management. It
recovers from packet loss and reordering. Recent implementations are
also quite resilient to address spoofing. The downsides are higher
latency (potentially very high when recovering from packet loss), higher
setup cost (3-way handshake), and lack of multicast support.
UDP does none of that. It gives you flexibility and control, but the
downside is you have to handle all of those problems yourself.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* udp: Fix bogus UFO packet generation
From: Herbert Xu @ 2010-06-15 11:52 UTC (permalink / raw)
To: David S. Miller, netdev
Hi:
udp: Fix bogus UFO packet generation
It has been reported that the new UFO software fallback path
fails under certain conditions with NFS. I tracked the problem
down to the generation of UFO packets that are smaller than the
MTU. The software fallback path simply discards these packets.
This patch fixes the problem by not generating such packets on
the UFO path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9a4a6c9..041d41d 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -873,8 +873,10 @@ int ip_append_data(struct sock *sk,
!exthdrlen)
csummode = CHECKSUM_PARTIAL;
+ skb = skb_peek_tail(&sk->sk_write_queue);
+
inet->cork.length += length;
- if (((length> mtu) || !skb_queue_empty(&sk->sk_write_queue)) &&
+ if (((length > mtu) || (skb && skb_is_gso(skb))) &&
(sk->sk_protocol == IPPROTO_UDP) &&
(rt->u.dst.dev->features & NETIF_F_UFO)) {
err = ip_ufo_append_data(sk, getfrag, from, length, hh_len,
@@ -892,7 +894,7 @@ int ip_append_data(struct sock *sk,
* adding appropriate IP header.
*/
- if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL)
+ if (!skb)
goto alloc_new_skb;
while (length > 0) {
@@ -1121,7 +1123,8 @@ ssize_t ip_append_page(struct sock *sk, struct page *page,
return -EINVAL;
inet->cork.length += size;
- if ((sk->sk_protocol == IPPROTO_UDP) &&
+ if ((size + skb->len > mtu) &&
+ (sk->sk_protocol == IPPROTO_UDP) &&
(rt->u.dst.dev->features & NETIF_F_UFO)) {
skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related
* Re: [PATCH v5] netfilter: Xtables: idletimer target implementation
From: Patrick McHardy @ 2010-06-15 11:54 UTC (permalink / raw)
To: Luciano Coelho; +Cc: netfilter-devel, netdev, jengelh, Timo Teras
In-Reply-To: <1276593679-9576-1-git-send-email-luciano.coelho@nokia.com>
Two final comments:
Luciano Coelho wrote:
> diff --git a/include/linux/netfilter/xt_IDLETIMER.h b/include/linux/netfilter/xt_IDLETIMER.h
> new file mode 100644
> index 0000000..3e1aa1b
> --- /dev/null
> +++ b/include/linux/netfilter/xt_IDLETIMER.h
This file needs to be added to Kbuild to make sure it gets installed
during "make headers_install".
> +static void idletimer_tg_destroy(const struct xt_tgdtor_param *par)
> +{
> + const struct idletimer_tg_info *info = par->targinfo;
> +
> + pr_debug("destroy targinfo %s\n", info->label);
> +
> + mutex_lock(&list_mutex);
> + if (!info->timer) {
>
This doesn't seem to be possible.
> + mutex_unlock(&list_mutex);
> + return;
> + }
> +
> + if (--info->timer->refcnt == 0) {
> + pr_debug("deleting timer %s\n", info->label);
> +
> + list_del(&info->timer->entry);
> + del_timer_sync(&info->timer->timer);
> + sysfs_remove_file(idletimer_tg_kobj, &info->timer->attr.attr);
> + kfree(info->timer->attr.attr.name);
> + kfree(info->timer);
> + } else {
> + pr_debug("decreased refcnt of timer %s to %u\n",
> + info->label, info->timer->refcnt);
> + }
> +
> + mutex_unlock(&list_mutex);
> +}
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: Introduce u64_stats_sync infrastructure
From: Eric Dumazet @ 2010-06-15 12:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: David Miller, netdev, bhutchings
In-Reply-To: <20100615110413.GJ6138@laptop>
Le mardi 15 juin 2010 à 21:04 +1000, Nick Piggin a écrit :
> So if you have a regular seqlock, the sequence:
>
> write_seqcount_begin
> stat1++
> stat2--
> write_seqcount_end
>
> do read_seqcount_begin
> sum = stat1+stat2;
> while (read_seqcount_retry)
> BUG_ON(sum != 0);
>
> This code is OK. But if it is using your stat sync, then it is buggy.
> This is obvious to you of course, but someone who doesn't consider
> the implementation might get caught out.
>
> I guess all my other points are properties of seqcount code itself,
> but they are not documented really well with the seqlock API
> unfortunately.
>
Yep, not counting fact that stat_sync is a noop on 64 bits anyway, so
external synchronization is pretty much needed (and probably already
present on all path were I need this stuff. If not, it must be fixed)
^ permalink raw reply
* Re: udp: Fix bogus UFO packet generation
From: Michael S. Tsirkin @ 2010-06-15 12:15 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
In-Reply-To: <20100615115225.GA7239@gondor.apana.org.au>
On Tue, Jun 15, 2010 at 09:52:25PM +1000, Herbert Xu wrote:
> Hi:
>
> udp: Fix bogus UFO packet generation
>
> It has been reported that the new UFO software fallback path
> fails under certain conditions with NFS. I tracked the problem
> down to the generation of UFO packets that are smaller than the
> MTU. The software fallback path simply discards these packets.
>
> This patch fixes the problem by not generating such packets on
> the UFO path.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
FWIW
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 9a4a6c9..041d41d 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -873,8 +873,10 @@ int ip_append_data(struct sock *sk,
> !exthdrlen)
> csummode = CHECKSUM_PARTIAL;
>
> + skb = skb_peek_tail(&sk->sk_write_queue);
> +
> inet->cork.length += length;
> - if (((length> mtu) || !skb_queue_empty(&sk->sk_write_queue)) &&
> + if (((length > mtu) || (skb && skb_is_gso(skb))) &&
> (sk->sk_protocol == IPPROTO_UDP) &&
> (rt->u.dst.dev->features & NETIF_F_UFO)) {
> err = ip_ufo_append_data(sk, getfrag, from, length, hh_len,
> @@ -892,7 +894,7 @@ int ip_append_data(struct sock *sk,
> * adding appropriate IP header.
> */
>
> - if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL)
> + if (!skb)
> goto alloc_new_skb;
>
> while (length > 0) {
> @@ -1121,7 +1123,8 @@ ssize_t ip_append_page(struct sock *sk, struct page *page,
> return -EINVAL;
>
> inet->cork.length += size;
> - if ((sk->sk_protocol == IPPROTO_UDP) &&
> + if ((size + skb->len > mtu) &&
> + (sk->sk_protocol == IPPROTO_UDP) &&
> (rt->u.dst.dev->features & NETIF_F_UFO)) {
> skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
> skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
>
> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v5] netfilter: Xtables: idletimer target implementation
From: Luciano Coelho @ 2010-06-15 12:34 UTC (permalink / raw)
To: ext Patrick McHardy
Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
jengelh@medozas.de, Timo Teras
In-Reply-To: <4C1769F1.40506@trash.net>
On Tue, 2010-06-15 at 13:54 +0200, ext Patrick McHardy wrote:
> Two final comments:
>
> Luciano Coelho wrote:
> > diff --git a/include/linux/netfilter/xt_IDLETIMER.h b/include/linux/netfilter/xt_IDLETIMER.h
> > new file mode 100644
> > index 0000000..3e1aa1b
> > --- /dev/null
> > +++ b/include/linux/netfilter/xt_IDLETIMER.h
>
> This file needs to be added to Kbuild to make sure it gets installed
> during "make headers_install".
Yes, thanks for pointing out. I'll add it.
> > +static void idletimer_tg_destroy(const struct xt_tgdtor_param *par)
> > +{
> > + const struct idletimer_tg_info *info = par->targinfo;
> > +
> > + pr_debug("destroy targinfo %s\n", info->label);
> > +
> > + mutex_lock(&list_mutex);
> > + if (!info->timer) {
> >
>
> This doesn't seem to be possible.
As usual, you're right. I'll remove this useless check.
v6 coming soon!
--
Cheers,
Luca.
^ permalink raw reply
* [PATCH v6] netfilter: Xtables: idletimer target implementation
From: Luciano Coelho @ 2010-06-15 12:45 UTC (permalink / raw)
To: netfilter-devel; +Cc: netdev, kaber, jengelh, Timo Teras
This patch implements an idletimer Xtables target that can be used to
identify when interfaces have been idle for a certain period of time.
Timers are identified by labels and are created when a rule is set with a new
label. The rules also take a timeout value (in seconds) as an option. If
more than one rule uses the same timer label, the timer will be restarted
whenever any of the rules get a hit.
One entry for each timer is created in sysfs. This attribute contains the
timer remaining for the timer to expire. The attributes are located under
the xt_idletimer class:
/sys/class/xt_idletimer/timers/<label>
When the timer expires, the target module sends a sysfs notification to the
userspace, which can then decide what to do (eg. disconnect to save power).
Cc: Timo Teras <timo.teras@iki.fi>
Signed-off-by: Luciano Coelho <luciano.coelho@nokia.com>
---
v2: Fixed according to Jan's comments
v3: Changed to a device class in the virtual bus in sysfs
Removed unnecessary attribute group
Fixed missing deallocation in some error cases
v4: Fixed according to Jan's and Patrick's comments to v3
Changed to mutex locking instead of spin locks
Save the timer in the target info struct to avoid extra reads
Other small clean-ups
v5: Changed max label length to 28 to avoid waste in 64-bit archs
Changed unnecessary atomic memory allocation
Removed mutex locks from the target function
v6: Added xt_IDLETIMER.h to Kbuild
Removed one useless check in idletimer_tg_destroy()
include/linux/netfilter/Kbuild | 1 +
include/linux/netfilter/xt_IDLETIMER.h | 45 +++++
net/netfilter/Kconfig | 12 ++
net/netfilter/Makefile | 1 +
net/netfilter/xt_IDLETIMER.c | 314 ++++++++++++++++++++++++++++++++
5 files changed, 373 insertions(+), 0 deletions(-)
create mode 100644 include/linux/netfilter/xt_IDLETIMER.h
create mode 100644 net/netfilter/xt_IDLETIMER.c
diff --git a/include/linux/netfilter/Kbuild b/include/linux/netfilter/Kbuild
index 48767cd..bb103f4 100644
--- a/include/linux/netfilter/Kbuild
+++ b/include/linux/netfilter/Kbuild
@@ -8,6 +8,7 @@ header-y += xt_CONNMARK.h
header-y += xt_CONNSECMARK.h
header-y += xt_CT.h
header-y += xt_DSCP.h
+header-y += xt_IDLETIMER.h
header-y += xt_LED.h
header-y += xt_MARK.h
header-y += xt_NFLOG.h
diff --git a/include/linux/netfilter/xt_IDLETIMER.h b/include/linux/netfilter/xt_IDLETIMER.h
new file mode 100644
index 0000000..3e1aa1b
--- /dev/null
+++ b/include/linux/netfilter/xt_IDLETIMER.h
@@ -0,0 +1,45 @@
+/*
+ * linux/include/linux/netfilter/xt_IDLETIMER.h
+ *
+ * Header file for Xtables timer target module.
+ *
+ * Copyright (C) 2004, 2010 Nokia Corporation
+ * Written by Timo Teras <ext-timo.teras@nokia.com>
+ *
+ * Converted to x_tables and forward-ported to 2.6.34
+ * by Luciano Coelho <luciano.coelho@nokia.com>
+ *
+ * Contact: Luciano Coelho <luciano.coelho@nokia.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
+ * 02110-1301 USA
+ */
+
+#ifndef _XT_IDLETIMER_H
+#define _XT_IDLETIMER_H
+
+#include <linux/types.h>
+
+#define MAX_IDLETIMER_LABEL_SIZE 28
+
+struct idletimer_tg_info {
+ __u32 timeout;
+
+ char label[MAX_IDLETIMER_LABEL_SIZE];
+
+ /* for kernel module internal use only */
+ struct idletimer_tg *timer __attribute((aligned(8)));
+};
+
+#endif
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 8593a77..413ed24 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -424,6 +424,18 @@ config NETFILTER_XT_TARGET_HL
since you can easily create immortal packets that loop
forever on the network.
+config NETFILTER_XT_TARGET_IDLETIMER
+ tristate "IDLETIMER target support"
+ depends on NETFILTER_ADVANCED
+ help
+
+ This option adds the `IDLETIMER' target. Each matching packet
+ resets the timer associated with label specified when the rule is
+ added. When the timer expires, it triggers a sysfs notification.
+ The remaining time for expiration can be read via sysfs.
+
+ To compile it as a module, choose M here. If unsure, say N.
+
config NETFILTER_XT_TARGET_LED
tristate '"LED" target support'
depends on LEDS_CLASS && LEDS_TRIGGERS
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 14e3a8f..e28420a 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_TCPMSS) += xt_TCPMSS.o
obj-$(CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP) += xt_TCPOPTSTRIP.o
obj-$(CONFIG_NETFILTER_XT_TARGET_TEE) += xt_TEE.o
obj-$(CONFIG_NETFILTER_XT_TARGET_TRACE) += xt_TRACE.o
+obj-$(CONFIG_NETFILTER_XT_TARGET_IDLETIMER) += xt_IDLETIMER.o
# matches
obj-$(CONFIG_NETFILTER_XT_MATCH_CLUSTER) += xt_cluster.o
diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c
new file mode 100644
index 0000000..fd9e134
--- /dev/null
+++ b/net/netfilter/xt_IDLETIMER.c
@@ -0,0 +1,314 @@
+/*
+ * linux/net/netfilter/xt_IDLETIMER.c
+ *
+ * Netfilter module to trigger a timer when packet matches.
+ * After timer expires a kevent will be sent.
+ *
+ * Copyright (C) 2004, 2010 Nokia Corporation
+ * Written by Timo Teras <ext-timo.teras@nokia.com>
+ *
+ * Converted to x_tables and reworked for upstream inclusion
+ * by Luciano Coelho <luciano.coelho@nokia.com>
+ *
+ * Contact: Luciano Coelho <luciano.coelho@nokia.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
+ * 02110-1301 USA
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/timer.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_IDLETIMER.h>
+#include <linux/kobject.h>
+#include <linux/workqueue.h>
+#include <linux/sysfs.h>
+
+struct idletimer_tg_attr {
+ struct attribute attr;
+ ssize_t (*show)(struct kobject *kobj,
+ struct attribute *attr, char *buf);
+};
+
+struct idletimer_tg {
+ struct list_head entry;
+ struct timer_list timer;
+ struct work_struct work;
+
+ struct kobject *kobj;
+ struct idletimer_tg_attr attr;
+
+ unsigned int refcnt;
+};
+
+static LIST_HEAD(idletimer_tg_list);
+static DEFINE_MUTEX(list_mutex);
+
+static struct kobject *idletimer_tg_kobj;
+
+static
+struct idletimer_tg *__idletimer_tg_find_by_label(const char *label)
+{
+ struct idletimer_tg *entry;
+
+ BUG_ON(!label);
+
+ list_for_each_entry(entry, &idletimer_tg_list, entry) {
+ if (!strcmp(label, entry->attr.attr.name))
+ return entry;
+ }
+
+ return NULL;
+}
+
+static ssize_t idletimer_tg_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct idletimer_tg *timer;
+ unsigned long expires = 0;
+
+ mutex_lock(&list_mutex);
+
+ timer = __idletimer_tg_find_by_label(attr->name);
+ if (timer)
+ expires = timer->timer.expires;
+
+ mutex_unlock(&list_mutex);
+
+ if (time_after(expires, jiffies))
+ return sprintf(buf, "%u\n",
+ jiffies_to_msecs(expires - jiffies) / 1000);
+
+ return sprintf(buf, "0\n");
+}
+
+static void idletimer_tg_work(struct work_struct *work)
+{
+ struct idletimer_tg *timer = container_of(work, struct idletimer_tg,
+ work);
+
+ sysfs_notify(idletimer_tg_kobj, NULL, timer->attr.attr.name);
+}
+
+static void idletimer_tg_expired(unsigned long data)
+{
+ struct idletimer_tg *timer = (struct idletimer_tg *) data;
+
+ pr_debug("timer %s expired\n", timer->attr.attr.name);
+
+ schedule_work(&timer->work);
+}
+
+static int idletimer_tg_create(struct idletimer_tg_info *info)
+{
+ int ret;
+
+ info->timer = kmalloc(sizeof(*info->timer), GFP_KERNEL);
+ if (!info->timer) {
+ pr_debug("couldn't alloc timer\n");
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ info->timer->attr.attr.name = kstrdup(info->label, GFP_KERNEL);
+ if (!info->timer->attr.attr.name) {
+ pr_debug("couldn't alloc attribute name\n");
+ ret = -ENOMEM;
+ goto out_free_timer;
+ }
+ info->timer->attr.attr.mode = S_IRUGO;
+ info->timer->attr.show = idletimer_tg_show;
+
+ ret = sysfs_create_file(idletimer_tg_kobj, &info->timer->attr.attr);
+ if (ret < 0) {
+ pr_debug("couldn't add file to sysfs");
+ goto out_free_attr;
+ }
+
+ list_add(&info->timer->entry, &idletimer_tg_list);
+
+ setup_timer(&info->timer->timer, idletimer_tg_expired,
+ (unsigned long) info->timer);
+ info->timer->refcnt = 1;
+
+ mod_timer(&info->timer->timer,
+ msecs_to_jiffies(info->timeout * 1000) + jiffies);
+
+ INIT_WORK(&info->timer->work, idletimer_tg_work);
+
+ return 0;
+
+out_free_attr:
+ kfree(info->timer->attr.attr.name);
+out_free_timer:
+ kfree(info->timer);
+out:
+ return ret;
+}
+
+/*
+ * The actual xt_tables plugin.
+ */
+static unsigned int idletimer_tg_target(struct sk_buff *skb,
+ const struct xt_action_param *par)
+{
+ const struct idletimer_tg_info *info = par->targinfo;
+
+ pr_debug("resetting timer %s, timeout period %u\n",
+ info->label, info->timeout);
+
+ BUG_ON(!info->timer);
+
+ mod_timer(&info->timer->timer,
+ msecs_to_jiffies(info->timeout * 1000) + jiffies);
+
+ return XT_CONTINUE;
+}
+
+static int idletimer_tg_checkentry(struct xt_tgchk_param *par)
+{
+ struct idletimer_tg_info *info = par->targinfo;
+ int ret;
+
+ pr_debug("checkentry targinfo%s\n", info->label);
+
+ if (info->timeout == 0) {
+ pr_debug("timeout value is zero\n");
+ return -EINVAL;
+ }
+
+ if (info->label[0] == '\0' ||
+ strnlen(info->label,
+ MAX_IDLETIMER_LABEL_SIZE) == MAX_IDLETIMER_LABEL_SIZE) {
+ pr_debug("label is empty or not nul-terminated\n");
+ return -EINVAL;
+ }
+
+ mutex_lock(&list_mutex);
+
+ info->timer = __idletimer_tg_find_by_label(info->label);
+ if (info->timer) {
+ info->timer->refcnt++;
+ mod_timer(&info->timer->timer,
+ msecs_to_jiffies(info->timeout * 1000) + jiffies);
+
+ pr_debug("increased refcnt of timer %s to %u\n",
+ info->label, info->timer->refcnt);
+ } else {
+ ret = idletimer_tg_create(info);
+ if (ret < 0) {
+ pr_debug("failed to create timer\n");
+ mutex_unlock(&list_mutex);
+ return ret;
+ }
+ }
+
+ mutex_unlock(&list_mutex);
+ return 0;
+}
+
+static void idletimer_tg_destroy(const struct xt_tgdtor_param *par)
+{
+ const struct idletimer_tg_info *info = par->targinfo;
+
+ pr_debug("destroy targinfo %s\n", info->label);
+
+ mutex_lock(&list_mutex);
+
+ if (--info->timer->refcnt == 0) {
+ pr_debug("deleting timer %s\n", info->label);
+
+ list_del(&info->timer->entry);
+ del_timer_sync(&info->timer->timer);
+ sysfs_remove_file(idletimer_tg_kobj, &info->timer->attr.attr);
+ kfree(info->timer->attr.attr.name);
+ kfree(info->timer);
+ } else {
+ pr_debug("decreased refcnt of timer %s to %u\n",
+ info->label, info->timer->refcnt);
+ }
+
+ mutex_unlock(&list_mutex);
+}
+
+static struct xt_target idletimer_tg __read_mostly = {
+ .name = "IDLETIMER",
+ .family = NFPROTO_UNSPEC,
+ .target = idletimer_tg_target,
+ .targetsize = sizeof(struct idletimer_tg_info),
+ .checkentry = idletimer_tg_checkentry,
+ .destroy = idletimer_tg_destroy,
+ .me = THIS_MODULE,
+};
+
+static struct class *idletimer_tg_class;
+
+static struct device *idletimer_tg_device;
+
+static int __init idletimer_tg_init(void)
+{
+ int err;
+
+ idletimer_tg_class = class_create(THIS_MODULE, "xt_idletimer");
+ err = PTR_ERR(idletimer_tg_class);
+ if (IS_ERR(idletimer_tg_class)) {
+ pr_debug("couldn't register device class\n");
+ goto out;
+ }
+
+ idletimer_tg_device = device_create(idletimer_tg_class, NULL,
+ MKDEV(0, 0), NULL, "timers");
+ err = PTR_ERR(idletimer_tg_device);
+ if (IS_ERR(idletimer_tg_device)) {
+ pr_debug("couldn't register system device\n");
+ goto out_class;
+ }
+
+ idletimer_tg_kobj = &idletimer_tg_device->kobj;
+
+ err = xt_register_target(&idletimer_tg);
+ if (err < 0) {
+ pr_debug("couldn't register xt target\n");
+ goto out_dev;
+ }
+
+ return 0;
+out_dev:
+ device_destroy(idletimer_tg_class, MKDEV(0, 0));
+out_class:
+ class_destroy(idletimer_tg_class);
+out:
+ return err;
+}
+
+static void __exit idletimer_tg_exit(void)
+{
+ xt_unregister_target(&idletimer_tg);
+
+ device_destroy(idletimer_tg_class, MKDEV(0, 0));
+ class_destroy(idletimer_tg_class);
+}
+
+module_init(idletimer_tg_init);
+module_exit(idletimer_tg_exit);
+
+MODULE_AUTHOR("Timo Teras <ext-timo.teras@nokia.com>");
+MODULE_AUTHOR("Luciano Coelho <luciano.coelho@nokia.com>");
+MODULE_DESCRIPTION("Xtables: idle time monitor");
+MODULE_LICENSE("GPL v2");
--
1.6.3.3
^ permalink raw reply related
* Re: [PATCH net-next-2.6 2/3] macvlan: use rx_handler_data pointer to store macvlan_port pointer
From: Patrick McHardy @ 2010-06-15 12:48 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, shemminger, eric.dumazet
In-Reply-To: <20100610133547.GE2618@psychotron.lab.eng.brq.redhat.com>
Jiri Pirko wrote:
> Register macvlan_port pointer as rx_handler data pointer. As macvlan_port is
> removed from struct net_device, another netdev priv_flag is added to indicate
> the device serves as a macvlan port.
Looks fine to me.
Acked-by: Patrick McHardy <kaber@trash.net>
^ permalink raw reply
* Re: [PATCH v6] netfilter: Xtables: idletimer target implementation
From: Patrick McHardy @ 2010-06-15 13:05 UTC (permalink / raw)
To: Luciano Coelho; +Cc: netfilter-devel, netdev, jengelh, Timo Teras
In-Reply-To: <1276605926-11598-1-git-send-email-luciano.coelho@nokia.com>
Luciano Coelho wrote:
> This patch implements an idletimer Xtables target that can be used to
> identify when interfaces have been idle for a certain period of time.
>
> Timers are identified by labels and are created when a rule is set with a new
> label. The rules also take a timeout value (in seconds) as an option. If
> more than one rule uses the same timer label, the timer will be restarted
> whenever any of the rules get a hit.
>
> One entry for each timer is created in sysfs. This attribute contains the
> timer remaining for the timer to expire. The attributes are located under
> the xt_idletimer class:
>
> /sys/class/xt_idletimer/timers/<label>
>
> When the timer expires, the target module sends a sysfs notification to the
> userspace, which can then decide what to do (eg. disconnect to save power).
>
Applied with a minor compiler warning fix:
CC [M] net/netfilter/xt_IDLETIMER.o
net/netfilter/xt_IDLETIMER.c:255: warning: initialization from
incompatible pointer type
The struct xt_tgchk_param argument to ->checkentry is const.
^ permalink raw reply
* Re: [PATCH v6] netfilter: Xtables: idletimer target implementation
From: Luciano Coelho @ 2010-06-15 13:18 UTC (permalink / raw)
To: ext Patrick McHardy
Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
jengelh@medozas.de, Timo Teras
In-Reply-To: <4C177A89.7060301@trash.net>
On Tue, 2010-06-15 at 15:05 +0200, ext Patrick McHardy wrote:
> Luciano Coelho wrote:
> > This patch implements an idletimer Xtables target that can be used to
> > identify when interfaces have been idle for a certain period of time.
> >
> > Timers are identified by labels and are created when a rule is set with a new
> > label. The rules also take a timeout value (in seconds) as an option. If
> > more than one rule uses the same timer label, the timer will be restarted
> > whenever any of the rules get a hit.
> >
> > One entry for each timer is created in sysfs. This attribute contains the
> > timer remaining for the timer to expire. The attributes are located under
> > the xt_idletimer class:
> >
> > /sys/class/xt_idletimer/timers/<label>
> >
> > When the timer expires, the target module sends a sysfs notification to the
> > userspace, which can then decide what to do (eg. disconnect to save power).
> >
>
> Applied with a minor compiler warning fix:
>
> CC [M] net/netfilter/xt_IDLETIMER.o
> net/netfilter/xt_IDLETIMER.c:255: warning: initialization from
> incompatible pointer type
>
> The struct xt_tgchk_param argument to ->checkentry is const.
Oops, missed that one. Thanks for fixing it.
And thanks a lot for bearing with a netfilter newbie, providing lots of
good comments and generally helping me with this component! I really
appreciate it.
--
Cheers,
Luca.
^ permalink raw reply
* [PATCH net-next-2.6 2/3] macvlan: use rx_handler_data pointer to store macvlan_port pointer V2
From: Jiri Pirko @ 2010-06-15 13:27 UTC (permalink / raw)
To: netdev; +Cc: davem, shemminger, kaber, eric.dumazet
In-Reply-To: <20100610133547.GE2618@psychotron.lab.eng.brq.redhat.com>
Register macvlan_port pointer as rx_handler data pointer. As macvlan_port is
removed from struct net_device, another netdev priv_flag is added to indicate
the device serves as a macvlan port.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
---
v1->v2:
- coding style corrections
drivers/net/macvlan.c | 28 ++++++++++++++++------------
include/linux/if.h | 1 +
include/linux/netdevice.h | 2 --
3 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 87a3bf6..e096875 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -40,6 +40,11 @@ struct macvlan_port {
struct rcu_head rcu;
};
+#define macvlan_port_get_rcu(dev) \
+ ((struct macvlan_port *) rcu_dereference(dev->rx_handler_data))
+#define macvlan_port_get(dev) ((struct macvlan_port *) dev->rx_handler_data)
+#define macvlan_port_exists(dev) (dev->priv_flags & IFF_MACVLAN_PORT)
+
static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
const unsigned char *addr)
{
@@ -155,7 +160,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
struct net_device *dev;
unsigned int len;
- port = rcu_dereference(skb->dev->macvlan_port);
+ port = macvlan_port_get_rcu(skb->dev);
if (is_multicast_ether_addr(eth->h_dest)) {
src = macvlan_hash_lookup(port, eth->h_source);
if (!src)
@@ -530,14 +535,12 @@ static int macvlan_port_create(struct net_device *dev)
INIT_LIST_HEAD(&port->vlans);
for (i = 0; i < MACVLAN_HASH_SIZE; i++)
INIT_HLIST_HEAD(&port->vlan_hash[i]);
- rcu_assign_pointer(dev->macvlan_port, port);
- err = netdev_rx_handler_register(dev, macvlan_handle_frame, NULL);
- if (err) {
- rcu_assign_pointer(dev->macvlan_port, NULL);
+ err = netdev_rx_handler_register(dev, macvlan_handle_frame, port);
+ if (err)
kfree(port);
- }
+ dev->priv_flags |= IFF_MACVLAN_PORT;
return err;
}
@@ -551,10 +554,10 @@ static void macvlan_port_rcu_free(struct rcu_head *head)
static void macvlan_port_destroy(struct net_device *dev)
{
- struct macvlan_port *port = dev->macvlan_port;
+ struct macvlan_port *port = macvlan_port_get(dev);
+ dev->priv_flags &= ~IFF_MACVLAN_PORT;
netdev_rx_handler_unregister(dev);
- rcu_assign_pointer(dev->macvlan_port, NULL);
call_rcu(&port->rcu, macvlan_port_rcu_free);
}
@@ -633,12 +636,12 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
if (!tb[IFLA_ADDRESS])
random_ether_addr(dev->dev_addr);
- if (lowerdev->macvlan_port == NULL) {
+ if (!macvlan_port_exists(lowerdev)) {
err = macvlan_port_create(lowerdev);
if (err < 0)
return err;
}
- port = lowerdev->macvlan_port;
+ port = macvlan_port_get(lowerdev);
vlan->lowerdev = lowerdev;
vlan->dev = dev;
@@ -748,10 +751,11 @@ static int macvlan_device_event(struct notifier_block *unused,
struct macvlan_dev *vlan, *next;
struct macvlan_port *port;
- port = dev->macvlan_port;
- if (port == NULL)
+ if (!macvlan_port_exists(dev))
return NOTIFY_DONE;
+ port = macvlan_port_get(dev);
+
switch (event) {
case NETDEV_CHANGE:
list_for_each_entry(vlan, &port->vlans, list)
diff --git a/include/linux/if.h b/include/linux/if.h
index be350e6..31f2e27 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -73,6 +73,7 @@
#define IFF_DONT_BRIDGE 0x800 /* disallow bridging this ether dev */
#define IFF_IN_NETPOLL 0x1000 /* whether we are processing netpoll */
#define IFF_DISABLE_NETPOLL 0x2000 /* disable netpoll at run-time */
+#define IFF_MACVLAN_PORT 0x4000 /* device used as macvlan port */
#define IF_GET_IFACE 0x0001 /* for querying only */
#define IF_GET_PROTO 0x0002
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb2c4e7..d8d00fa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1047,8 +1047,6 @@ struct net_device {
/* bridge stuff */
struct net_bridge_port *br_port;
- /* macvlan */
- struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port *garp_port;
--
1.7.0.1
^ permalink raw reply related
* [PATCH net-next-2.6 3/3] bridge: use rx_handler_data pointer to store net_bridge_port pointer V2
From: Jiri Pirko @ 2010-06-15 13:28 UTC (permalink / raw)
To: netdev; +Cc: davem, shemminger, kaber, eric.dumazet
In-Reply-To: <20100610133628.GF2618@psychotron.lab.eng.brq.redhat.com>
Register net_bridge_port pointer as rx_handler data pointer. As br_port is
removed from struct net_device, another netdev priv_flag is added to indicate
the device serves as a bridge port. Also rcuized pointers are now correctly
dereferenced in br_fdb.c and in netfilter parts.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
v1->v2:
coding style corrections
drivers/net/ksz884x.c | 2 +-
drivers/staging/batman-adv/hard-interface.c | 2 +-
include/linux/if.h | 1 +
include/linux/netdevice.h | 2 --
net/bridge/br_fdb.c | 4 ++--
net/bridge/br_if.c | 23 +++++++++++++----------
net/bridge/br_input.c | 9 ++++-----
net/bridge/br_netfilter.c | 11 ++++++-----
net/bridge/br_netlink.c | 9 +++++----
net/bridge/br_notify.c | 5 +++--
net/bridge/br_private.h | 5 +++++
net/bridge/br_stp_bpdu.c | 5 +++--
net/bridge/netfilter/ebt_redirect.c | 3 ++-
net/bridge/netfilter/ebt_ulog.c | 8 +++++---
net/bridge/netfilter/ebtables.c | 11 +++++++----
net/core/dev.c | 3 ++-
net/netfilter/nfnetlink_log.c | 6 ++++--
net/netfilter/nfnetlink_queue.c | 6 ++++--
net/wireless/nl80211.c | 2 +-
net/wireless/util.c | 4 ++--
20 files changed, 71 insertions(+), 50 deletions(-)
diff --git a/drivers/net/ksz884x.c b/drivers/net/ksz884x.c
index 7805bbf..e7f4703 100644
--- a/drivers/net/ksz884x.c
+++ b/drivers/net/ksz884x.c
@@ -5718,7 +5718,7 @@ static void dev_set_promiscuous(struct net_device *dev, struct dev_priv *priv,
* from the bridge.
*/
if ((hw->features & STP_SUPPORT) && !promiscuous &&
- dev->br_port) {
+ dev->priv_flags & IFF_BRIDGE_PORT) {
struct ksz_switch *sw = hw->ksz_switch;
int port = priv->port.first_port;
diff --git a/drivers/staging/batman-adv/hard-interface.c b/drivers/staging/batman-adv/hard-interface.c
index 7a582e8..5ede9c2 100644
--- a/drivers/staging/batman-adv/hard-interface.c
+++ b/drivers/staging/batman-adv/hard-interface.c
@@ -71,7 +71,7 @@ static int is_valid_iface(struct net_device *net_dev)
#endif
/* Device is being bridged */
- /* if (net_dev->br_port != NULL)
+ /* if (net_dev->priv_flags & IFF_BRIDGE_PORT)
return 0; */
return 1;
diff --git a/include/linux/if.h b/include/linux/if.h
index 31f2e27..53558ec 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -74,6 +74,7 @@
#define IFF_IN_NETPOLL 0x1000 /* whether we are processing netpoll */
#define IFF_DISABLE_NETPOLL 0x2000 /* disable netpoll at run-time */
#define IFF_MACVLAN_PORT 0x4000 /* device used as macvlan port */
+#define IFF_BRIDGE_PORT 0x8000 /* device used as bridge port */
#define IF_GET_IFACE 0x0001 /* for querying only */
#define IF_GET_PROTO 0x0002
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d8d00fa..727404d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1045,8 +1045,6 @@ struct net_device {
/* mid-layer private */
void *ml_priv;
- /* bridge stuff */
- struct net_bridge_port *br_port;
/* GARP */
struct garp_port *garp_port;
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 2663743..6818e60 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -242,11 +242,11 @@ int br_fdb_test_addr(struct net_device *dev, unsigned char *addr)
struct net_bridge_fdb_entry *fdb;
int ret;
- if (!dev->br_port)
+ if (!br_port_exists(dev))
return 0;
rcu_read_lock();
- fdb = __br_fdb_get(dev->br_port->br, addr);
+ fdb = __br_fdb_get(br_port_get_rcu(dev)->br, addr);
ret = fdb && fdb->dst->dev != dev &&
fdb->dst->state == BR_STATE_FORWARDING;
rcu_read_unlock();
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 90ac003..099495e 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -147,8 +147,9 @@ static void del_nbp(struct net_bridge_port *p)
list_del_rcu(&p->list);
+ dev->priv_flags &= ~IFF_BRIDGE_PORT;
+
netdev_rx_handler_unregister(dev);
- rcu_assign_pointer(dev->br_port, NULL);
br_multicast_del_port(p);
@@ -401,7 +402,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
return -ELOOP;
/* Device is already being bridged */
- if (dev->br_port != NULL)
+ if (br_port_exists(dev))
return -EBUSY;
/* No bridging devices that dislike that (e.g. wireless) */
@@ -429,11 +430,11 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
if (err)
goto err2;
- rcu_assign_pointer(dev->br_port, p);
-
- err = netdev_rx_handler_register(dev, br_handle_frame, NULL);
+ err = netdev_rx_handler_register(dev, br_handle_frame, p);
if (err)
- goto err3;
+ goto err2;
+
+ dev->priv_flags |= IFF_BRIDGE_PORT;
dev_disable_lro(dev);
@@ -457,8 +458,6 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
br_netpoll_enable(br, dev);
return 0;
-err3:
- rcu_assign_pointer(dev->br_port, NULL);
err2:
br_fdb_delete_by_port(br, p, 1);
err1:
@@ -475,9 +474,13 @@ put_back:
/* called with RTNL */
int br_del_if(struct net_bridge *br, struct net_device *dev)
{
- struct net_bridge_port *p = dev->br_port;
+ struct net_bridge_port *p;
+
+ if (!br_port_exists(dev))
+ return -EINVAL;
- if (!p || p->br != br)
+ p = br_port_get(dev);
+ if (p->br != br)
return -EINVAL;
del_nbp(p);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 99647d8..f076c9d 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -41,7 +41,7 @@ static int br_pass_frame_up(struct sk_buff *skb)
int br_handle_frame_finish(struct sk_buff *skb)
{
const unsigned char *dest = eth_hdr(skb)->h_dest;
- struct net_bridge_port *p = rcu_dereference(skb->dev->br_port);
+ struct net_bridge_port *p = br_port_get_rcu(skb->dev);
struct net_bridge *br;
struct net_bridge_fdb_entry *dst;
struct net_bridge_mdb_entry *mdst;
@@ -111,10 +111,9 @@ drop:
/* note: already called with rcu_read_lock (preempt_disabled) */
static int br_handle_local_finish(struct sk_buff *skb)
{
- struct net_bridge_port *p = rcu_dereference(skb->dev->br_port);
+ struct net_bridge_port *p = br_port_get_rcu(skb->dev);
- if (p)
- br_fdb_update(p->br, p, eth_hdr(skb)->h_source);
+ br_fdb_update(p->br, p, eth_hdr(skb)->h_source);
return 0; /* process further */
}
@@ -151,7 +150,7 @@ struct sk_buff *br_handle_frame(struct sk_buff *skb)
if (!skb)
return NULL;
- p = rcu_dereference(skb->dev->br_port);
+ p = br_port_get_rcu(skb->dev);
if (unlikely(is_link_local(dest))) {
/* Pause frames shouldn't be passed up by driver anyway */
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 0685b25..f54404d 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -127,16 +127,17 @@ void br_netfilter_rtable_init(struct net_bridge *br)
static inline struct rtable *bridge_parent_rtable(const struct net_device *dev)
{
- struct net_bridge_port *port = rcu_dereference(dev->br_port);
-
- return port ? &port->br->fake_rtable : NULL;
+ if (!br_port_exists(dev))
+ return NULL;
+ return &br_port_get_rcu(dev)->br->fake_rtable;
}
static inline struct net_device *bridge_parent(const struct net_device *dev)
{
- struct net_bridge_port *port = rcu_dereference(dev->br_port);
+ if (!br_port_exists(dev))
+ return NULL;
- return port ? port->br->dev : NULL;
+ return br_port_get_rcu(dev)->br->dev;
}
static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index fe0a790..4a6a378 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -120,10 +120,11 @@ static int br_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
idx = 0;
for_each_netdev(net, dev) {
/* not a bridge port */
- if (dev->br_port == NULL || idx < cb->args[0])
+ if (!br_port_exists(dev) || idx < cb->args[0])
goto skip;
- if (br_fill_ifinfo(skb, dev->br_port, NETLINK_CB(cb->skb).pid,
+ if (br_fill_ifinfo(skb, br_port_get(dev),
+ NETLINK_CB(cb->skb).pid,
cb->nlh->nlmsg_seq, RTM_NEWLINK,
NLM_F_MULTI) < 0)
break;
@@ -168,9 +169,9 @@ static int br_rtm_setlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
if (!dev)
return -ENODEV;
- p = dev->br_port;
- if (!p)
+ if (!br_port_exists(dev))
return -EINVAL;
+ p = br_port_get(dev);
/* if kernel STP is running, don't allow changes */
if (p->br->stp_enabled == BR_KERNEL_STP)
diff --git a/net/bridge/br_notify.c b/net/bridge/br_notify.c
index 717e1fd..404d4e1 100644
--- a/net/bridge/br_notify.c
+++ b/net/bridge/br_notify.c
@@ -32,14 +32,15 @@ struct notifier_block br_device_notifier = {
static int br_device_event(struct notifier_block *unused, unsigned long event, void *ptr)
{
struct net_device *dev = ptr;
- struct net_bridge_port *p = dev->br_port;
+ struct net_bridge_port *p = br_port_get(dev);
struct net_bridge *br;
int err;
/* not a port of a bridge */
- if (p == NULL)
+ if (!br_port_exists(dev))
return NOTIFY_DONE;
+ p = br_port_get(dev);
br = p->br;
switch (event) {
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index c83519b..5837052 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -145,6 +145,11 @@ struct net_bridge_port
#endif
};
+#define br_port_get_rcu(dev) \
+ ((struct net_bridge_port *) rcu_dereference(dev->rx_handler_data))
+#define br_port_get(dev) ((struct net_bridge_port *) dev->rx_handler_data)
+#define br_port_exists(dev) (dev->priv_flags & IFF_BRIDGE_PORT)
+
struct br_cpu_netstats {
unsigned long rx_packets;
unsigned long rx_bytes;
diff --git a/net/bridge/br_stp_bpdu.c b/net/bridge/br_stp_bpdu.c
index 217bd22..70aecb4 100644
--- a/net/bridge/br_stp_bpdu.c
+++ b/net/bridge/br_stp_bpdu.c
@@ -137,12 +137,13 @@ void br_stp_rcv(const struct stp_proto *proto, struct sk_buff *skb,
struct net_device *dev)
{
const unsigned char *dest = eth_hdr(skb)->h_dest;
- struct net_bridge_port *p = rcu_dereference(dev->br_port);
+ struct net_bridge_port *p;
struct net_bridge *br;
const unsigned char *buf;
- if (!p)
+ if (!br_port_exists(dev))
goto err;
+ p = br_port_get_rcu(dev);
if (!pskb_may_pull(skb, 4))
goto err;
diff --git a/net/bridge/netfilter/ebt_redirect.c b/net/bridge/netfilter/ebt_redirect.c
index 9e19166..46624bb 100644
--- a/net/bridge/netfilter/ebt_redirect.c
+++ b/net/bridge/netfilter/ebt_redirect.c
@@ -24,8 +24,9 @@ ebt_redirect_tg(struct sk_buff *skb, const struct xt_action_param *par)
return EBT_DROP;
if (par->hooknum != NF_BR_BROUTING)
+ /* rcu_read_lock()ed by nf_hook_slow */
memcpy(eth_hdr(skb)->h_dest,
- par->in->br_port->br->dev->dev_addr, ETH_ALEN);
+ br_port_get_rcu(par->in)->br->dev->dev_addr, ETH_ALEN);
else
memcpy(eth_hdr(skb)->h_dest, par->in->dev_addr, ETH_ALEN);
skb->pkt_type = PACKET_HOST;
diff --git a/net/bridge/netfilter/ebt_ulog.c b/net/bridge/netfilter/ebt_ulog.c
index ae3c7ce..26377e9 100644
--- a/net/bridge/netfilter/ebt_ulog.c
+++ b/net/bridge/netfilter/ebt_ulog.c
@@ -177,8 +177,9 @@ static void ebt_ulog_packet(unsigned int hooknr, const struct sk_buff *skb,
if (in) {
strcpy(pm->physindev, in->name);
/* If in isn't a bridge, then physindev==indev */
- if (in->br_port)
- strcpy(pm->indev, in->br_port->br->dev->name);
+ if (br_port_exists(in))
+ /* rcu_read_lock()ed by nf_hook_slow */
+ strcpy(pm->indev, br_port_get_rcu(in)->br->dev->name);
else
strcpy(pm->indev, in->name);
} else
@@ -187,7 +188,8 @@ static void ebt_ulog_packet(unsigned int hooknr, const struct sk_buff *skb,
if (out) {
/* If out exists, then out is a bridge port */
strcpy(pm->physoutdev, out->name);
- strcpy(pm->outdev, out->br_port->br->dev->name);
+ /* rcu_read_lock()ed by nf_hook_slow */
+ strcpy(pm->outdev, br_port_get_rcu(out)->br->dev->name);
} else
pm->outdev[0] = pm->physoutdev[0] = '\0';
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 59ca00e..bcc102e 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -140,11 +140,14 @@ ebt_basic_match(const struct ebt_entry *e, const struct ethhdr *h,
return 1;
if (FWINV2(ebt_dev_check(e->out, out), EBT_IOUT))
return 1;
- if ((!in || !in->br_port) ? 0 : FWINV2(ebt_dev_check(
- e->logical_in, in->br_port->br->dev), EBT_ILOGICALIN))
+ /* rcu_read_lock()ed by nf_hook_slow */
+ if (in && br_port_exists(in) &&
+ FWINV2(ebt_dev_check(e->logical_in, br_port_get_rcu(in)->br->dev),
+ EBT_ILOGICALIN))
return 1;
- if ((!out || !out->br_port) ? 0 : FWINV2(ebt_dev_check(
- e->logical_out, out->br_port->br->dev), EBT_ILOGICALOUT))
+ if (out && br_port_exists(out) &&
+ FWINV2(ebt_dev_check(e->logical_out, br_port_get_rcu(out)->br->dev),
+ EBT_ILOGICALOUT))
return 1;
if (e->bitmask & EBT_SOURCEMAC) {
diff --git a/net/core/dev.c b/net/core/dev.c
index abdb19e..5902426 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2765,7 +2765,8 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
if (master->priv_flags & IFF_MASTER_ARPMON)
dev->last_rx = jiffies;
- if ((master->priv_flags & IFF_MASTER_ALB) && master->br_port) {
+ if ((master->priv_flags & IFF_MASTER_ALB) &&
+ (master->priv_flags & IFF_BRIDGE_PORT)) {
/* Do address unmangle. The local destination address
* will be always the one master has. Provides the right
* functionality in a bridge.
diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index fc9a211..e0504e9 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -403,8 +403,9 @@ __build_packet_message(struct nfulnl_instance *inst,
NLA_PUT_BE32(inst->skb, NFULA_IFINDEX_PHYSINDEV,
htonl(indev->ifindex));
/* this is the bridge group "brX" */
+ /* rcu_read_lock()ed by nf_hook_slow or nf_log_packet */
NLA_PUT_BE32(inst->skb, NFULA_IFINDEX_INDEV,
- htonl(indev->br_port->br->dev->ifindex));
+ htonl(br_port_get_rcu(indev)->br->dev->ifindex));
} else {
/* Case 2: indev is bridge group, we need to look for
* physical device (when called from ipv4) */
@@ -430,8 +431,9 @@ __build_packet_message(struct nfulnl_instance *inst,
NLA_PUT_BE32(inst->skb, NFULA_IFINDEX_PHYSOUTDEV,
htonl(outdev->ifindex));
/* this is the bridge group "brX" */
+ /* rcu_read_lock()ed by nf_hook_slow or nf_log_packet */
NLA_PUT_BE32(inst->skb, NFULA_IFINDEX_OUTDEV,
- htonl(outdev->br_port->br->dev->ifindex));
+ htonl(br_port_get_rcu(outdev)->br->dev->ifindex));
} else {
/* Case 2: indev is a bridge group, we need to look
* for physical device (when called from ipv4) */
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 12e1ab3..cc3ae86 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -296,8 +296,9 @@ nfqnl_build_packet_message(struct nfqnl_instance *queue,
NLA_PUT_BE32(skb, NFQA_IFINDEX_PHYSINDEV,
htonl(indev->ifindex));
/* this is the bridge group "brX" */
+ /* rcu_read_lock()ed by __nf_queue */
NLA_PUT_BE32(skb, NFQA_IFINDEX_INDEV,
- htonl(indev->br_port->br->dev->ifindex));
+ htonl(br_port_get_rcu(indev)->br->dev->ifindex));
} else {
/* Case 2: indev is bridge group, we need to look for
* physical device (when called from ipv4) */
@@ -321,8 +322,9 @@ nfqnl_build_packet_message(struct nfqnl_instance *queue,
NLA_PUT_BE32(skb, NFQA_IFINDEX_PHYSOUTDEV,
htonl(outdev->ifindex));
/* this is the bridge group "brX" */
+ /* rcu_read_lock()ed by __nf_queue */
NLA_PUT_BE32(skb, NFQA_IFINDEX_OUTDEV,
- htonl(outdev->br_port->br->dev->ifindex));
+ htonl(br_port_get_rcu(outdev)->br->dev->ifindex));
} else {
/* Case 2: outdev is bridge group, we need to look for
* physical output device (when called from ipv4) */
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 90ab3c8..3a7b8a2 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -1107,7 +1107,7 @@ static int nl80211_valid_4addr(struct cfg80211_registered_device *rdev,
enum nl80211_iftype iftype)
{
if (!use_4addr) {
- if (netdev && netdev->br_port)
+ if (netdev && (netdev->priv_flags & IFF_BRIDGE_PORT))
return -EBUSY;
return 0;
}
diff --git a/net/wireless/util.c b/net/wireless/util.c
index 3416373..0c8a1e8 100644
--- a/net/wireless/util.c
+++ b/net/wireless/util.c
@@ -770,8 +770,8 @@ int cfg80211_change_iface(struct cfg80211_registered_device *rdev,
return -EOPNOTSUPP;
/* if it's part of a bridge, reject changing type to station/ibss */
- if (dev->br_port && (ntype == NL80211_IFTYPE_ADHOC ||
- ntype == NL80211_IFTYPE_STATION))
+ if ((dev->priv_flags & IFF_BRIDGE_PORT) &&
+ (ntype == NL80211_IFTYPE_ADHOC || ntype == NL80211_IFTYPE_STATION))
return -EBUSY;
if (ntype != otype) {
--
1.7.0.1
^ permalink raw reply related
* [PATCH net-next-2.6 v2] net: Introduce u64_stats_sync infrastructure
From: Eric Dumazet @ 2010-06-15 13:29 UTC (permalink / raw)
To: Nick Piggin; +Cc: David Miller, netdev, bhutchings
In-Reply-To: <20100615110413.GJ6138@laptop>
Le mardi 15 juin 2010 à 21:04 +1000, Nick Piggin a écrit :
> On Tue, Jun 15, 2010 at 12:43:25PM +0200, Eric Dumazet wrote:
> > I'll submit a v2 patch after my lunch to add all your comments, because
> > all clarifications are indeed very very welcomed !
>
> Thanks!
Here is second version of the patch, with infrastructure only.
I chose to add a new include file, and document API in same file instead
of Documentation/somefile
Once accepted, I'll provide the loopback driver update.
Thanks !
[PATCH net-next-2.6 v2] net: Introduce u64_stats_sync infrastructure
To properly implement 64bits network statistics on 32bit or 64bit hosts,
we provide one new type and four methods, to ease conversions.
Stats producer should use following template granted it already got an
exclusive access to counters (include/linux/u64_stats_sync.h contains
some documentation about details)
u64_stats_update_begin(&stats->syncp);
stats->bytes64 += len;
stats->packets64++;
u64_stats_update_end(&stats->syncp);
While a consumer should use following template to get consistent
snapshot :
u64 tbytes, tpackets;
unsigned int start;
do {
start = u64_stats_fetch_begin(&stats->syncp);
tbytes = stats->bytes64;
tpackets = stats->packets64;
} while (u64_stats_fetch_retry(&stats->lock, syncp));
Suggested by David Miller, and comments courtesy of Nick Piggin.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Nick Piggin <npiggin@suse.de>
---
include/linux/u64_stats_sync.h | 107 +++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)
diff --git a/include/linux/u64_stats_sync.h b/include/linux/u64_stats_sync.h
index e69de29..5a4f318 100644
--- a/include/linux/u64_stats_sync.h
+++ b/include/linux/u64_stats_sync.h
@@ -0,0 +1,107 @@
+#ifndef _LINUX_U64_STATS_SYNC_H
+#define _LINUX_U64_STATS_SYNC_H
+
+/*
+ * To properly implement 64bits network statistics on 32bit and 64bit hosts,
+ * we provide a synchronization point, that is a noop on 64bit or UP kernels.
+ *
+ * Key points :
+ * 1) Use a seqcount on SMP 32bits, with low overhead.
+ * 2) Whole thing is a noop on 64bit arches or UP kernels.
+ * 3) Write side must ensure mutual exclusion or one seqcount update could
+ * be lost, thus blocking readers forever.
+ * If this synchronization point is not a mutex, but a spinlock or
+ * spinlock_bh() or disable_bh() :
+ * 3.1) Write side should not sleep.
+ * 3.2) Write side should not allow preemption.
+ * 3.3) If applicable, interrupts should be disabled.
+ *
+ * 4) If reader fetches several counters, there is no guarantee the whole values
+ * are consistent (remember point 1) : this is a noop on 64bit arches anyway)
+ *
+ * 5) readers are allowed to sleep or be preempted/interrupted : They perform
+ * pure reads. But if they have to fetch many values, it's better to not allow
+ * preemptions/interruptions to avoid many retries.
+ *
+ * Usage :
+ *
+ * Stats producer (writer) should use following template granted it already got
+ * an exclusive access to counters (a lock is already taken, or per cpu
+ * data is used [in a non preemptable context])
+ *
+ * spin_lock_bh(...) or other synchronization to get exclusive access
+ * ...
+ * u64_stats_update_begin(&stats->syncp);
+ * stats->bytes64 += len; // non atomic operation
+ * stats->packets64++; // non atomic operation
+ * u64_stats_update_end(&stats->syncp);
+ *
+ * While a consumer (reader) should use following template to get consistent
+ * snapshot for each variable (but no guarantee on several ones)
+ *
+ * u64 tbytes, tpackets;
+ * unsigned int start;
+ *
+ * do {
+ * start = u64_stats_fetch_begin(&stats->syncp);
+ * tbytes = stats->bytes64; // non atomic operation
+ * tpackets = stats->packets64; // non atomic operation
+ * } while (u64_stats_fetch_retry(&stats->lock, syncp));
+ *
+ *
+ * Example of use in drivers/net/loopback.c, using per_cpu containers,
+ * in BH disabled context.
+ */
+#include <linux/seqlock.h>
+
+#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+struct u64_stats_sync {
+ seqcount_t seq;
+};
+
+static void inline u64_stats_update_begin(struct u64_stats_sync *syncp)
+{
+ write_seqcount_begin(&syncp->seq);
+}
+
+static void inline u64_stats_update_end(struct u64_stats_sync *syncp)
+{
+ write_seqcount_end(&syncp->seq);
+}
+
+static unsigned int inline u64_stats_fetch_begin(const struct u64_stats_sync *syncp)
+{
+ return read_seqcount_begin(&syncp->seq);
+}
+
+static bool inline u64_stats_fetch_retry(const struct u64_stats_sync *syncp,
+ unsigned int start)
+{
+ return read_seqcount_retry(&syncp->seq, start);
+}
+
+#else
+struct u64_stats_sync {
+};
+
+static void inline u64_stats_update_begin(struct u64_stats_sync *syncp)
+{
+}
+
+static void inline u64_stats_update_end(struct u64_stats_sync *syncp)
+{
+}
+
+static unsigned int inline u64_stats_fetch_begin(const struct u64_stats_sync *syncp)
+{
+ return 0;
+}
+
+static bool inline u64_stats_fetch_retry(const struct u64_stats_sync *syncp,
+ unsigned int start)
+{
+ return false;
+}
+#endif
+
+#endif /* _LINUX_U64_STATS_SYNC_H */
^ permalink raw reply related
* [PATCH 1/1] ipvs: Add missing locking during connection table hashing and unhashing
From: kaber @ 2010-06-15 14:48 UTC (permalink / raw)
To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1276613285-4959-1-git-send-email-kaber@trash.net>
From: Sven Wegener <sven.wegener@stealer.net>
The code that hashes and unhashes connections from the connection table
is missing locking of the connection being modified, which opens up a
race condition and results in memory corruption when this race condition
is hit.
Here is what happens in pretty verbose form:
CPU 0 CPU 1
------------ ------------
An active connection is terminated and
we schedule ip_vs_conn_expire() on this
CPU to expire this connection.
IRQ assignment is changed to this CPU,
but the expire timer stays scheduled on
the other CPU.
New connection from same ip:port comes
in right before the timer expires, we
find the inactive connection in our
connection table and get a reference to
it. We proper lock the connection in
tcp_state_transition() and read the
connection flags in set_tcp_state().
ip_vs_conn_expire() gets called, we
unhash the connection from our
connection table and remove the hashed
flag in ip_vs_conn_unhash(), without
proper locking!
While still holding proper locks we
write the connection flags in
set_tcp_state() and this sets the hashed
flag again.
ip_vs_conn_expire() fails to expire the
connection, because the other CPU has
incremented the reference count. We try
to re-insert the connection into our
connection table, but this fails in
ip_vs_conn_hash(), because the hashed
flag has been set by the other CPU. We
re-schedule execution of
ip_vs_conn_expire(). Now this connection
has the hashed flag set, but isn't
actually hashed in our connection table
and has a dangling list_head.
We drop the reference we held on the
connection and schedule the expire timer
for timeouting the connection on this
CPU. Further packets won't be able to
find this connection in our connection
table.
ip_vs_conn_expire() gets called again,
we think it's already hashed, but the
list_head is dangling and while removing
the connection from our connection table
we write to the memory location where
this list_head points to.
The result will probably be a kernel oops at some other point in time.
This race condition is pretty subtle, but it can be triggered remotely.
It needs the IRQ assignment change or another circumstance where packets
coming from the same ip:port for the same service are being processed on
different CPUs. And it involves hitting the exact time at which
ip_vs_conn_expire() gets called. It can be avoided by making sure that
all packets from one connection are always processed on the same CPU and
can be made harder to exploit by changing the connection timeouts to
some custom values.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: stable@kernel.org
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
net/netfilter/ipvs/ip_vs_conn.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index d8f7e8e..ff04e9e 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -162,6 +162,7 @@ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
hash = ip_vs_conn_hashkey(cp->af, cp->protocol, &cp->caddr, cp->cport);
ct_write_lock(hash);
+ spin_lock(&cp->lock);
if (!(cp->flags & IP_VS_CONN_F_HASHED)) {
list_add(&cp->c_list, &ip_vs_conn_tab[hash]);
@@ -174,6 +175,7 @@ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
ret = 0;
}
+ spin_unlock(&cp->lock);
ct_write_unlock(hash);
return ret;
@@ -193,6 +195,7 @@ static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp)
hash = ip_vs_conn_hashkey(cp->af, cp->protocol, &cp->caddr, cp->cport);
ct_write_lock(hash);
+ spin_lock(&cp->lock);
if (cp->flags & IP_VS_CONN_F_HASHED) {
list_del(&cp->c_list);
@@ -202,6 +205,7 @@ static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp)
} else
ret = 0;
+ spin_unlock(&cp->lock);
ct_write_unlock(hash);
return ret;
--
1.7.0.4
^ permalink raw reply related
* [PATCH 0/1] netfilter: netfilter fixes
From: kaber @ 2010-06-15 14:48 UTC (permalink / raw)
To: davem; +Cc: netfilter-devel, netdev
Hi Dave,
following is a single netfilter fix for IPVS from Sven Wegener, fixing a race
condition that might result in an oops.
Please apply or pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6.git master
Thanks!
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox