* Re: CODEL et al.
From: David Miller @ 2012-05-22 21:49 UTC (permalink / raw)
To: shemminger; +Cc: netdev
In-Reply-To: <20120522142800.0fcd378e@nehalam.linuxnetplumber.net>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 22 May 2012 14:28:00 -0700
> Since codel is now merged, just finished merging all the outstanding
> iproute2 patches for 3.5.
Thanks a lot Stephen.
^ permalink raw reply
* [PATCH 1/2] iproute2: Improve list add.
From: Pravin B Shelar @ 2012-05-22 22:37 UTC (permalink / raw)
To: shemminger, netdev; +Cc: jpettit, jesse, Pravin B Shelar
ip command reads entire list of devices on every flush command.
While adding device record to list is does list traversal O(n).
This is time consuming for large batch commands.
Following patch improves list add operation to O(1).
Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
---
ip/ipaddress.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 9ab65ec..7080c41 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -714,6 +714,7 @@ int print_addrinfo_secondary(const struct sockaddr_nl *who, struct nlmsghdr *n,
struct nlmsg_list
{
struct nlmsg_list *next;
+ struct nlmsg_list *last;
struct nlmsghdr h;
};
@@ -744,17 +745,24 @@ static int store_nlmsg(const struct sockaddr_nl *who, struct nlmsghdr *n,
{
struct nlmsg_list **linfo = (struct nlmsg_list**)arg;
struct nlmsg_list *h;
- struct nlmsg_list **lp;
- h = malloc(n->nlmsg_len+sizeof(void*));
+ h = malloc(n->nlmsg_len+sizeof(void*)+sizeof(void*));
if (h == NULL)
return -1;
memcpy(&h->h, n, n->nlmsg_len);
h->next = NULL;
- for (lp = linfo; *lp; lp = &(*lp)->next) /* NOTHING */;
- *lp = h;
+ if (!*linfo) {
+ /* First element. */
+ h->last = h;
+ *linfo = h;
+ } else {
+ struct nlmsg_list *last = (*linfo)->last;
+
+ last->next = h;
+ (*linfo)->last = h;
+ }
ll_remember_index(who, n, NULL);
return 0;
--
1.7.10
^ permalink raw reply related
* [PATCH 2/2] iproute2: Fix memory hog of ip batched command.
From: Pravin B Shelar @ 2012-05-22 22:37 UTC (permalink / raw)
To: shemminger, netdev; +Cc: jpettit, jesse, Pravin B Shelar
ipaddr_list_or_flush() builds list of all device at start of
every flush or list operation, but does not free memory at end.
This can hog lot of memory for large batched command.
Following patch fixes it.
Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
---
ip/ipaddress.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 7080c41..09444e6 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -775,6 +775,8 @@ static int ipaddr_list_or_flush(int argc, char **argv, int flush)
struct nlmsg_list *l, *n;
char *filter_dev = NULL;
int no_link = 0;
+ int rc = 0;
+ int print_info = 0;
ipaddr_reset_filter(oneline);
filter.showqueue = 1;
@@ -877,7 +879,8 @@ static int ipaddr_list_or_flush(int argc, char **argv, int flush)
filter.ifindex = ll_name_to_index(filter_dev);
if (filter.ifindex <= 0) {
fprintf(stderr, "Device \"%s\" does not exist.\n", filter_dev);
- return -1;
+ rc = -1;
+ goto out;
}
}
@@ -922,11 +925,14 @@ flush_done:
printf("*** Flush is complete after %d round%s ***\n", round, round>1?"s":"");
}
fflush(stdout);
- return 0;
+ rc = 0;
+ goto out;
}
round++;
- if (flush_update() < 0)
- return 1;
+ if (flush_update() < 0) {
+ rc = 1;
+ goto out;
+ }
if (show_stats) {
printf("\n*** Round %d, deleting %d addresses ***\n", round, filter.flushed);
@@ -943,7 +949,8 @@ flush_done:
}
fprintf(stderr, "*** Flush remains incomplete after %d rounds. ***\n", max_flush_loops);
fflush(stderr);
- return 1;
+ rc = 1;
+ goto out;
}
if (filter.family != AF_PACKET) {
@@ -1017,19 +1024,21 @@ flush_done:
lp = &l->next;
}
}
-
+ print_info = 1;
+out:
for (l=linfo; l; l = n) {
n = l->next;
- if (no_link || print_linkinfo(NULL, &l->h, stdout) == 0) {
+ if (print_info &&
+ (no_link || print_linkinfo(NULL, &l->h, stdout) == 0)) {
struct ifinfomsg *ifi = NLMSG_DATA(&l->h);
if (filter.family != AF_PACKET)
print_selected_addrinfo(ifi->ifi_index, ainfo, stdout);
+ fflush(stdout);
}
- fflush(stdout);
free(l);
}
- return 0;
+ return rc;
}
int ipaddr_list_link(int argc, char **argv)
--
1.7.10
^ permalink raw reply related
* Re: [PATCH v6 2/2] decrement static keys on real destroy time
From: Andrew Morton @ 2012-05-22 22:46 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
netdev-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan,
Johannes Weiner, Michal Hocko, David Miller
In-Reply-To: <1337682339-21282-3-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
(cc davem)
On Tue, 22 May 2012 14:25:39 +0400
Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> We call the destroy function when a cgroup starts to be removed,
> such as by a rmdir event.
>
> However, because of our reference counters, some objects are still
> inflight. Right now, we are decrementing the static_keys at destroy()
> time, meaning that if we get rid of the last static_key reference,
> some objects will still have charges, but the code to properly
> uncharge them won't be run.
>
> This becomes a problem specially if it is ever enabled again, because
> now new charges will be added to the staled charges making keeping
> it pretty much impossible.
>
> We just need to be careful with the static branch activation:
> since there is no particular preferred order of their activation,
> we need to make sure that we only start using it after all
> call sites are active. This is achieved by having a per-memcg
> flag that is only updated after static_key_slow_inc() returns.
> At this time, we are sure all sites are active.
>
> This is made per-memcg, not global, for a reason:
> it also has the effect of making socket accounting more
> consistent. The first memcg to be limited will trigger static_key()
> activation, therefore, accounting. But all the others will then be
> accounted no matter what. After this patch, only limited memcgs
> will have its sockets accounted.
>
> [v2: changed a tcp limited flag for a generic proto limited flag ]
> [v3: update the current active flag only after the static_key update ]
> [v4: disarm_static_keys() inside free_work ]
> [v5: got rid of tcp_limit_mutex, now in the static_key interface ]
> [v6: changed active and activated to a flags field, as suggested by akpm ]
A few things...
> include/linux/memcontrol.h | 5 +++++
> include/net/sock.h | 11 +++++++++++
> mm/memcontrol.c | 29 +++++++++++++++++++++++++++--
> net/ipv4/tcp_memcontrol.c | 34 +++++++++++++++++++++++++++-------
> 4 files changed, 70 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f94efd2..9dc0b86 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -436,6 +436,11 @@ enum {
> OVER_LIMIT,
> };
>
> +enum sock_flag_bits {
> + MEMCG_SOCK_ACTIVE,
> + MEMCG_SOCK_ACTIVATED,
> +};
I don't see why this was defined in memcontrol.h. It is enumerating
the bits in sock.h's cg_proto.flags, so why not define it in sock.h?
This is changed in the appended patch.
Also, in the v5 patch these flags were documented, as they should be.
Version 6 forgot to do this. This is changed in the appended patch.
And version 6 doesn't describe what sock_flag_bits actually does. It
should. This is changed in the appended patch.
And the name seems inappropriate to me. Should it not be enum
cg_proto_flag_bits? Or, probably better, cg_proto_flags? This I did
*not* change.
> struct sock;
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> void sock_update_memcg(struct sock *sk);
> diff --git a/include/net/sock.h b/include/net/sock.h
> index b3ebe6b..1742db7 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -913,6 +913,7 @@ struct cg_proto {
> struct percpu_counter *sockets_allocated; /* Current number of sockets. */
> int *memory_pressure;
> long *sysctl_mem;
> + unsigned long flags;
> /*
> * memcg field is used to find which memcg we belong directly
> * Each memcg struct can hold more than one cg_proto, so container_of
> @@ -928,6 +929,16 @@ struct cg_proto {
> extern int proto_register(struct proto *prot, int alloc_slab);
> extern void proto_unregister(struct proto *prot);
>
> +static inline bool memcg_proto_active(struct cg_proto *cg_proto)
> +{
> + return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVE);
> +}
> +
> +static inline bool memcg_proto_activated(struct cg_proto *cg_proto)
> +{
> + return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVATED);
> +}
Here, we're open-coding kinda-test_bit(). Why do that? These flags are
modified with set_bit() and friends, so we should read them with the
matching test_bit()?
Also, these bool-returning functions will return values other than 0
and 1. That probably works OK and I don't know what the C standards
and implementations do about this. But it seems unclean and slightly
risky to have a "bool" value of 32! Converting these functions to use
test_bit() fixes this - test_bit() returns only 0 or 1.
test_bit() is slightly more expensive than the above. If this is
considered to be an issue then I guess we could continue to use this
approach. But I do think a code comment is needed, explaining and
justifying the unusual decision to bypass the bitops API. Also these
functions should tell the truth and return an "int" type.
> #ifdef SOCK_REFCNT_DEBUG
> static inline void sk_refcnt_debug_inc(struct sock *sk)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0b4b4c8..22434bf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -404,6 +404,7 @@ void sock_update_memcg(struct sock *sk)
> {
> if (mem_cgroup_sockets_enabled) {
> struct mem_cgroup *memcg;
> + struct cg_proto *cg_proto;
>
> BUG_ON(!sk->sk_prot->proto_cgroup);
>
> @@ -423,9 +424,10 @@ void sock_update_memcg(struct sock *sk)
>
> rcu_read_lock();
> memcg = mem_cgroup_from_task(current);
> - if (!mem_cgroup_is_root(memcg)) {
> + cg_proto = sk->sk_prot->proto_cgroup(memcg);
> + if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) {
> mem_cgroup_get(memcg);
> - sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg);
> + sk->sk_cgrp = cg_proto;
> }
> rcu_read_unlock();
> }
> @@ -451,9 +453,25 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
> return &memcg->tcp_mem.cg_proto;
> }
> EXPORT_SYMBOL(tcp_proto_cgroup);
> +
> +static void disarm_sock_keys(struct mem_cgroup *memcg)
> +{
> + if (!memcg_proto_activated(&memcg->tcp_mem.cg_proto))
> + return;
> + static_key_slow_dec(&memcg_socket_limit_enabled);
> +}
> +#else
> +static void disarm_sock_keys(struct mem_cgroup *memcg)
> +{
> +}
> #endif /* CONFIG_INET */
> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>
> +static void disarm_static_keys(struct mem_cgroup *memcg)
> +{
> + disarm_sock_keys(memcg);
> +}
Why does this function exist? Its single caller could call
disarm_sock_keys() directly.
> static void drain_all_stock_async(struct mem_cgroup *memcg);
>
> static struct mem_cgroup_per_zone *
> @@ -4836,6 +4854,13 @@ static void free_work(struct work_struct *work)
> int size = sizeof(struct mem_cgroup);
>
> memcg = container_of(work, struct mem_cgroup, work_freeing);
> + /*
> + * We need to make sure that (at least for now), the jump label
> + * destruction code runs outside of the cgroup lock.
This is a poor comment - it failed to tell the reader *why* that code
must run outside the cgroup lock.
> schedule_work()
> + * will guarantee this happens. Be careful if you need to move this
> + * disarm_static_keys around
It's a bit difficult for the reader to be careful when he isn't told
what the risks are.
> + */
> + disarm_static_keys(memcg);
> if (size < PAGE_SIZE)
> kfree(memcg);
> else
> diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
> index 1517037..3b8fa25 100644
> --- a/net/ipv4/tcp_memcontrol.c
> +++ b/net/ipv4/tcp_memcontrol.c
> @@ -74,9 +74,6 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
> percpu_counter_destroy(&tcp->tcp_sockets_allocated);
>
> val = res_counter_read_u64(&tcp->tcp_memory_allocated, RES_LIMIT);
> -
> - if (val != RESOURCE_MAX)
> - static_key_slow_dec(&memcg_socket_limit_enabled);
> }
> EXPORT_SYMBOL(tcp_destroy_cgroup);
>
> @@ -107,10 +104,33 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
> tcp->tcp_prot_mem[i] = min_t(long, val >> PAGE_SHIFT,
> net->ipv4.sysctl_tcp_mem[i]);
>
> - if (val == RESOURCE_MAX && old_lim != RESOURCE_MAX)
> - static_key_slow_dec(&memcg_socket_limit_enabled);
> - else if (old_lim == RESOURCE_MAX && val != RESOURCE_MAX)
> - static_key_slow_inc(&memcg_socket_limit_enabled);
> + if (val == RESOURCE_MAX)
> + clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> + else if (val != RESOURCE_MAX) {
> + /*
> + * The active bit needs to be written after the static_key update.
> + * This is what guarantees that the socket activation function
> + * is the last one to run. See sock_update_memcg() for details,
> + * and note that we don't mark any socket as belonging to this
> + * memcg until that flag is up.
> + *
> + * We need to do this, because static_keys will span multiple
> + * sites, but we can't control their order. If we mark a socket
> + * as accounted, but the accounting functions are not patched in
> + * yet, we'll lose accounting.
> + *
> + * We never race with the readers in sock_update_memcg(), because
> + * when this value change, the code to process it is not patched in
> + * yet.
> + *
> + * The activated bit is used to guarantee that no two writers will
> + * do the update in the same memcg. Without that, we can't properly
> + * shutdown the static key.
> + */
This comment needlessly overflows 80 cols and has a pointless and
unconventional double-space indenting. I already provided a patch
which fixes this and a few other things, but that was ignored when you
did the v6.
> + if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
> + static_key_slow_inc(&memcg_socket_limit_enabled);
> + set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> + }
So here are suggested changes from *some* of the above discussion.
Please consider, incorporate, retest and send us a v7?
From: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Subject: memcg-decrement-static-keys-at-real-destroy-time-v6-fix
- move enum sock_flag_bits into sock.h
- document enum sock_flag_bits
- convert memcg_proto_active() and memcg_proto_activated() to test_bit()
- redo tcp_update_limit() comment to 80 cols
Cc: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
include/linux/memcontrol.h | 5 -----
include/net/sock.h | 15 +++++++++++++--
net/ipv4/tcp_memcontrol.c | 30 +++++++++++++++---------------
3 files changed, 28 insertions(+), 22 deletions(-)
diff -puN include/linux/memcontrol.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix
+++ a/include/linux/memcontrol.h
@@ -405,11 +405,6 @@ enum {
OVER_LIMIT,
};
-enum sock_flag_bits {
- MEMCG_SOCK_ACTIVE,
- MEMCG_SOCK_ACTIVATED,
-};
-
struct sock;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
void sock_update_memcg(struct sock *sk);
diff -puN include/net/sock.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix include/net/sock.h
--- a/include/net/sock.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix
+++ a/include/net/sock.h
@@ -46,6 +46,7 @@
#include <linux/list_nulls.h>
#include <linux/timer.h>
#include <linux/cache.h>
+#include <linux/bitops.h>
#include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
@@ -921,6 +922,16 @@ struct proto {
#endif
};
+/*
+ * Bits in struct cg_proto.flags
+ */
+enum sock_flag_bits {
+ /* Currently active and new sockets should be assigned to cgroups */
+ MEMCG_SOCK_ACTIVE,
+ /* It was ever activated; we must disarm static keys on destruction */
+ MEMCG_SOCK_ACTIVATED,
+};
+
struct cg_proto {
void (*enter_memory_pressure)(struct sock *sk);
struct res_counter *memory_allocated; /* Current allocated memory. */
@@ -945,12 +956,12 @@ extern void proto_unregister(struct prot
static inline bool memcg_proto_active(struct cg_proto *cg_proto)
{
- return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVE);
+ return test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
}
static inline bool memcg_proto_activated(struct cg_proto *cg_proto)
{
- return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVATED);
+ return test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags);
}
#ifdef SOCK_REFCNT_DEBUG
diff -puN net/ipv4/tcp_memcontrol.c~memcg-decrement-static-keys-at-real-destroy-time-v6-fix net/ipv4/tcp_memcontrol.c
--- a/net/ipv4/tcp_memcontrol.c~memcg-decrement-static-keys-at-real-destroy-time-v6-fix
+++ a/net/ipv4/tcp_memcontrol.c
@@ -108,24 +108,24 @@ static int tcp_update_limit(struct mem_c
clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
else if (val != RESOURCE_MAX) {
/*
- * The active bit needs to be written after the static_key update.
- * This is what guarantees that the socket activation function
- * is the last one to run. See sock_update_memcg() for details,
- * and note that we don't mark any socket as belonging to this
- * memcg until that flag is up.
+ * The active bit needs to be written after the static_key
+ * update. This is what guarantees that the socket activation
+ * function is the last one to run. See sock_update_memcg() for
+ * details, and note that we don't mark any socket as belonging
+ * to this memcg until that flag is up.
*
- * We need to do this, because static_keys will span multiple
- * sites, but we can't control their order. If we mark a socket
- * as accounted, but the accounting functions are not patched in
- * yet, we'll lose accounting.
+ * We need to do this, because static_keys will span multiple
+ * sites, but we can't control their order. If we mark a socket
+ * as accounted, but the accounting functions are not patched in
+ * yet, we'll lose accounting.
*
- * We never race with the readers in sock_update_memcg(), because
- * when this value change, the code to process it is not patched in
- * yet.
+ * We never race with the readers in sock_update_memcg(),
+ * because when this value change, the code to process it is not
+ * patched in yet.
*
- * The activated bit is used to guarantee that no two writers will
- * do the update in the same memcg. Without that, we can't properly
- * shutdown the static key.
+ * The activated bit is used to guarantee that no two writers
+ * will do the update in the same memcg. Without that, we can't
+ * properly shutdown the static key.
*/
if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
static_key_slow_inc(&memcg_socket_limit_enabled);
_
^ permalink raw reply
* Re: [PATCH v6 2/2] decrement static keys on real destroy time
From: Andrew Morton @ 2012-05-22 23:11 UTC (permalink / raw)
To: Glauber Costa, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
cgroups-u79uwXL29TY76Z2rM5mHXA, devel-GEFAQzZX7r8dnm+yROfE0A,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
netdev-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan,
Johannes Weiner, Michal Hocko, David Miller, Joe Perches
In-Reply-To: <20120522154610.f2f9b78e.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
On Tue, 22 May 2012 15:46:10 -0700
Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > +static inline bool memcg_proto_active(struct cg_proto *cg_proto)
> > +{
> > + return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVE);
> > +}
> > +
> > +static inline bool memcg_proto_activated(struct cg_proto *cg_proto)
> > +{
> > + return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVATED);
> > +}
>
> Here, we're open-coding kinda-test_bit(). Why do that? These flags are
> modified with set_bit() and friends, so we should read them with the
> matching test_bit()?
>
> Also, these bool-returning functions will return values other than 0
> and 1. That probably works OK and I don't know what the C standards
> and implementations do about this. But it seems unclean and slightly
> risky to have a "bool" value of 32! Converting these functions to use
> test_bit() fixes this - test_bit() returns only 0 or 1.
>
> test_bit() is slightly more expensive than the above. If this is
> considered to be an issue then I guess we could continue to use this
> approach. But I do think a code comment is needed, explaining and
> justifying the unusual decision to bypass the bitops API. Also these
> functions should tell the truth and return an "int" type.
Joe corrected (and informed) me:
: 6.3.1.2p1:
:
: "When any scalar value is converted to _Bool, the result is 0
: if the value compares equal to 0; otherwise, the result is 1."
So the above functions will be given compiler-generated scalar-to-boolean
conversion.
test_bit() already does internal scalar-to-boolean conversion. The
compiler doesn't know that, so if we convert the above functions to use
test_bit(), we'll end up performing scalar-to-boolean-to-boolean
conversion, which is dumb.
I assume that a way of fixing this is to change test_bit() to return
bool type. That's a bit scary.
A less scary way would be to add a new
bool test_bit_bool(int nr, const unsigned long *addr);
which internally calls test_bit() but somehow avoids the
compiler-generated conversion of the test_bit() return value into a
bool. I haven't actually thought of a way of doing this ;)
^ permalink raw reply
* Re: [PATCH 1/3] ethtool: Split out printing of hex data
From: Ben Hutchings @ 2012-05-22 23:58 UTC (permalink / raw)
To: Stuart Hodgson; +Cc: netdev, Yaniv Rosner, David Miller, Eilon Greenstein
In-Reply-To: <4FB66390.2010704@solarflare.com>
On Fri, 2012-05-18 at 15:58 +0100, Stuart Hodgson wrote:
> Split out printing of hex data to common function from
> dump_regs and dump_eeprom. Ready for use by module
> eeprom dumping.
>
> Signed-off-by: Stuart Hodgson <smhodgson@solarflare.com>
> ---
> ethtool.c | 35 ++++++++++++++++++-----------------
> 1 files changed, 18 insertions(+), 17 deletions(-)
>
> diff --git a/ethtool.c b/ethtool.c
> index e80b38b..fdc21de 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -787,6 +787,20 @@ static const struct {
> { "st_gmac", st_gmac_dump_regs },
> };
>
> +static void dump_hex(__u8 *data, int len, int offset)
> +{
> + int i;
> +
> + fprintf(stdout, "Offset\tValues\n");
> + fprintf(stdout, "--------\t-----");
> + for (i = 0; i < len; i++) {
> + if (i%16 == 0)
> + fprintf(stdout, "\n0x%04x:\t", i + offset);
> + fprintf(stdout, " %02x", data[i]);
> + }
The problem with this (inherited from dump_regs()) is that the columns
don't line up. So I've applied this but changed dump_hex() to look more
like the code removed from dump_eeprom().
Ben.
> + fprintf(stdout, "\n\n");
> +}
> +
> static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
> const char *gregs_dump_file,
> struct ethtool_drvinfo *info, struct ethtool_regs *regs)
> @@ -820,22 +834,14 @@ static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
> ETHTOOL_BUSINFO_LEN))
> return driver_list[i].func(info, regs);
>
> - fprintf(stdout, "Offset\tValues\n");
> - fprintf(stdout, "--------\t-----");
> - for (i = 0; i < regs->len; i++) {
> - if (i%16 == 0)
> - fprintf(stdout, "\n%03x:\t", i);
> - fprintf(stdout, " %02x", regs->data[i]);
> - }
> - fprintf(stdout, "\n\n");
> + dump_hex(regs->data, regs->len, 0);
> +
> return 0;
> }
>
> static int dump_eeprom(int geeprom_dump_raw, struct ethtool_drvinfo *info,
> struct ethtool_eeprom *ee)
> {
> - int i;
> -
> if (geeprom_dump_raw) {
> fwrite(ee->data, 1, ee->len, stdout);
> return 0;
> @@ -847,13 +853,8 @@ static int dump_eeprom(int geeprom_dump_raw, struct ethtool_drvinfo *info,
> return tg3_dump_eeprom(info, ee);
> }
>
> - fprintf(stdout, "Offset\t\tValues\n");
> - fprintf(stdout, "------\t\t------");
> - for (i = 0; i < ee->len; i++) {
> - if(!(i%16)) fprintf(stdout, "\n0x%04x\t\t", i + ee->offset);
> - fprintf(stdout, "%02x ", ee->data[i]);
> - }
> - fprintf(stdout, "\n");
> + dump_hex(ee->data, ee->len, ee->offset);
> +
> return 0;
> }
>
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH iproute2] tc_codel: Controlled Delay AQM
From: Dave Taht @ 2012-05-22 23:59 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Eric Dumazet, netdev, Dave Taht
In-Reply-To: <20120522141610.601f68ed@nehalam.linuxnetplumber.net>
I promised eric I'd write the two man pages.
I also promised myself a vacation. The last year has been a long hard
road. If someone
hasn't got to writing the man pages by the time I get back from lupin,
I'll write it.
On Tue, May 22, 2012 at 2:16 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Fri, 11 May 2012 08:22:35 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> From: Eric Dumazet <edumazet@google.com>
>>
>> An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.
>>
>> http://queue.acm.org/detail.cfm?id=2209336
>>
>> This AQM main input is no longer queue size in bytes or packets, but the
>> delay packets stay in (FIFO) queue.
>>
>> As we don't have infinite memory, we still can drop packets in enqueue()
>> in case of massive load, but mean of CoDel is to drop packets in
>> dequeue(), using a control law based on two simple parameters :
>>
>> target : target sojourn time (default 5ms)
>> interval : width of moving time window (default 100ms)
>>
>> Selected packets are dropped, unless ECN is enabled and packets can get
>> ECN mark instead.
>>
>> Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
>> [ interval TIME ] [ ecn ]
>>
>> qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
>> Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
>> rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
>> count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
>> maxpacket 1514 ecn_mark 84399 drop_overlimit 0
>>
>> CoDel must be seen as a base module, and should be used keeping in mind
>> there is still a FIFO queue. So a typical setup will probably need a
>> hierarchy of several qdiscs and packet classifiers to be able to meet
>> whatever constraints a user might have.
>>
>> One possible example would be to use fq_codel, which combines Fair
>> Queueing and CoDel, in replacement of sfq / sfq_red.
>>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
>> ---
>> Notes :
>> 1) : Dave Taht will send a nice man-page for this stuff.
>> 2) : the TCA_NETEM_ECN bit is because of include/linux/pkt_sched.h sync
>> with net-next
>> (I'll send a separate patch for netem)
>>
>
> Applied. Used 3.5 sanitized header (not the one in your patch),
> and fixed whitespace error.
>
>
> Ok, where's the man page :-)
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://www.bufferbloat.net
^ permalink raw reply
* Re: [PATCH 3/3] ethtool: Addition of -m option to dump module eeprom
From: Ben Hutchings @ 2012-05-23 0:01 UTC (permalink / raw)
To: Stuart Hodgson; +Cc: netdev, David Miller, Yaniv Rosner, Eilon Greenstein
In-Reply-To: <4FB663A5.3060107@solarflare.com>
On Fri, 2012-05-18 at 15:58 +0100, Stuart Hodgson wrote:
> The -m option now allows for retrieval of EEPROM
> information form a plug in module such as SFP+. This
> shows specific information about the type and
> capabilities of the module in use The format can be
> easily extended to support other modules types such as
> QSFP in future. Raw data dump is also supported.
[...]
Applied, thanks.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* NETDEV WATCHDOG: %s (%s): transmit queue %u timed out
From: George Spelvin @ 2012-05-23 1:03 UTC (permalink / raw)
To: netdev; +Cc: linux
After installing a 3.4 kernel, I got the following:
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0xe9/0x15c()
Hardware name: MS-7376
NETDEV WATCHDOG: inside (r8169): transmit queue 0 timed out
Pid: 0, comm: swapper/3 Not tainted 3.4.0-00017-g3df9c78 #152
Call Trace:
<IRQ> [<ffffffff81311aba>] ? dev_watchdog+0xe9/0x15c
[<ffffffff81024499>] ? warn_slowpath_common+0x71/0x85
[<ffffffff813119d1>] ? netif_tx_lock+0x7a/0x7a
[<ffffffff81024511>] ? warn_slowpath_fmt+0x45/0x4a
[<ffffffff813119be>] ? netif_tx_lock+0x67/0x7a
[<ffffffff81311aba>] ? dev_watchdog+0xe9/0x15c
[<ffffffff810345ab>] ? __queue_work+0x20a/0x20a
[<ffffffff8102c908>] ? run_timer_softirq+0x17e/0x20b
[<ffffffff81028889>] ? __do_softirq+0x80/0x102
[<ffffffff81404b8c>] ? call_softirq+0x1c/0x30
[<ffffffff81003044>] ? do_softirq+0x2c/0x60
[<ffffffff81028abc>] ? irq_exit+0x3a/0x91
[<ffffffff81002e91>] ? do_IRQ+0x81/0x97
[<ffffffff81403327>] ? common_interrupt+0x67/0x67
<EOI> [<ffffffff810079d8>] ? default_idle+0x1e/0x32
[<ffffffff81007afc>] ? amd_e400_idle+0xb7/0xd4
[<ffffffff810081a5>] ? cpu_idle+0x58/0x98
---[ end trace 7d5a7d21f604b0d8 ]---
r8169 0000:02:00.0: inside: link up
(The 17 local patches are to the PPS subsystem and nowhere near net net/
directory.)
The system is an AND Phenom X4 processor with 8G of ECC RAM, recently rebooted with
lspci -nn gives:
00:00.0 Host bridge [0600]: Advanced Micro Devices [AMD] nee ATI RD790 Northbridge only dual slot PCI-e_GFX and HT3 K8 part [1002:5956]
00:02.0 PCI bridge [0604]: Advanced Micro Devices [AMD] nee ATI RD790 PCI to PCI bridge (external gfx0 port A) [1002:5978]
00:05.0 PCI bridge [0604]: Advanced Micro Devices [AMD] nee ATI RD790 PCI to PCI bridge (PCI express gpp port B) [1002:597b]
00:09.0 PCI bridge [0604]: Advanced Micro Devices [AMD] nee ATI RD790 PCI to PCI bridge (PCI express gpp port E) [1002:597e]
00:12.0 SATA controller [0106]: Advanced Micro Devices [AMD] nee ATI SB600 Non-Raid-5 SATA [1002:4380]
00:13.0 USB controller [0c03]: Advanced Micro Devices [AMD] nee ATI SB600 USB (OHCI0) [1002:4387]
00:13.1 USB controller [0c03]: Advanced Micro Devices [AMD] nee ATI SB600 USB (OHCI1) [1002:4388]
00:13.2 USB controller [0c03]: Advanced Micro Devices [AMD] nee ATI SB600 USB (OHCI2) [1002:4389]
00:13.3 USB controller [0c03]: Advanced Micro Devices [AMD] nee ATI SB600 USB (OHCI3) [1002:438a]
00:13.4 USB controller [0c03]: Advanced Micro Devices [AMD] nee ATI SB600 USB (OHCI4) [1002:438b]
00:13.5 USB controller [0c03]: Advanced Micro Devices [AMD] nee ATI SB600 USB Controller (EHCI) [1002:4386]
00:14.0 SMBus [0c05]: Advanced Micro Devices [AMD] nee ATI SBx00 SMBus Controller [1002:4385] (rev 14)
00:14.1 IDE interface [0101]: Advanced Micro Devices [AMD] nee ATI SB600 IDE [1002:438c]
00:14.2 Audio device [0403]: Advanced Micro Devices [AMD] nee ATI SBx00 Azalia (Intel HDA) [1002:4383]
00:14.3 ISA bridge [0601]: Advanced Micro Devices [AMD] nee ATI SB600 PCI to LPC Bridge [1002:438d]
00:14.4 PCI bridge [0604]: Advanced Micro Devices [AMD] nee ATI SBx00 PCI to PCI Bridge [1002:4384]
00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration [1022:1200]
00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Address Map [1022:1201]
00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller [1022:1202]
00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control [1022:1203]
00:18.4 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Link Control [1022:1204]
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices [AMD] nee ATI RV370 5B60 [Radeon X300 (PCIE)] [1002:5b60]
01:00.1 Display controller [0380]: Advanced Micro Devices [AMD] nee ATI RV370 [Radeon X300SE] [1002:5b70]
02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller [10ec:8168] (rev 01)
03:00.0 RAID bus controller [0104]: Promise Technology, Inc. PDC42819 [FastTrak TX2650/TX4650] [105a:3f20]
04:00.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller [1106:3044] (rev c0)
04:02.0 Serial controller [0700]: Oxford Semiconductor Ltd OX16PCI954 (Quad 16950 UART) function 0 (Uart) [1415:9501]
04:02.1 Parallel controller [0701]: Oxford Semiconductor Ltd OX16PCI954 (Quad 16950 UART) function 1 (parallel port) [1415:9513]
04:03.0 PCI bridge [0604]: Digital Equipment Corporation DECchip 21152 [1011:0024] (rev 03)
05:04.0 Ethernet controller [0200]: Digital Equipment Corporation DECchip 21142/43 [1011:0019] (rev 41)
05:05.0 Ethernet controller [0200]: Digital Equipment Corporation DECchip 21142/43 [1011:0019] (rev 41)
05:06.0 Ethernet controller [0200]: Digital Equipment Corporation DECchip 21142/43 [1011:0019] (rev 41)
05:07.0 Ethernet controller [0200]: Digital Equipment Corporation DECchip 21142/43 [1011:0019] (rev 41)
The Tulip 100baseT ports are used for external and DMZ ports, while the RTL8169 is the main "inside"
gigabit network port.
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:02:00.0: irq 40 for MSI/MSI-X
r8169 0000:02:00.0: eth4: RTL8168b/8111b at 0xffffc90000020000, 00:21:85:16:51:7f, XID 18000000 IRQ 40
r8169 0000:02:00.0: eth4: jumbo features [frames: 4080 bytes, tx checksumming: ko]
It doesn't seem to be causing serious problems, but I presume the warning is there because
someone would like to know about it.
^ permalink raw reply
* Re: NETDEV WATCHDOG: %s (%s): transmit queue %u timed out
From: Dave Jones @ 2012-05-23 1:26 UTC (permalink / raw)
To: George Spelvin; +Cc: netdev, Fedora Kernel Team
In-Reply-To: <20120523010334.5778.qmail@science.horizon.com>
On Tue, May 22, 2012 at 09:03:34PM -0400, George Spelvin wrote:
> After installing a 3.4 kernel, I got the following:
>
> ------------[ cut here ]------------
> WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0xe9/0x15c()
> Hardware name: MS-7376
> NETDEV WATCHDOG: inside (r8169): transmit queue 0 timed out
> Pid: 0, comm: swapper/3 Not tainted 3.4.0-00017-g3df9c78 #152
> ..
> It doesn't seem to be causing serious problems, but I presume the warning is there because
> someone would like to know about it.
We seem to be seeing more and more of these (across a range of different NICs)
https://bugzilla.redhat.com/showdependencytree.cgi?id=702723&hide_resolved=1
has a bunch of duplicated bugs sorted into per-nic reports.
Dave
^ permalink raw reply
* Re: [RFC/PATCH] Bluetooth: prevent double l2cap_chan_destroy
From: Minho Ban @ 2012-05-23 1:39 UTC (permalink / raw)
To: Chanyeol Park
Cc: Gustavo Padovan, Marcel Holtmann, Johan Hedberg, David S. Miller,
linux-bluetooth, netdev, linux-kernel
In-Reply-To: <4FBB8544.3040408@samsung.com>
On 05/22/2012 09:23 PM, Chanyeol Park wrote:
> Hi
>
> On 2012년 05월 22일 11:50, Minho Ban wrote:
>> @@ -1343,10 +1343,10 @@ static void l2cap_conn_del(struct hci_conn *hcon, int err)
>> l2cap_chan_lock(chan);
>>
>> l2cap_chan_del(chan, err);
>> + chan->ops->close(chan->data);
>>
>> l2cap_chan_unlock(chan);
>>
>> - chan->ops->close(chan->data);
>> l2cap_chan_put(chan);
>> }
> I think this patch does not make sense Because inside chan->ops->close() "chan" could be freed in the l2cap_chan_destroy().
>
I agree, thanks for pointing out.
^ permalink raw reply
* Re: NETDEV WATCHDOG: %s (%s): transmit queue %u timed out
From: George Spelvin @ 2012-05-23 1:53 UTC (permalink / raw)
To: davej, linux; +Cc: kernel-team, netdev
In-Reply-To: <20120523012640.GB15255@redhat.com>
> We seem to be seeing more and more of these (across a range of different NICs)
> https://bugzilla.redhat.com/showdependencytree.cgi?id=702723&hide_resolved=1
> has a bunch of duplicated bugs sorted into per-nic reports.
Thank yoou for rhe response.
Unfortunately, the last reboot was on Feb. 26, and the kernel logs don't go back that far,
so I'm not sure if 3.3-rc5 reported this priblem or not.
^ permalink raw reply
* Re: [PATCH iproute2] tc_codel: Controlled Delay AQM
From: Vijay Subramanian @ 2012-05-23 2:05 UTC (permalink / raw)
To: Dave Taht; +Cc: Stephen Hemminger, Eric Dumazet, netdev, Dave Taht
In-Reply-To: <CAA93jw5oOz38UHsp2UODJcbn9uDz34RfPSNCE-qE-7BKoshLtA@mail.gmail.com>
On 22 May 2012 16:59, Dave Taht <dave.taht@gmail.com> wrote:
> I promised eric I'd write the two man pages.
>
> I also promised myself a vacation. The last year has been a long hard
> road. If someone
> hasn't got to writing the man pages by the time I get back from lupin,
> I'll write it.
>
Dave,
I can volunteer to write the man page. I have been looking at some of
the iproute2 code and the corresponding parts in the kernel and
writing the man page would be a good learning process I think. I will
try to come up with a version for you to review shortly.
Regards,
Vijay
^ permalink raw reply
* Re: [PATCH] if: restore token ring ARP type to header
From: Paul Gortmaker @ 2012-05-23 2:28 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20120522140105.284c28cd@nehalam.linuxnetplumber.net>
On Tue, May 22, 2012 at 5:01 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> Recent removal of Token Ring breaks the build of iproute2.
>
> Even though Token Ring support is gone from the kernel, it is worth
> keeping the the definition of the TR ARP type to avoid breaking
> userspace programs that use this file.
Thanks Stephen, I was trying to force errors in kernel builds by
doing these kinds of define changes while doing my delete tests.
But I was not doing userspace builds, however....
Looking back at the changes I've made, the only thing similar that
I can see is the ipx.h changes:
diff --git a/include/linux/ipx.h b/include/linux/ipx.h
index 3d48014..8f02439 100644
--- a/include/linux/ipx.h
+++ b/include/linux/ipx.h
@@ -38,7 +38,7 @@ struct ipx_interface_definition {
#define IPX_FRAME_8022 2
#define IPX_FRAME_ETHERII 3
#define IPX_FRAME_8023 4
-#define IPX_FRAME_TR_8022 5 /* obsolete */
+/* obsolete token ring was 5 */
unsigned char ipx_special;
#define IPX_SPECIAL_NONE 0
#define IPX_PRIMARY 1
Please let me/netdev know if the above also happens to trigger any
sort of userspace fallout, and we'll fix it up ASAP.
Thanks,
Paul.
^ permalink raw reply related
* Re: [PATCH] if: restore token ring ARP type to header
From: Stephen Hemminger @ 2012-05-23 2:34 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: David Miller, netdev
In-Reply-To: <CAP=VYLqMGeFvy8O_zLi+sEP0t6p5M8f_x7d0g1XGkLHdVUVzUg@mail.gmail.com>
On Tue, 22 May 2012 22:28:28 -0400
Paul Gortmaker <paul.gortmaker@windriver.com> wrote:
> On Tue, May 22, 2012 at 5:01 PM, Stephen Hemminger
> <shemminger@vyatta.com> wrote:
> > Recent removal of Token Ring breaks the build of iproute2.
> >
> > Even though Token Ring support is gone from the kernel, it is worth
> > keeping the the definition of the TR ARP type to avoid breaking
> > userspace programs that use this file.
>
> Thanks Stephen, I was trying to force errors in kernel builds by
> doing these kinds of define changes while doing my delete tests.
>
> But I was not doing userspace builds, however....
>
> Looking back at the changes I've made, the only thing similar that
> I can see is the ipx.h changes:
>
> diff --git a/include/linux/ipx.h b/include/linux/ipx.h
> index 3d48014..8f02439 100644
> --- a/include/linux/ipx.h
> +++ b/include/linux/ipx.h
> @@ -38,7 +38,7 @@ struct ipx_interface_definition {
> #define IPX_FRAME_8022 2
> #define IPX_FRAME_ETHERII 3
> #define IPX_FRAME_8023 4
> -#define IPX_FRAME_TR_8022 5 /* obsolete */
> +/* obsolete token ring was 5 */
> unsigned char ipx_special;
> #define IPX_SPECIAL_NONE 0
> #define IPX_PRIMARY 1
>
> Please let me/netdev know if the above also happens to trigger any
> sort of userspace fallout, and we'll fix it up ASAP.
>
> Thanks,
> Paul.
Assume any header processed as part of 'make headers_install' is
part of kernel API and must not get broken. Therefor ipx.h
must be fixed as well. Local debug stuff should not be committed
to upstream repository!
^ permalink raw reply
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Yanmin Zhang @ 2012-05-23 3:02 UTC (permalink / raw)
To: David Miller; +Cc: kunx.jiang, netdev, linux-kernel
In-Reply-To: <20120522.151554.106838106733194160.davem@davemloft.net>
On Tue, 2012-05-22 at 15:15 -0400, David Miller wrote:
> From: "kun.jiang" <kunx.jiang@intel.com>
> Date: Tue, 22 May 2012 17:48:53 +0800
>
> > From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> >
> > We hit a kernel OOPS.
> ...
> > In function free_fib_info, we don't reset nexthop_nh->nh_dev to NULL before releasing
> > nh_dev. kfree_rcu(fi, rcu) would release the whole area.
> >
> > Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> > Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
>
> This isn't a fix. You're keeping around a pointer to a completely
> released object, which is therefore illegal to dereference.
>
> That's why we must set it to NULL, to catch such illegal accesses.
Thanks for the comments. The network stack in kernel is complicated.
Questions:
1) Why does free_fib_info call call_rcu instead of releasing fi directly?
I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
If other cpu are accessing it, here resetting to NULL would cause other
cpu panic.
void free_fib_info(struct fib_info *fi)
{
if (fi->fib_dead == 0) {
pr_warn("Freeing alive fib_info %p\n", fi);
return;
}
change_nexthops(fi) {
if (nexthop_nh->nh_dev)
dev_put(nexthop_nh->nh_dev);
nexthop_nh->nh_dev = NULL;
} endfor_nexthops(fi);
fib_info_cnt--;
release_net(fi->fib_net);
call_rcu(&fi->rcu, free_fib_info_rcu); <=====rcu
}
2) dev_put is to decrease the counter and doesn't release nh_dev.
If 2) is incorrect, how about below solution? It delays the resetting
in rcu.
================
We hit a kernel OOPS.
<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416] [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357] [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764] [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024] [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955] [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466] [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450] [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312] [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730] [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261] [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960] [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834] [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224] [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817] [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538] [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111] [<c123925d>] ? sub_preempt_count+0x3d/0x50
Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
fi. Other cpu might be accessing fi. Fixing it by delaying the resetting.
Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
---
net/ipv4/fib_semantics.c | 15 ++++++++-------
1 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5063fa3..56d8a97 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -145,6 +145,14 @@ static void free_fib_info_rcu(struct rcu_head *head)
{
struct fib_info *fi = container_of(head, struct fib_info, rcu);
+ change_nexthops(fi) {
+ if (nexthop_nh->nh_dev)
+ dev_put(nexthop_nh->nh_dev);
+ nexthop_nh->nh_dev = NULL;
+ } endfor_nexthops(fi);
+
+ fib_info_cnt--;
+ release_net(fi->fib_net);
if (fi->fib_metrics != (u32 *) dst_default_metrics)
kfree(fi->fib_metrics);
kfree(fi);
@@ -156,13 +164,6 @@ void free_fib_info(struct fib_info *fi)
pr_warn("Freeing alive fib_info %p\n", fi);
return;
}
- change_nexthops(fi) {
- if (nexthop_nh->nh_dev)
- dev_put(nexthop_nh->nh_dev);
- nexthop_nh->nh_dev = NULL;
- } endfor_nexthops(fi);
- fib_info_cnt--;
- release_net(fi->fib_net);
call_rcu(&fi->rcu, free_fib_info_rcu);
}
--
1.7.5.4
^ permalink raw reply related
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: David Miller @ 2012-05-23 3:23 UTC (permalink / raw)
To: yanmin_zhang; +Cc: kunx.jiang, netdev, linux-kernel
In-Reply-To: <1337742123.14538.175.camel@ymzhang.sh.intel.com>
From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Date: Wed, 23 May 2012 11:02:03 +0800
> 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> If other cpu are accessing it, here resetting to NULL would cause other
> cpu panic.
Because fib trie lookups are done with RCU locking, therefore we must
use RCU freeing to release the object.
What I was trying to impart to you is that removing the NULL
assignment is wrong and that an alternative fix is warranted (hint:
consider moving something into the RCU release).
^ permalink raw reply
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Yanmin Zhang @ 2012-05-23 3:30 UTC (permalink / raw)
To: David Miller; +Cc: kunx.jiang, netdev, linux-kernel
In-Reply-To: <20120522.232310.911242148705021745.davem@davemloft.net>
On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date: Wed, 23 May 2012 11:02:03 +0800
>
> > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > If other cpu are accessing it, here resetting to NULL would cause other
> > cpu panic.
>
> Because fib trie lookups are done with RCU locking, therefore we must
> use RCU freeing to release the object.
>
> What I was trying to impart to you is that removing the NULL
> assignment is wrong and that an alternative fix is warranted (hint:
> consider moving something into the RCU release).
Thanks for the explanation.
How about the new patch posted in the end of previous reply? It does move the
the resetting to RCU release.
https://lkml.org/lkml/2012/5/22/558?
^ permalink raw reply
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: David Miller @ 2012-05-23 3:49 UTC (permalink / raw)
To: yanmin_zhang; +Cc: kunx.jiang, netdev, linux-kernel
In-Reply-To: <1337743841.14538.178.camel@ymzhang.sh.intel.com>
From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Date: Wed, 23 May 2012 11:30:41 +0800
> How about the new patch posted in the end of previous reply? It does move the
> the resetting to RCU release.
> https://lkml.org/lkml/2012/5/22/558?
If you test it and submit it formally I'll probably apply it.
^ permalink raw reply
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Eric Dumazet @ 2012-05-23 4:37 UTC (permalink / raw)
To: David Miller; +Cc: yanmin_zhang, kunx.jiang, netdev, linux-kernel
In-Reply-To: <20120522.232310.911242148705021745.davem@davemloft.net>
On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date: Wed, 23 May 2012 11:02:03 +0800
>
> > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > If other cpu are accessing it, here resetting to NULL would cause other
> > cpu panic.
>
> Because fib trie lookups are done with RCU locking, therefore we must
> use RCU freeing to release the object.
>
> What I was trying to impart to you is that removing the NULL
> assignment is wrong and that an alternative fix is warranted (hint:
> consider moving something into the RCU release).
> --
Its more than that I'm afraid.
fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
Also the "fib_info_cnt--;" must stay in free_fib_info() (so that it is
protected by RTNL)
^ permalink raw reply
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Yanmin Zhang @ 2012-05-23 4:41 UTC (permalink / raw)
To: David Miller; +Cc: kunx.jiang, netdev, linux-kernel
In-Reply-To: <20120522.234920.1047532847473945281.davem@davemloft.net>
On Tue, 2012-05-22 at 23:49 -0400, David Miller wrote:
> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date: Wed, 23 May 2012 11:30:41 +0800
>
> > How about the new patch posted in the end of previous reply? It does move the
> > the resetting to RCU release.
> > https://lkml.org/lkml/2012/5/22/558?
>
> If you test it and submit it formally I'll probably apply it.
We did test it. But it's hard to reproduce it. We hit it the issue
for a couple of times when running MTBF testing on Android smart phone.
Mostly, it happens after MTBF runs for more than 12 hours. To releasing
the product, MTBF testing should run for hundreds of hours. This bug
blocks it.
We would submit it formally and also run more testing.
Sorry for taking you too much time.
^ permalink raw reply
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Yanmin Zhang @ 2012-05-23 4:54 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, kunx.jiang, netdev, linux-kernel
In-Reply-To: <1337747829.3361.1599.camel@edumazet-glaptop>
On Wed, 2012-05-23 at 06:37 +0200, Eric Dumazet wrote:
> On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> > From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> > Date: Wed, 23 May 2012 11:02:03 +0800
> >
> > > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > > If other cpu are accessing it, here resetting to NULL would cause other
> > > cpu panic.
> >
> > Because fib trie lookups are done with RCU locking, therefore we must
> > use RCU freeing to release the object.
> >
> > What I was trying to impart to you is that removing the NULL
> > assignment is wrong and that an alternative fix is warranted (hint:
> > consider moving something into the RCU release).
> > --
>
> Its more than that I'm afraid.
>
> fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
The new patch posted at https://lkml.org/lkml/2012/5/22/558 does move the
resetting to RCU protection.
>
> Also the "fib_info_cnt--;" must stay in free_fib_info() (so that it is
> protected by RTNL)
We would move "fib_info_cnt--;" back to free_fib_info.
Thanks,
Yanmin
^ permalink raw reply
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Eric Dumazet @ 2012-05-23 5:02 UTC (permalink / raw)
To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel
In-Reply-To: <1337748897.14538.184.camel@ymzhang.sh.intel.com>
On Wed, 2012-05-23 at 12:54 +0800, Yanmin Zhang wrote:
> > fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
> The new patch posted at https://lkml.org/lkml/2012/5/22/558 does move the
> resetting to RCU protection.
Its not enough.
We must take care that all users are in a RCU protected region.
They might be already, but a full check is needed.
For example
net/ipv4/fib_trie.c:2563: fi->fib_dev ? fi->fib_dev->name : "*"
looks to be safe (because already in a rcu_read_lock())
But its not.
Right thing would be to do :
struct net_device *ndev = rcu_dereference(fi->fib_dev)
...
ndev ? ndev->name : "*"
^ permalink raw reply
* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Eric Dumazet @ 2012-05-23 5:08 UTC (permalink / raw)
To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel
In-Reply-To: <1337748069.14538.181.camel@ymzhang.sh.intel.com>
On Wed, 2012-05-23 at 12:41 +0800, Yanmin Zhang wrote:
> We did test it. But it's hard to reproduce it. We hit it the issue
> for a couple of times when running MTBF testing on Android smart phone.
> Mostly, it happens after MTBF runs for more than 12 hours. To releasing
> the product, MTBF testing should run for hundreds of hours. This bug
> blocks it.
>
I have an idea how to trigger this in my test machine.
> We would submit it formally and also run more testing.
>
> Sorry for taking you too much time.
Hey, you fix a bug, take your time and dont worry.
Thanks
^ permalink raw reply
* Re: [PATCH 2/3] drivers: net: ethernet: stmmac: fix failure in module test
From: Bob Liu @ 2012-05-23 5:09 UTC (permalink / raw)
To: Giuseppe CAVALLARO
Cc: davem, francesco.virlinzi, rayagond, sr, netdev,
uclinux-dist-devel
In-Reply-To: <4FBB8BBD.8010208@st.com>
Hi Peppe,
On Tue, May 22, 2012 at 8:51 PM, Giuseppe CAVALLARO
<peppe.cavallaro@st.com> wrote:
> On 5/22/2012 9:38 AM, Bob Liu wrote:
>> Module can't be compiled and installed in current Makefile.
>
> I've just seen that I can generate the stmmac.ko on ARM and also on x86
> with my configuration.
>
> Can you post here you build failure?
>
> In the meantime I'll test the module on ARM soon.
Yes, stmmac.ko can be generated, but stmmac_platform.c won't be built.
which cause net device can't be probed.
>
> peppe
>
>>
>> Signed-off-by: Bob Liu <lliubbo@gmail.com>
>> ---
>> drivers/net/ethernet/stmicro/stmmac/Makefile | 12 ++++++------
>> 1 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile b/drivers/net/ethernet/stmicro/stmmac/Makefile
>> index bc965ac..4b50922 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/Makefile
>> +++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
>> @@ -1,10 +1,10 @@
>> obj-$(CONFIG_STMMAC_ETH) += stmmac.o
>> -stmmac-$(CONFIG_STMMAC_TIMER) += stmmac_timer.o
>> -stmmac-$(CONFIG_STMMAC_RING) += ring_mode.o
>> -stmmac-$(CONFIG_STMMAC_CHAINED) += chain_mode.o
>> -stmmac-$(CONFIG_STMMAC_PLATFORM) += stmmac_platform.o
>> -stmmac-$(CONFIG_STMMAC_PCI) += stmmac_pci.o
>> +stmmac-$(CONFIG_STMMAC_TIMER:m=y) += stmmac_timer.o
>> +stmmac-$(CONFIG_STMMAC_RING:m=y) += ring_mode.o
>> +stmmac-$(CONFIG_STMMAC_CHAINED:m=y) += chain_mode.o
>> +stmmac-$(CONFIG_STMMAC_PLATFORM:m=y) += stmmac_platform.o
>> +stmmac-$(CONFIG_STMMAC_PCI:m=y) += stmmac_pci.o
>> stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o \
>> dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \
>> dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o \
>> - mmc_core.o $(stmmac-y)
>> + mmc_core.o
>
--
Thanks,
--Bob
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox