* Re: netfilter question
  [not found] <cad49557-7c7a-83c9-d2b6-71d9624f0d52@miromedia.ca>

From: Eric Dumazet @ 2016-11-16 13:33 UTC
To: Eric Desrochers; +Cc: Florian Westphal, netfilter-devel

On Wed, Nov 16, 2016 at 2:22 AM, Eric Desrochers <ericd@miromedia.ca> wrote:
> Hi Eric,
>
> My name is Eric and I'm reaching out to you today as I found your name in
> multiple netfilter kernel commits, and was hoping we could discuss a
> potential regression.
>
> I identified (via git bisect) this commit
> [https://github.com/torvalds/linux/commit/71ae0dff02d756e4d2ca710b79f2ff5390029a5f]
> as the one introducing a serious performance slowdown when using the
> ip/ip6tables binaries with a large number of policies.
>
> I also tried with the latest and greatest v4.9-rc4 mainline kernel, and
> the slowdown is still present.
>
> Even commit
> [https://github.com/torvalds/linux/commit/a1a56aaa0735c7821e85c37268a7c12e132415cb],
> which introduces a 16-byte alignment on xt_counter percpu allocations so
> that bytes and packets sit in the same cache line, has no impact either.
>
> Everything I found is detailed in the following bug:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1640786
>
> Of course, I'm fully aware that "iptables-restore" should be the preferred
> choice as it is far more efficient (note that using iptables-restore does
> not exhibit the problem), but some folks still rely on ip/ip6tables and
> might face this performance slowdown.
>
> I found the problem today and will continue to investigate on my side,
> but I was wondering if we could have a discussion about this subject.
>
> Thanks in advance.
>
> Regards,
>
> Eric

Hi Eric,

Thanks for your mail, but you should CC the netfilter-devel mailing list.

The key point is that we really care about the fast path: packet
processing. The cited commit helps this a lot by lowering the memory
footprint on hosts with many cores; it is a step in the right direction.

Now we should probably batch the percpu allocations one page at a time, or
ask Tejun if percpu allocations could be made really, really fast (probably
much harder).

But really, you should not use iptables one rule at a time... that will
never compete with iptables-restore. ;)

Florian, would you have time to work on a patch trying to group the
percpu allocations one page at a time?

* Re: netfilter question

From: Florian Westphal @ 2016-11-16 15:02 UTC
To: Eric Dumazet; +Cc: Eric Desrochers, Florian Westphal, netfilter-devel

Eric Dumazet <edumazet@google.com> wrote:
> On Wed, Nov 16, 2016 at 2:22 AM, Eric Desrochers <ericd@miromedia.ca> wrote:
> [..]
> The key point is that we really care about the fast path: packet
> processing. The cited commit helps this a lot by lowering the memory
> footprint on hosts with many cores; it is a step in the right direction.
>
> Now we should probably batch the percpu allocations one page at a time,
> or ask Tejun if percpu allocations could be made really, really fast
> (probably much harder).
>
> But really, you should not use iptables one rule at a time... that will
> never compete with iptables-restore. ;)
>
> Florian, would you have time to work on a patch trying to group the
> percpu allocations one page at a time?

You mean something like this?

	xt_entry_foreach(iter, entry0, newinfo->size) {
-		ret = find_check_entry(iter, net, repl->name, repl->size);
-		if (ret != 0)
+		if (pcpu_alloc == 0) {
+			pcnt = __alloc_percpu(PAGE_SIZE, sizeof(struct xt_counters));
+			if (IS_ERR_VALUE(pcnt))
+				BUG();
+		}
+
+		iter->counters.pcnt = pcnt + pcpu_alloc;
+		iter->counters.bcnt = !!pcpu_alloc;
+		pcpu_alloc += sizeof(struct xt_counters);
+
+		if (pcpu_alloc > PAGE_SIZE - sizeof(struct xt_counters))
+			pcpu_alloc = 0;
+
+		ret = find_check_entry(iter, net, repl->name, repl->size);
	...

This is going to be ugly, since we'll have to deal with SMP vs. NONSMP
(i.e. no percpu allocations) in ip/ip6/arptables.

Error unwind will also be a mess (we can abuse .bcnt to tell whether the
pcpu offset should be freed or not).

But maybe I don't understand what you are suggesting :)
Can you elaborate? Thanks!

* Re: netfilter question

From: Eric Dumazet @ 2016-11-16 15:23 UTC
To: Florian Westphal; +Cc: Eric Dumazet, Eric Desrochers, netfilter-devel

On Wed, 2016-11-16 at 16:02 +0100, Florian Westphal wrote:
> Eric Dumazet <edumazet@google.com> wrote:
> [..]
> > Florian, would you have time to work on a patch trying to group the
> > percpu allocations one page at a time?
>
> You mean something like this?
>
> 	xt_entry_foreach(iter, entry0, newinfo->size) {
> -		ret = find_check_entry(iter, net, repl->name, repl->size);
> -		if (ret != 0)
> +		if (pcpu_alloc == 0) {
> +			pcnt = __alloc_percpu(PAGE_SIZE, sizeof(struct xt_counters));

The alignment should be a page.

> +			if (IS_ERR_VALUE(pcnt))
> +				BUG();

Well, no BUG() for sure ;)

[..]

> This is going to be ugly, since we'll have to deal with SMP vs. NONSMP
> (i.e. no percpu allocations) in ip/ip6/arptables.

Time for a common helper then ...

> Error unwind will also be a mess (we can abuse .bcnt to tell whether the
> pcpu offset should be freed or not).

Free it if the address is aligned to a page boundary? Otherwise skip it;
it has already been freed earlier.

> But maybe I don't understand what you are suggesting :)
> Can you elaborate?

Note that this grouping will also help data locality. I definitely have
servers with a huge number of percpu allocations, and I fear we might see
many TLB misses because of the possible spread of xt_counters.

Note that percpu pages must not be shared by multiple users
(ip/ip6/arptables); each table should get its own cache.

* Re: netfilter question

From: Florian Westphal @ 2016-11-17 0:07 UTC
To: Eric Dumazet; Cc: Florian Westphal, Eric Dumazet, Eric Desrochers, netfilter-devel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> [..]
> > You mean something like this?
> >
> > 	xt_entry_foreach(iter, entry0, newinfo->size) {
> > -		ret = find_check_entry(iter, net, repl->name, repl->size);
> > -		if (ret != 0)
> > +		if (pcpu_alloc == 0) {
> > +			pcnt = __alloc_percpu(PAGE_SIZE, sizeof(struct xt_counters));
>
> The alignment should be a page.
[..]
> > Error unwind will also be a mess (we can abuse .bcnt to tell whether
> > the pcpu offset should be freed or not).
>
> Free it if the address is aligned to a page boundary?

Good idea. This seems to work for me. Eric (Desrochers), does this
improve the situation for you as well?

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -403,38 +403,14 @@ static inline unsigned long ifname_compare_aligned(const char *_a,
 	return ret;
 }
 
+struct xt_percpu_counter_alloc_state {
+	unsigned int off;
+	const char *mem;
+};
 
-/* On SMP, ip(6)t_entry->counters.pcnt holds address of the
- * real (percpu) counter. On !SMP, its just the packet count,
- * so nothing needs to be done there.
- *
- * xt_percpu_counter_alloc returns the address of the percpu
- * counter, or 0 on !SMP. We force an alignment of 16 bytes
- * so that bytes/packets share a common cache line.
- *
- * Hence caller must use IS_ERR_VALUE to check for error, this
- * allows us to return 0 for single core systems without forcing
- * callers to deal with SMP vs. NONSMP issues.
- */
-static inline unsigned long xt_percpu_counter_alloc(void)
-{
-	if (nr_cpu_ids > 1) {
-		void __percpu *res = __alloc_percpu(sizeof(struct xt_counters),
-						    sizeof(struct xt_counters));
-
-		if (res == NULL)
-			return -ENOMEM;
-
-		return (__force unsigned long) res;
-	}
-
-	return 0;
-}
-static inline void xt_percpu_counter_free(u64 pcnt)
-{
-	if (nr_cpu_ids > 1)
-		free_percpu((void __percpu *) (unsigned long) pcnt);
-}
+bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
+			     struct xt_counters *counter);
+void xt_percpu_counter_free(struct xt_counters *cnt);
 
 static inline struct xt_counters *
 xt_get_this_cpu_counter(struct xt_counters *cnt)
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 39004da318e2..cbea0cb030da 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -411,17 +411,15 @@ static inline int check_target(struct arpt_entry *e, const char *name)
 }
 
 static inline int
-find_check_entry(struct arpt_entry *e, const char *name, unsigned int size)
+find_check_entry(struct arpt_entry *e, const char *name, unsigned int size,
+		 struct xt_percpu_counter_alloc_state *alloc_state)
 {
 	struct xt_entry_target *t;
 	struct xt_target *target;
-	unsigned long pcnt;
 	int ret;
 
-	pcnt = xt_percpu_counter_alloc();
-	if (IS_ERR_VALUE(pcnt))
+	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
 		return -ENOMEM;
-	e->counters.pcnt = pcnt;
 
 	t = arpt_get_target(e);
 	target = xt_request_find_target(NFPROTO_ARP, t->u.user.name,
@@ -439,7 +437,7 @@ find_check_entry(struct arpt_entry *e, const char *name, unsigned int size)
 err:
 	module_put(t->u.kernel.target->me);
 out:
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 
 	return ret;
 }
@@ -519,7 +517,7 @@ static inline void cleanup_entry(struct arpt_entry *e)
 	if (par.target->destroy != NULL)
 		par.target->destroy(&par);
 	module_put(par.target->me);
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 }
 
 /* Checks and translates the user-supplied table segment (held in
@@ -528,6 +526,7 @@ static inline void cleanup_entry(struct arpt_entry *e)
 static int translate_table(struct xt_table_info *newinfo, void *entry0,
 			   const struct arpt_replace *repl)
 {
+	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
 	struct arpt_entry *iter;
 	unsigned int *offsets;
 	unsigned int i;
@@ -590,7 +589,7 @@ static int translate_table(struct xt_table_info *newinfo, void *entry0,
 	/* Finally, each sanity check must pass */
 	i = 0;
 	xt_entry_foreach(iter, entry0, newinfo->size) {
-		ret = find_check_entry(iter, repl->name, repl->size);
+		ret = find_check_entry(iter, repl->name, repl->size, &alloc_state);
 		if (ret != 0)
 			break;
 		++i;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 46815c8a60d7..0024550516d1 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -531,7 +531,8 @@ static int check_target(struct ipt_entry *e, struct net *net, const char *name)
 
 static int
 find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
-		 unsigned int size)
+		 unsigned int size,
+		 struct xt_percpu_counter_alloc_state *alloc_state)
 {
 	struct xt_entry_target *t;
 	struct xt_target *target;
@@ -539,12 +540,9 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
 	unsigned int j;
 	struct xt_mtchk_param mtpar;
 	struct xt_entry_match *ematch;
-	unsigned long pcnt;
 
-	pcnt = xt_percpu_counter_alloc();
-	if (IS_ERR_VALUE(pcnt))
+	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
 		return -ENOMEM;
-	e->counters.pcnt = pcnt;
 
 	j = 0;
 	mtpar.net	= net;
@@ -582,7 +580,7 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
 		cleanup_match(ematch, net);
 	}
 
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 
 	return ret;
 }
@@ -670,7 +668,7 @@ cleanup_entry(struct ipt_entry *e, struct net *net)
 	if (par.target->destroy != NULL)
 		par.target->destroy(&par);
 	module_put(par.target->me);
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 }
 
 /* Checks and translates the user-supplied table segment (held in
@@ -679,6 +677,7 @@ static int
 translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
 		const struct ipt_replace *repl)
 {
+	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
 	struct ipt_entry *iter;
 	unsigned int *offsets;
 	unsigned int i;
@@ -738,7 +737,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
 	/* Finally, each sanity check must pass */
 	i = 0;
 	xt_entry_foreach(iter, entry0, newinfo->size) {
-		ret = find_check_entry(iter, net, repl->name, repl->size);
+		ret = find_check_entry(iter, net, repl->name, repl->size, &alloc_state);
 		if (ret != 0)
 			break;
 		++i;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 6ff42b8301cc..123d9af6742e 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -562,7 +562,8 @@ static int check_target(struct ip6t_entry *e, struct net *net, const char *name)
 
 static int
 find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
-		 unsigned int size)
+		 unsigned int size,
+		 struct xt_percpu_counter_alloc_state *alloc_state)
 {
 	struct xt_entry_target *t;
 	struct xt_target *target;
@@ -570,12 +571,9 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
 	unsigned int j;
 	struct xt_mtchk_param mtpar;
 	struct xt_entry_match *ematch;
-	unsigned long pcnt;
 
-	pcnt = xt_percpu_counter_alloc();
-	if (IS_ERR_VALUE(pcnt))
+	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
 		return -ENOMEM;
-	e->counters.pcnt = pcnt;
 
 	j = 0;
 	mtpar.net	= net;
@@ -612,7 +610,7 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
 		cleanup_match(ematch, net);
 	}
 
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 
 	return ret;
 }
@@ -699,8 +697,7 @@ static void cleanup_entry(struct ip6t_entry *e, struct net *net)
 	if (par.target->destroy != NULL)
 		par.target->destroy(&par);
 	module_put(par.target->me);
-
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 }
 
 /* Checks and translates the user-supplied table segment (held in
@@ -709,6 +706,7 @@ static int
 translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
 		const struct ip6t_replace *repl)
 {
+	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
 	struct ip6t_entry *iter;
 	unsigned int *offsets;
 	unsigned int i;
@@ -768,7 +766,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
 	/* Finally, each sanity check must pass */
 	i = 0;
 	xt_entry_foreach(iter, entry0, newinfo->size) {
-		ret = find_check_entry(iter, net, repl->name, repl->size);
+		ret = find_check_entry(iter, net, repl->name, repl->size, &alloc_state);
 		if (ret != 0)
 			break;
 		++i;
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index ad818e52859b..a4d1084b163f 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1615,6 +1615,59 @@ void xt_proto_fini(struct net *net, u_int8_t af)
 }
 EXPORT_SYMBOL_GPL(xt_proto_fini);
 
+/**
+ * xt_percpu_counter_alloc - allocate x_tables rule counter
+ *
+ * @state: pointer to xt_percpu allocation state
+ * @counter: pointer to counter struct inside the ip(6)/arpt_entry struct
+ *
+ * On SMP, the packet counter [ ip(6)t_entry->counters.pcnt ] will then
+ * contain the address of the real (percpu) counter.
+ *
+ * Rule evaluation needs to use xt_get_this_cpu_counter() helper
+ * to fetch the real percpu counter.
+ *
+ * To speed up allocation and improve data locality, an entire
+ * page is allocated.
+ *
+ * xt_percpu_counter_alloc_state contains the base address of the
+ * allocated page and the current sub-offset.
+ *
+ * returns false on error.
+ */
+bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
+			     struct xt_counters *counter)
+{
+	BUILD_BUG_ON(PAGE_SIZE < (sizeof(*counter) * 2));
+
+	if (nr_cpu_ids <= 1)
+		return true;
+
+	if (state->mem == NULL) {
+		state->mem = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
+		if (!state->mem)
+			return false;
+	}
+	counter->pcnt = (__force unsigned long)(state->mem + state->off);
+	state->off += sizeof(*counter);
+	if (state->off > (PAGE_SIZE - sizeof(*counter))) {
+		state->mem = NULL;
+		state->off = 0;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(xt_percpu_counter_alloc);
+
+void xt_percpu_counter_free(struct xt_counters *counters)
+{
+	unsigned long pcnt = counters->pcnt;
+
+	if (nr_cpu_ids > 1 && (pcnt & (PAGE_SIZE - 1)) == 0)
+		free_percpu((void __percpu *) (unsigned long)pcnt);
+}
+EXPORT_SYMBOL_GPL(xt_percpu_counter_free);
+
 static int __net_init xt_net_init(struct net *net)
 {
 	int i;

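The kernel-doc above refers to xt_get_this_cpu_counter(); for context,
here is a sketch of that helper as it exists in x_tables.h in this era,
reproduced from memory (details may differ slightly). It shows how the
packet fast path turns the stored .pcnt value back into this CPU's counter:

	/* Resolve e->counters into the counter for this CPU: on SMP,
	 * .pcnt holds a percpu address; on UP it is the counter itself. */
	static inline struct xt_counters *
	xt_get_this_cpu_counter(struct xt_counters *cnt)
	{
		if (nr_cpu_ids > 1)
			return this_cpu_ptr((void __percpu *)(unsigned long)cnt->pcnt);

		return cnt;
	}
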
* Re: netfilter question

From: Eric Dumazet @ 2016-11-17 2:34 UTC
To: Florian Westphal; +Cc: Eric Dumazet, Eric Desrochers, netfilter-devel

On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote:

Seems very nice!

> +void xt_percpu_counter_free(struct xt_counters *counters)
> +{
> +	unsigned long pcnt = counters->pcnt;
> +
> +	if (nr_cpu_ids > 1 && (pcnt & (PAGE_SIZE - 1)) == 0)
> +		free_percpu((void __percpu *) (unsigned long)pcnt);
> +}

pcnt is already an "unsigned long", so the inner cast is redundant.

This packing might also speed up "iptables -nvL" dumps ;)

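A sketch of the free helper with that redundant cast dropped (illustrative,
not necessarily the final merged version):

	void xt_percpu_counter_free(struct xt_counters *counters)
	{
		unsigned long pcnt = counters->pcnt;

		/* Only the entry whose counter sits at the start of the
		 * shared percpu block owns the allocation and may free it. */
		if (nr_cpu_ids > 1 && (pcnt & (PAGE_SIZE - 1)) == 0)
			free_percpu((void __percpu *)pcnt);
	}
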
* Re: netfilter question

From: Eric Desrochers @ 2016-11-17 15:49 UTC
To: Florian Westphal, Eric Dumazet; +Cc: Eric Dumazet, netfilter-devel

Hi Florian,

Thanks for the quick response. I will give it a try and get back to you
with the outcome of my test.

On 2016-11-17 01:07 AM, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> [..]
> > Free it if the address is aligned to a page boundary?
>
> Good idea. This seems to work for me. Eric (Desrochers), does this
> improve the situation for you as well?
>
> [..]

* Re: netfilter question

From: Eric Dumazet @ 2016-11-20 6:33 UTC
To: Florian Westphal; +Cc: Eric Dumazet, Eric Desrochers, netfilter-devel

On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote:

> +	if (state->mem == NULL) {
> +		state->mem = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
> +		if (!state->mem)
> +			return false;
> +	}

This will fail on arches where PAGE_SIZE = 65536: the percpu allocator's
limit is PCPU_MIN_UNIT_SIZE (32 KB).

So maybe use a smaller value like 4096?

#define XT_PCPU_BLOCK_SIZE 4096

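A sketch of the adjustment being suggested here, applied to the allocator
from the patch above. The constant name follows the suggestion in the
message; using it for the alignment and the BUILD_BUG_ON as well is an
assumption, and the free side would then mask with (XT_PCPU_BLOCK_SIZE - 1)
instead of (PAGE_SIZE - 1):

	/* Allocate counters in fixed 4 KB blocks rather than PAGE_SIZE ones,
	 * so the percpu request stays below PCPU_MIN_UNIT_SIZE (32 KB) even
	 * on architectures with 64 KB pages. */
	#define XT_PCPU_BLOCK_SIZE 4096

	bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
				     struct xt_counters *counter)
	{
		BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(*counter) * 2));

		if (nr_cpu_ids <= 1)
			return true;

		if (!state->mem) {
			state->mem = __alloc_percpu(XT_PCPU_BLOCK_SIZE,
						    XT_PCPU_BLOCK_SIZE);
			if (!state->mem)
				return false;
		}
		counter->pcnt = (__force unsigned long)(state->mem + state->off);
		state->off += sizeof(*counter);
		if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(*counter))) {
			state->mem = NULL;
			state->off = 0;
		}
		return true;
	}
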
[parent not found: <CAGUFhKwQTRRJpfGi2fRkFfGdpLYMN-2F9G+dEsavM7UGbkjjdA@mail.gmail.com>]
* Re: netfilter question

From: Eric Dumazet @ 2016-11-20 17:31 UTC
To: Eric D; +Cc: Eric Dumazet, netfilter-devel, Florian Westphal

On Sun, 2016-11-20 at 12:22 -0500, Eric D wrote:
> I'm currently abroad for work and will come back home soon. I will
> test the solution and provide feedback to Florian by the end of the week.
>
> Thanks for jumping on this quickly.
>
> Eric
>
> On Nov 20, 2016 7:33 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
> [..]
> > So maybe use a smaller value like 4096?
> >
> > #define XT_PCPU_BLOCK_SIZE 4096

Thanks Eric. I will test the patch myself, because I believe we need it
asap ;)

* Re: netfilter question

From: Eric Dumazet @ 2016-11-20 17:55 UTC
To: Eric D; +Cc: Eric Dumazet, netfilter-devel, Florian Westphal

On Sun, 2016-11-20 at 09:31 -0800, Eric Dumazet wrote:
> Thanks Eric. I will test the patch myself, because I believe we need it
> asap ;)

Current net-next without Florian's patch:

lpaa24:~# time for f in `seq 1 2000` ; do iptables -A FORWARD ; done

real	0m12.856s
user	0m0.590s
sys	0m11.131s

perf report:

    47.45%  iptables  [kernel.kallsyms]  [k] pcpu_alloc_area
     8.49%  iptables  [kernel.kallsyms]  [k] memset_erms
     7.35%  iptables  [kernel.kallsyms]  [k] get_counters
     2.87%  iptables  [kernel.kallsyms]  [k] __memmove
     2.33%  iptables  [kernel.kallsyms]  [k] pcpu_alloc
     2.07%  iptables  [kernel.kallsyms]  [k] _find_next_bit.part.0
     1.62%  iptables  xtables-multi      [.] 0x000000000001bb9d
     1.25%  iptables  [kernel.kallsyms]  [k] page_fault
     1.01%  iptables  [kernel.kallsyms]  [k] memcmp
     0.94%  iptables  [kernel.kallsyms]  [k] translate_table
     0.76%  iptables  [kernel.kallsyms]  [k] find_next_bit
     0.73%  iptables  [kernel.kallsyms]  [k] filemap_map_pages
     0.68%  iptables  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     0.54%  iptables  [kernel.kallsyms]  [k] __get_user_8
     0.54%  iptables  [kernel.kallsyms]  [k] clear_page_c_e

After the patch:

lpaa24:~# time for f in `seq 1 2000` ; do iptables -A FORWARD ; done

real	0m3.867s
user	0m0.559s
sys	0m2.216s

    22.15%  iptables  [kernel.kallsyms]  [k] get_counters
     5.85%  iptables  xtables-multi      [.] 0x000000000001bbac
     3.99%  iptables  [kernel.kallsyms]  [k] page_fault
     2.37%  iptables  [kernel.kallsyms]  [k] memcmp
     2.19%  iptables  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     1.89%  iptables  [kernel.kallsyms]  [k] translate_table
     1.78%  iptables  [kernel.kallsyms]  [k] memset_erms
     1.74%  iptables  [kernel.kallsyms]  [k] clear_page_c_e
     1.73%  iptables  [kernel.kallsyms]  [k] __get_user_8
     1.72%  iptables  [kernel.kallsyms]  [k] perf_iterate_ctx
     1.21%  iptables  [kernel.kallsyms]  [k] handle_mm_fault
     0.98%  iptables  [kernel.kallsyms]  [k] unmap_page_range

So this is a huge win. And I suspect the data path will also gain from all
pcpu counters being in the same area of memory (this is where I am most
interested).

* netfilter & ipv6 @ 2005-02-08 7:50 Jonas Berlin [not found] ` <53965.213.236.112.75.1107867276.squirrel@213.236.112.75> 0 siblings, 1 reply; 10+ messages in thread From: Jonas Berlin @ 2005-02-08 7:50 UTC (permalink / raw) To: netfilter-devel Hi! I have found that there seems to be a bunch of match & target modules for iptables that don't exist for ip6tables.. Like CLASSIFY for example. And some that exist have only a subset of the features. Also some modules exist only for linux 2.4 but were never ported to 2.6 or don't compile cleanly anymore on the newest 2.6 kernels. While I don't have that much experience of 2.6 yet and only have been hacking some smaller modules for 2.4 in the past, I think I should have enough skills to at least port/update some of these. So I thought maybe I could help out with this task - ipv6 is something that I hope and think will take over ipv4 some day and I'd like for linux to be able to provide the same experience then as ipv4 users of iptables have today. I have some questions.. 1. Some modules (will) look almost identical on ipv4 on and ipv6 - would there be any point in making the ipv4 modules export some symbols and then re-use those methods in ipv6 if it wouldn't imply any (major) changes to the ipv4 modules? 2. I think it could be nice to keep track of what modules exist for ipv4 and ipv6 on some web page and maybe also let people sign up to join the porting effort (if there are any =). What do you think? And is that something I could help out with as well? I have a server at home that could run it, however behind an 0,5M adsl line.. 3. If I succeed porting something, what should I name the patch to be sent to patch-o-matic-ng? I don't think I could just extend the existing ones as that would suggest the original authors have something to do with it and also some of them also are pending or already included in the ekernel. So, would CLASSIFY_v6 be an acceptable name for the ipv6 version of CLASSIFY? And maybe condition_26 for the 2.6 port of condition from 2.4? What's your general procedure if someone have updates to a module that someone else did? Should the original authors be contacted instead of sending patches to this list? Any suggestions you have are welcome! -- - xkr47 ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <53965.213.236.112.75.1107867276.squirrel@213.236.112.75>]
* ULOG target for ipv6

From: Jonas Berlin @ 2005-02-10 23:15 UTC
To: Martijn Lievaart, netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 678 bytes --]

Martijn Lievaart wrote:
> What I mostly miss in ipv6 support is conntrack and ULOG.

Here's my first attempt at ULOG for ipv6. It's _untested_, but it compiles
at least on 2.6, and it also patches fine against 2.4.28.

As a side note - I have no idea how a (the) userspace daemon will react to
getting ipv6 packets. :) Feel free to send me any problems you encounter
with the kernel module or iptables extension, though.

Good luck :)

--------

Someone could also point me to the latest instructions on how to
contribute - the online docs I found here didn't work very well:
http://netfilter.org/documentation/HOWTO//netfilter-extensions-HOWTO-9.html

-- 
- xkr47

[-- Attachment #2: ipv6-ULOG-1.tar.bz2 --]
[-- Type: application/octet-stream, Size: 5806 bytes --]

* netfilter question

From: Pedro Fortuna @ 2005-02-11 22:10 UTC
To: netfilter-devel

Hello guys. I'll try to make this as short and simple as I can.

I want to develop a kernel module which will be running on two Linux hosts
connected by a crossover network cable (ethernet). This kernel module will
intercept a specific type of traffic (as an example, let's say FTP packets
encapsulated in DIX frames), both incoming and outgoing, and change the
ethertype in the frame header.

Outgoing DIX frames carrying FTP packets get their ethertype changed to a
private, non-standard ethertype number just before they leave the host
(i.e. before they are passed to the network driver). The frame is
intercepted with the NF_IP_POST_ROUTING hook.

Incoming DIX frames carrying FTP packets get their ethertype (at this
point, a non-standard ethertype number) changed back to the standard IPv4
ethertype number (i.e. 0x800) just after they are processed by the network
driver. The frame is intercepted with the NF_IP_PRE_ROUTING hook.

My doubt is: I'm not sure I will be able to intercept the incoming frames,
because they have a non-standard ethertype number. They might get dropped
before passing through the NF_IP_PRE_ROUTING hook, due to the unrecognized
ethertype number. Is this true or false? If the frame reaches the hook
before the stack tries to identify the packet type, then I'll have no
trouble, because my netfilter module changes the frame back to the
original ethertype number, thus making the whole process transparent to
the TCP/IP stacks running on both hosts.

I could explain what the hell I need this for, but then you would have
three times more text to read :P I tried to restrict this post to a
minimum painless size.

Regards,
-Pedro Fortuna

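For reference, a minimal sketch of the POST_ROUTING side described above,
using the 2.6-era netfilter API; the module, function, and ethertype names
here are hypothetical. As to the question itself: NF_IP_PRE_ROUTING is
invoked from inside the IPv4 receive path, which a frame only enters after
eth_type_trans() has dispatched it by ethertype, so incoming frames with
an unrecognized ethertype would be dropped before reaching the hook. A
module that needs to receive such frames would instead register a
link-layer packet handler for that ethertype with dev_add_pack().

	#include <linux/module.h>
	#include <linux/skbuff.h>
	#include <linux/netfilter.h>
	#include <linux/netfilter_ipv4.h>

	#define ETH_P_PRIVATE 0x88b5	/* hypothetical private ethertype */

	/* Runs just before the packet is handed to the driver; a real
	 * module would match FTP traffic here and rewrite the link-layer
	 * header / skb->protocol accordingly. */
	static unsigned int ethertype_rewrite(unsigned int hooknum,
					      struct sk_buff **pskb,
					      const struct net_device *in,
					      const struct net_device *out,
					      int (*okfn)(struct sk_buff *))
	{
		return NF_ACCEPT;
	}

	static struct nf_hook_ops rewrite_ops = {
		.hook     = ethertype_rewrite,
		.pf       = PF_INET,
		.hooknum  = NF_IP_POST_ROUTING,
		.priority = NF_IP_PRI_LAST,
	};

	static int __init rewrite_init(void)
	{
		return nf_register_hook(&rewrite_ops);
	}

	static void __exit rewrite_exit(void)
	{
		nf_unregister_hook(&rewrite_ops);
	}

	module_init(rewrite_init);
	module_exit(rewrite_exit);
	MODULE_LICENSE("GPL");
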
Thread overview: 10+ messages

[not found] <cad49557-7c7a-83c9-d2b6-71d9624f0d52@miromedia.ca>
2016-11-16 13:33 ` netfilter question  Eric Dumazet
2016-11-16 15:02   ` Florian Westphal
2016-11-16 15:23     ` Eric Dumazet
2016-11-17  0:07       ` Florian Westphal
2016-11-17  2:34         ` Eric Dumazet
2016-11-17 15:49         ` Eric Desrochers
2016-11-20  6:33         ` Eric Dumazet
[not found]                ` <CAGUFhKwQTRRJpfGi2fRkFfGdpLYMN-2F9G+dEsavM7UGbkjjdA@mail.gmail.com>
2016-11-20 17:31           ` Eric Dumazet
2016-11-20 17:55             ` Eric Dumazet

2005-02-08  7:50 netfilter & ipv6  Jonas Berlin
[not found] ` <53965.213.236.112.75.1107867276.squirrel@213.236.112.75>
2005-02-10 23:15   ` ULOG target for ipv6  Jonas Berlin
2005-02-11 22:10     ` netfilter question  Pedro Fortuna