Netdev List
* Re: [PATCH 1/3] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Jan Engelhardt @ 2009-09-02 15:49 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Hannes Eder, lvs-devel, linux-kernel, netdev, netfilter-devel,
	Fabien Duchêne, Jean-Luc Fortemaison, Julian Anastasov,
	Julius Volz, Laurent Grawet, Simon Horman, Wensong Zhang
In-Reply-To: <4A9E90E4.9080805@trash.net>


On Wednesday 2009-09-02 17:36, Patrick McHardy wrote:
>> 
>> Nice, I'll use par->family.
>> 
>> So in theory I do not even need a check like the following in the beginning?
>> 
>> 	if (family != NFPROTO_IPV4
>> #ifdef CONFIG_IP_VS_IPV6
>> 	    && family != NFPROTO_IPV6
>> #endif
>> 		) {
>> 		match = false;
>> 		goto out;
>> 	}
>
>With the AF_UNSPEC registration of your match, it might be used

par->family always contains the NFPROTO of the invoking implementation,
which can never be UNSPEC (except, in future, xtables2 ;-)

par->match->family however may be UNSPEC if the module works that way.
Which is why we have par->family.
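
The distinction can be modelled in a few lines of plain C. This is a userland sketch, not kernel code: `model_match`, `model_param` and the `MODEL_NFPROTO_*` constants are hypothetical stand-ins for `struct xt_match`, `struct xt_match_param` and the `NFPROTO_*` constants, kept only detailed enough to show which field a match function should read.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's NFPROTO_* constants. */
enum model_nfproto {
	MODEL_NFPROTO_UNSPEC,
	MODEL_NFPROTO_IPV4,
	MODEL_NFPROTO_IPV6,
};

/* Stand-in for struct xt_match: the family the module registered with. */
struct model_match {
	enum model_nfproto family;
};

/* Stand-in for struct xt_match_param: what the caller hands to the match. */
struct model_param {
	const struct model_match *match;
	enum model_nfproto family;	/* family of the invoking table */
};

/* Build the parameter block the way a table of family `caller` would. */
static struct model_param model_invoke(const struct model_match *m,
				       enum model_nfproto caller)
{
	struct model_param par = { .match = m, .family = caller };

	return par;
}

/* A match body keys off par->family, not par->match->family. */
static bool model_family_is_ip(const struct model_param *par)
{
	return par->family == MODEL_NFPROTO_IPV4 ||
	       par->family == MODEL_NFPROTO_IPV6;
}
```

With a match registered as UNSPEC and invoked from an IPv6 table, `par.match->family` stays UNSPEC while `par.family` reports IPv6, which is the case described above.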


* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
From: Christoph Lameter @ 2009-09-02 19:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, sri, dlstevens, netdev, niv, mtk.manpages
In-Reply-To: <4A9E849A.30105@gmail.com>


The patch is smaller if you remove the handling of recverr completely from
ip_push_pending_frames() and return NET_RX_DROP etc. Two of the callers
never even inspect the return code. For them this is useless processing.

The others could handle the processing of recverr on their own. Doing so
avoids adding code to ip_push_pending_frames(), which is latency critical,
and also avoids changing the calling conventions.
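
A rough userland sketch of that split, not kernel code. All `model_*` names are hypothetical, the `MODEL_XMIT_*` values are illustrative assumptions, and the errno mapping is assumed to mirror what net_xmit_errno() does in the patch under discussion (congestion notification maps to 0, any other positive code to -ENOBUFS).

```c
#include <errno.h>

/* Illustrative stand-ins for the positive queueing return codes. */
enum { MODEL_XMIT_SUCCESS = 0, MODEL_XMIT_DROP = 1, MODEL_XMIT_CN = 2 };

/* Assumed mapping: only meaningful for positive codes. */
static int model_xmit_errno(int code)
{
	return code != MODEL_XMIT_CN ? -ENOBUFS : 0;
}

/* Hypothetical push function: hands the raw result straight back. */
static int model_push_pending_frames(int qdisc_result)
{
	return qdisc_result;
}

/* A caller that cares about the result does the recverr mapping itself. */
static int model_sendmsg(int qdisc_result, int recverr)
{
	int err = model_push_pending_frames(qdisc_result);

	if (err > 0)
		err = recverr ? model_xmit_errno(err) : 0;
	return err;
}

/* A caller that never inspects the result simply discards it. */
static void model_send_reply(int qdisc_result)
{
	(void)model_push_pending_frames(qdisc_result);
}
```

The push function keeps its one-argument signature in this model, and the recverr decision costs nothing for callers such as `model_send_reply()` that ignore the return value.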

I have a draft here from our earlier discussions, but it's not as
comprehensive as yours.




* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Bill Fink @ 2009-09-02 15:38 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin
In-Reply-To: <20090902104915.GA402@hmsreliant.think-freely.org>

On Wed, 2 Sep 2009, Neil Horman wrote:

> On Wed, Sep 02, 2009 at 01:11:43AM -0400, Bill Fink wrote:
> > On Thu, 27 Aug 2009, Neil Horman wrote:
> > 
> > > On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote:
> > > > On Wed, 26 Aug 2009, Neil Horman wrote:
> > > > 
> > > > > Here  you go, I think this will fix your oops.
> > > > > 
> > > > > 
> > > > >     Fix NULL pointer deref in skb sources ftracer
> > > > >     
> > > > >     Its possible that skb->sk will be null in this path, so we shouldn't just assume
> > > > >     we can pass it to sock_net
> > > > >     
> > > > >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > 
> > > > >  trace_skb_sources.c |    6 ++++--
> > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > > > > index 40eb071..8bf518f 100644
> > > > > --- a/kernel/trace/trace_skb_sources.c
> > > > > +++ b/kernel/trace/trace_skb_sources.c
> > > > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > > >  	struct ring_buffer_event *event;
> > > > >  	struct trace_skb_event *entry;
> > > > >  	struct trace_array *tr = skb_trace;
> > > > > -	struct net_device *dev;
> > > > > +	struct net_device *dev = NULL;
> > > > >  
> > > > >  	if (!trace_skb_source_enabled)
> > > > >  		return;
> > > > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > > >  	entry->event_data.rx_queue = skb->queue_mapping;
> > > > >  	entry->event_data.ccpu = smp_processor_id();
> > > > >  
> > > > > -	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > > +	if (skb->sk)
> > > > > +		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > > +
> > > > >  	if (dev) {
> > > > >  		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> > > > >  		dev_put(dev);
> > > > 
> > > > 
> > > > 
> > > > On the positive side, it did fix the oops.  But the results of the
> > > > skb_sources tracing were not that useful.
> > > > 
> > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
> > > >  5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
> > > > n2: 11819.0786 MB /  10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
> > > > 
> > > > First off, only 10 trace entries were made:
> > > > 
> > > > [root@xeontest1 tracing]# wc trace
> > > > 14 90 334 trace
> > > > 
> > > > And here they are:
> > > > 
> > > > [root@xeontest1 tracing]# cat trace
> > > > # tracer: skb_sources
> > > > #
> > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > #        |       |       |       |       |       |       |
> > > >         5521    0       0       Unknown 0       3       888
> > > >         5521    0       0       Unknown 0       3       896
> > > >         5521    0       0       Unknown 0       3       20
> > > >         5521    0       0       Unknown 0       3       888
> > > >         5521    0       0       Unknown 0       3       896
> > > >         5521    0       0       Unknown 0       3       20
> > > >         5521    1       1       Unknown 0       4       20
> > > >         5521    1       1       Unknown 0       4       11
> > > >         5521    1       1       Unknown 0       4       540
> > > >         5521    1       1       Unknown 0       4       0
> > > > 
> > > > Even for these 10 entries, why is the IFC Unknown?  The LENs
> > > > seem to be wrong too.
> > > > 
> > > > 						-Bill
> > > > 
> > > I'm not sure why you're getting Unknown Interface names.  Nominally that
> > > indicates that the skb->iif value in the skb was incorrect or otherwise not set,
> > > which shouldn't be the case.  As for the lengths, that just seems wrong.  That
> > > length value is taken directly from skb->len, so if it's not right, it seems like
> > > it's not getting set correctly someplace.
> > > 
> > > As you may have seen we're removing the ftrace module, and replacing it with the
> > > use of raw trace events.  When I have that working, I'll see if I get similar
> > > results.  I never did in my local testing of the ftrace module, but perhaps it's
> > > related to load or something.
> > 
> > IIUC I should keep the first of your original three ftrace patches,
> > revert all the rest, and then apply your very latest patch that
> > augments the skb_copy_datagram_iovec TRACE_EVENT.  Do I have that
> > basically correct?
> > 
> That's exactly correct, yes.
> 
> > Then I just need to ask how do I use this new method?
> > 
> It works in basically the same way.  Except instead of doing this:
> echo skb_ftracer > /sys/kernel/debug/tracing/current_tracer
> you do this:
> echo 1 > /sys/kernel/debug/tracing/events/skb/skb_copy_datagram_iovec/enable
> Then the events should show up in /sys/kernel/debug/tracing/trace[_pipe]

Thanks!  I'll probably give this a try later today and report back.

						-Bill
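
For a test harness that cannot shell out, the `echo 1 > .../enable` step quoted above can be done from C. A minimal sketch, assuming the caller passes the path itself; `enable_trace_event` is a hypothetical helper name, and only standard stdio calls are used.

```c
#include <stdio.h>

/*
 * Write "1" to a tracing "enable" file, the C equivalent of
 * "echo 1 > enable_path". Returns 0 on success, -1 on failure.
 */
static int enable_trace_event(const char *enable_path)
{
	FILE *f = fopen(enable_path, "w");

	if (!f)
		return -1;
	if (fputs("1\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes; a failed flush means the write did not land. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Called with the path from the quoted instructions (/sys/kernel/debug/tracing/events/skb/skb_copy_datagram_iovec/enable), this would need root and a mounted debugfs; the events are then read from the trace or trace_pipe file as described.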


* Re: [PATCH 1/3] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Patrick McHardy @ 2009-09-02 15:36 UTC (permalink / raw)
  To: Hannes Eder
  Cc: lvs-devel, linux-kernel, netdev, netfilter-devel,
	Fabien Duchêne, Jan Engelhardt, Jean-Luc Fortemaison,
	Julian Anastasov, Julius Volz, Laurent Grawet, Simon Horman,
	Wensong Zhang
In-Reply-To: <b5ddba180909020833w1739bd54t11fded8150007abc@mail.gmail.com>

Hannes Eder wrote:
> On Wed, Sep 2, 2009 at 16:54, Patrick McHardy<kaber@trash.net> wrote:
>> Hannes Eder wrote:
>>> This implements the kernel-space side of the netfilter matcher
>>> xt_ipvs.
>> Looks mostly fine to me, just one question:
>>
>>> +bool ipvs_mt(const struct sk_buff *skb, const struct xt_match_param *par)
>>> +{
>>> +     const struct xt_ipvs *data = par->matchinfo;
>>> +     const u_int8_t family = par->family;
>>> +     struct ip_vs_iphdr iph;
>>> +     struct ip_vs_protocol *pp;
>>> +     struct ip_vs_conn *cp;
>>> +     int af;
>>> +     bool match = true;
>>> +
>>> +     if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
>>> +             match = skb->ipvs_property ^
>>> +                     !!(data->invert & XT_IPVS_IPVS_PROPERTY);
>>> +             goto out;
>>> +     }
>>> +
>>> +     /* other flags than XT_IPVS_IPVS_PROPERTY are set */
>>> +     if (!skb->ipvs_property) {
>>> +             match = false;
>>> +             goto out;
>>> +     }
>>> +
>>> +     switch (skb->protocol) {
>>> +     case  htons(ETH_P_IP):
>>> +             af = AF_INET;
>>> +             break;
>>> +#ifdef CONFIG_IP_VS_IPV6
>>> +     case  htons(ETH_P_IPV6):
>>> +             af = AF_INET6;
>>> +             break;
>>> +#endif
>>> +     default:
>>> +             match = false;
>>> +             goto out;
>>> +     }
>> In the NF_INET_LOCAL_OUT hook skb->protocol is invalid. So if you
>> don't need this, it would make sense to restrict the match to the
>> other hooks.
>>
>> Even easier would be to use par->family, which contains the address
>> family and doesn't need any translation.
> 
> Nice, I'll use par->family.
> 
> So in theory I do not even need a check like the following in the beginning?
> 
> 	if (family != NFPROTO_IPV4
> #ifdef CONFIG_IP_VS_IPV6
> 	    && family != NFPROTO_IPV6
> #endif
> 		) {
> 		match = false;
> 		goto out;
> 	}

With the AF_UNSPEC registration of your match, it might be used
with different families. But you could add two separate IPV4/IPV6
registrations or catch an invalid family in ->checkentry() and
remove the runtime check.
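
The ->checkentry() variant can be sketched as a plain C model. This is not the x_tables API: `model_ipvs_checkentry` and the `MODEL_NFPROTO_*` constants are hypothetical, the bool return (true meaning "rule accepted") is an assumption about the hook's convention, and `ipv6_built_in` stands in for CONFIG_IP_VS_IPV6.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's NFPROTO_* constants. */
enum model_nfproto {
	MODEL_NFPROTO_IPV4,
	MODEL_NFPROTO_IPV6,
	MODEL_NFPROTO_ARP,
};

/*
 * Model of a ->checkentry() hook: it runs once, when a rule is
 * inserted, so the per-packet match function needs no family check.
 */
static bool model_ipvs_checkentry(enum model_nfproto family,
				  bool ipv6_built_in)
{
	if (family == MODEL_NFPROTO_IPV4)
		return true;
	if (family == MODEL_NFPROTO_IPV6)
		return ipv6_built_in;
	return false;	/* reject: the match never runs for this family */
}
```

The point of the design is that the cost moves from every packet to rule insertion time.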


* Re: [PATCH 1/3] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Hannes Eder @ 2009-09-02 15:33 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: lvs-devel, linux-kernel, netdev, netfilter-devel,
	Fabien Duchêne, Jan Engelhardt, Jean-Luc Fortemaison,
	Julian Anastasov, Julius Volz, Laurent Grawet, Simon Horman,
	Wensong Zhang
In-Reply-To: <4A9E8711.1070807@trash.net>

On Wed, Sep 2, 2009 at 16:54, Patrick McHardy<kaber@trash.net> wrote:
> Hannes Eder wrote:
>> This implements the kernel-space side of the netfilter matcher
>> xt_ipvs.
>
> Looks mostly fine to me, just one question:
>
>> +bool ipvs_mt(const struct sk_buff *skb, const struct xt_match_param *par)
>> +{
>> +     const struct xt_ipvs *data = par->matchinfo;
>> +     const u_int8_t family = par->family;
>> +     struct ip_vs_iphdr iph;
>> +     struct ip_vs_protocol *pp;
>> +     struct ip_vs_conn *cp;
>> +     int af;
>> +     bool match = true;
>> +
>> +     if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
>> +             match = skb->ipvs_property ^
>> +                     !!(data->invert & XT_IPVS_IPVS_PROPERTY);
>> +             goto out;
>> +     }
>> +
>> +     /* other flags than XT_IPVS_IPVS_PROPERTY are set */
>> +     if (!skb->ipvs_property) {
>> +             match = false;
>> +             goto out;
>> +     }
>> +
>> +     switch (skb->protocol) {
>> +     case  htons(ETH_P_IP):
>> +             af = AF_INET;
>> +             break;
>> +#ifdef CONFIG_IP_VS_IPV6
>> +     case  htons(ETH_P_IPV6):
>> +             af = AF_INET6;
>> +             break;
>> +#endif
>> +     default:
>> +             match = false;
>> +             goto out;
>> +     }
>
> In the NF_INET_LOCAL_OUT hook skb->protocol is invalid. So if you
> don't need this, it would make sense to restrict the match to the
> other hooks.
>
> Even easier would be to use par->family, which contains the address
> family and doesn't need any translation.

Nice, I'll use par->family.

So in theory I do not even need a check like the following in the beginning?

	if (family != NFPROTO_IPV4
#ifdef CONFIG_IP_VS_IPV6
	    && family != NFPROTO_IPV6
#endif
		) {
		match = false;
		goto out;
	}

Thanks,
-Hannes


* Re: r8169 mac address changes persist across reboot (was: duplicate MAC addresses)
From: Sergey Vlasov @ 2009-09-02 14:32 UTC (permalink / raw)
  To: Ivan Vecera
  Cc: Alan Jenkins, Francois Romieu, marty, linux-hotplug, netdev,
	linux-kernel, Mikael Pettersson
In-Reply-To: <4A9CD443.6030603@redhat.com>


On Tue, Sep 01, 2009 at 09:58:59AM +0200, Ivan Vecera wrote:
> Alan Jenkins napsal(a):
[...]
> > Looking at r8169.c confirms this.  It doesn't appear to initialize the
> > MAC address register from elsewhere; it just uses the current value.
> > It will also report this initial value as the "permanent" MAC address,
> > which your report suggests is wrong.  I think your problem is a bug in
> > r8169.
> > 
> > Francois, I found a datasheet for the 8139; it was claimed to be
> > similar and it does indeed appear so.  The datasheet suggests that the
> > driver needs to provoke "auto-load" from the EEPROM at load time.
> > Could you please have a look at fixing this, so that MAC address
> > changes do not persist over a reboot?
> > 
> Using auto-loading method is not possible for all new Realtek products
> (mainly for PCI-E chipsets). I asked the Realtek engineers and this
> is the answer:
> Q: Is it possible to use HW register at offset 50h to reload the MAC
>    address from EEPROM or somewhere else? I mean the usage of Auto-load
>    mode (bits EEM1=0 and EEM0=1 in 50h HW reg).
> A: Current new LANs don't load mac address through autoload command.
>    You need to read it out from external eeprom or internal efuse then
>    put it back to ID0~ID5.
> 
> So you need to read the MAC address from the EEPROM and then write it into
> ID0-ID5 registers. I already created the patch that initializes the MAC
> address from EEPROM but there were some issues with this patch so it was
> reverted. Mikael reported that the MAC address from his adapter (on Thecus
> n2100) is read only partly (first 3 bytes were correct but the rest were
> zeros). Later we found that the MAC address is correct and there are really
> 3 correct bytes + 3 zeros in EEPROM. The Thecus n2100 system probably uses
> only these 3 bytes and fills in the remaining 3 bytes by itself (they are
> probably stored somewhere in the firmware).

Unfortunately, this kind of crap (crucial information such as MAC
addresses stored in places known only to some proprietary firmware) is
too common with recent devices (e.g., forcedeth has the same problem
even on PCs).

> I have tested my patch with several different Realtek NICs without any
> problem, but what should we do with embedded systems like the n2100 that
> initialize the MAC in another way?
> 
> There are 2 possibilities:
> 1) There could be an additional module param to enable/disable the MAC
>    initialization
> 2) The MAC address read from EEPROM will be checked against the current
>    MAC address in ID0-5 registers. And the current MAC will be replaced
>    by the one from EEPROM only if the first 3 bytes (OUI part) are
>    different.

 3) Try the same solution as forcedeth does - save original contents
    of MAC address registers in rtl8169_init_one() (we already have
    perm_addr available for this; forcedeth uses a separate variable
    due to its "reversed MAC" brain damage), then restore MAC to its
    initial state in rtl8169_remove_one() (to handle module reload)
    and rtl_shutdown() (for soft reboot or kexec), and hope that on an
    unexpected hard reboot the firmware will reinit the chip properly.
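
Options 2 and 3 can be compared with a small userland model. Nothing below touches real hardware or the r8169 driver: `model_nic` and the `model_*` functions are hypothetical, with `id[]` standing in for the ID0-ID5 registers and `perm_addr` for the value saved at probe time.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6
#define OUI_LEN 3

/* Hypothetical model of a NIC. */
struct model_nic {
	uint8_t id[MAC_LEN];		/* live MAC address registers */
	uint8_t perm_addr[MAC_LEN];	/* register contents seen at probe */
};

/* Option 3, probe side: remember what the firmware left in the registers. */
static void model_probe(struct model_nic *nic)
{
	memcpy(nic->perm_addr, nic->id, MAC_LEN);
}

/* A runtime MAC address change requested by the administrator. */
static void model_set_mac(struct model_nic *nic, const uint8_t *mac)
{
	memcpy(nic->id, mac, MAC_LEN);
}

/* Option 3, remove/shutdown side: put the saved address back. */
static void model_restore(struct model_nic *nic)
{
	memcpy(nic->id, nic->perm_addr, MAC_LEN);
}

/* Option 2: overwrite from EEPROM only if the OUI (first 3 bytes) differs. */
static bool model_oui_differs(const uint8_t *cur, const uint8_t *eeprom)
{
	return memcmp(cur, eeprom, OUI_LEN) != 0;
}
```

In this model, option 3 never consults the EEPROM at all, so a board whose EEPROM holds only three valid bytes keeps the address its firmware programmed. Option 2 gives the same result for such a board only as long as the OUI bytes match.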



* Re: [PATCH 2/3] IPVS: make friends with nf_conntrack
From: Patrick McHardy @ 2009-09-02 14:56 UTC (permalink / raw)
  To: Hannes Eder
  Cc: lvs-devel, linux-kernel, netdev, netfilter-devel,
	Fabien Duchêne, Jan Engelhardt, Jean-Luc Fortemaison,
	Julian Anastasov, Julius Volz, Laurent Grawet, Simon Horman,
	Wensong Zhang
In-Reply-To: <20090902101538.11561.11911.stgit@jazzy.zrh.corp.google.com>

Hannes Eder wrote:
> Update the nf_conntrack tuple in reply direction, as we will see
> traffic from the real server (RIP) to the client (CIP).  Once this is
> done we can use netfilter's SNAT in POSTROUTING, especially with
> xt_ipvs, to do source NAT, e.g.:
> 
> % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 8080 \
>> -j SNAT --to-source 192.168.10.10
> 

> +static void
> +ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp)
> +{
> +	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
> +
> +	if (ct == NULL || ct == &nf_conntrack_untracked ||
> +	    nf_ct_is_confirmed(ct))
> +		return;
> +
> +	/*
> +	 * The connection is not yet in the hashtable, so we update it.
> +	 * CIP->VIP will remain the same, so leave the tuple in
> +	 * IP_CT_DIR_ORIGINAL untouched.  When the reply comes back from the
> +	 * real-server we will see RIP->DIP.
> +	 */
> +	ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u3 = cp->daddr;
> +	/*
> +	 * This will also take care of UDP and other protocols.
> +	 */
> +	ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u.tcp.port = cp->dport;
> +}

How does IPVS interact with conntrack helpers? If it does actually
intend to use them (which will happen automatically), it might make
sense to use nf_conntrack_alter_reply(), which will perform a new
helper lookup based on the changed tuple.
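
The difference between assigning the reply tuple directly and going through an alter-reply step can be shown with a toy model. This is a userland sketch with hypothetical `model_*` types, not the conntrack API: the helper table keyed on the reply-side source port is invented for illustration, and `model_alter_reply()` only imitates the behaviour described above (replace the tuple, then look the helper up again).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for conntrack structures. */
struct model_tuple {
	uint32_t src_addr;
	uint16_t src_port;
};

struct model_conn {
	struct model_tuple reply;
	const char *helper;	/* NULL when no helper is attached */
};

/* Toy helper table keyed on the reply-side source port (illustrative). */
static const char *model_helper_lookup(const struct model_tuple *reply)
{
	return reply->src_port == 21 ? "ftp" : NULL;
}

/* Direct assignment: the helper decision is left untouched. */
static void model_update_direct(struct model_conn *ct,
				uint32_t raddr, uint16_t rport)
{
	ct->reply.src_addr = raddr;
	ct->reply.src_port = rport;
}

/* Alter-reply style: replace the tuple, then redo the helper lookup. */
static void model_alter_reply(struct model_conn *ct,
			      const struct model_tuple *newreply)
{
	ct->reply = *newreply;
	ct->helper = model_helper_lookup(&ct->reply);
}
```

In the model, a connection whose virtual port had no helper stays helper-less after `model_update_direct()` even when the real-server port would match one, while `model_alter_reply()` picks the helper up.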



* Re: [PATCH 1/3] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Patrick McHardy @ 2009-09-02 14:54 UTC (permalink / raw)
  To: Hannes Eder
  Cc: lvs-devel, linux-kernel, netdev, netfilter-devel,
	Fabien Duchêne, Jan Engelhardt, Jean-Luc Fortemaison,
	Julian Anastasov, Julius Volz, Laurent Grawet, Simon Horman,
	Wensong Zhang
In-Reply-To: <20090902101527.11561.59498.stgit@jazzy.zrh.corp.google.com>

Hannes Eder wrote:
> This implements the kernel-space side of the netfilter matcher
> xt_ipvs.

Looks mostly fine to me, just one question:

> +bool ipvs_mt(const struct sk_buff *skb, const struct xt_match_param *par)
> +{
> +	const struct xt_ipvs *data = par->matchinfo;
> +	const u_int8_t family = par->family;
> +	struct ip_vs_iphdr iph;
> +	struct ip_vs_protocol *pp;
> +	struct ip_vs_conn *cp;
> +	int af;
> +	bool match = true;
> +
> +	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
> +		match = skb->ipvs_property ^
> +			!!(data->invert & XT_IPVS_IPVS_PROPERTY);
> +		goto out;
> +	}
> +
> +	/* other flags than XT_IPVS_IPVS_PROPERTY are set */
> +	if (!skb->ipvs_property) {
> +		match = false;
> +		goto out;
> +	}
> +
> +	switch (skb->protocol) {
> +	case  htons(ETH_P_IP):
> +		af = AF_INET;
> +		break;
> +#ifdef CONFIG_IP_VS_IPV6
> +	case  htons(ETH_P_IPV6):
> +		af = AF_INET6;
> +		break;
> +#endif
> +	default:
> +		match = false;
> +		goto out;
> +	}

In the NF_INET_LOCAL_OUT hook skb->protocol is invalid. So if you
don't need this, it would make sense to restrict the match to the
other hooks.

Even easier would be to use par->family, which contains the address
family and doesn't need any translation.


* Re: 2.6.31 ARP related problems
From: Alexander Duyck @ 2009-09-02 14:47 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, netdev, Eric Dumazet, Duyck, Alexander H,
	Kirsher, Jeffrey T, David Miller
In-Reply-To: <4A9E62FB.6090000@Voltaire.com>

I don't suspect this has much of an effect on the Virtualization use
case for SR-IOV since the VFs are meant to be direct assigned as PCI
devices to the individual VMs so they won't even show up in the
routing table.  For the most part the igbvf driver typically will be
black listed for the host kernel since it already has the PF
interface, and the driver will only be loaded on the guests.

You can probably also reproduce the issue by placing multiple physical
network interfaces on the same network segment if you saw the same
effect on SR-IOV since that is essentially the effect the VFs create
due to the switching logic built into the 82576.

Thanks,

Alex

On Wed, Sep 2, 2009 at 5:20 AM, Or Gerlitz<ogerlitz@voltaire.com> wrote:
> Eric W. Biederman wrote:
>> I just tested.  If the two macvlans are in separate network namespaces all is well,
>> so definitely not macvlan. As you have observed there are no real changes to arp.c
>
> Yes, it's not macvlan to blame, I just tested it in SR-IOV scheme with igb/igbvf and
> I see the same problem, only ping that goes through / targeted to the IP address of the first
> VF device routing hit works, which means SR-IOV isn't really usable with 2.6.31 when you want
> a multiple VMs scheme, each attached to a different VF and all VMs on the same network segment.
>
>
> Or.


* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
From: Eric Dumazet @ 2009-09-02 14:43 UTC (permalink / raw)
  To: David Miller; +Cc: cl, sri, dlstevens, netdev, niv, mtk.manpages
In-Reply-To: <20090901.184121.06750444.davem@davemloft.net>

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 31 Aug 2009 14:09:50 +0200
> 
>> Re-reading again this stuff, I realized ip6_push_pending_frames()
>> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
>>
>> May I suggest following path :
>>
>> 1) Correct ip6_push_pending_frames() to properly
>> account for dropped-by-qdisc frames when IP_RECVERR is set
> 
> Your patch is  applied to net-next-2.6, thanks!
> 
>> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
>> but still return a OK to user application, to not break them ?
> 
> Sounds good.
> 
> I think if you sample random UDP applications, you will find that such
> errors will upset them terribly, make them log tons of crap to
> /var/log/messages et al., and consume tons of CPU.
> 
> And in such cases silent ignoring of drops is entirely appropriate and
> optimal, which supports our current behavior.
> 
> If we are to make such applications "more sophisticated", such
> converted apps can be indicated simply by their use of IP_RECVERR.
> 
> If you want to be notified of all asynchronous errors we can detect,
> you use this, end of story.  It is the only way to handle this
> situation without breaking the world.
> 
> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
> accurate, and wise.  And he understood all of this 10+ years ago.

Thanks David, here is the 2nd patch then :


[PATCH net-next-2.6] ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at qdisc level were not
accounted in SNMP counters. Only if the application sets IP_RECVERR are
drops reported to the user (-ENOBUFS errors) and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.

This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.
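
The accounting rule can be modelled outside the kernel as follows. This is a sketch that follows the shape of the ip_push_pending_frames() hunk in the diff; `model_mib`, `model_push` and the `MODEL_XMIT_*` values are hypothetical stand-ins, and the positive-code mapping is assumed to match net_xmit_errno().

```c
#include <errno.h>

/* Illustrative stand-ins for positive queueing return codes. */
enum { MODEL_XMIT_DROP = 1, MODEL_XMIT_CN = 2 };

struct model_mib {
	unsigned long out_discards;	/* role of IPSTATS_MIB_OUTDISCARDS */
};

/*
 * Count every real drop, but only report it to the caller when
 * recverr is set.
 */
static int model_push(struct model_mib *mib, int local_out_result,
		      int recverr)
{
	int err = local_out_result;

	if (err) {
		if (err > 0) {
			/* Assumed mapping for positive queueing codes. */
			err = (err != MODEL_XMIT_CN) ? -ENOBUFS : 0;
			if (err && !recverr) {
				mib->out_discards++;	/* counted ... */
				err = 0;		/* ... but hidden */
			}
		}
		if (err)
			mib->out_discards++;
	}
	return err;
}
```

A drop therefore bumps the counter exactly once on either path; only the value returned to the caller depends on `recverr`.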

Example after a UDP tx flood
# netstat -s 
...
IP:
    1487048 outgoing packets dropped
...
Udp:
...
    SndbufErrors: 1487048


send() syscalls do, however, still return an OK status, so as not to
break applications.

Note : send() manual page explicitly says for -ENOBUFS error :

 "The output queue for a network interface was full.
  This generally indicates that the interface has stopped sending,
  but may be caused by transient congestion.
  (Normally, this does not occur in Linux. Packets are just silently
  dropped when a device queue overflows.) "

This is not true for IP_RECVERR enabled sockets : a send() syscall
that hit a qdisc drop returns an ENOBUFS error.
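
For reference, an application opts in to this behaviour with a single socket option. A minimal Linux userland sketch; `udp_socket_with_recverr` is a hypothetical helper name, while socket(), setsockopt() and the IP_RECVERR option are the standard interfaces.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create an IPv4 UDP socket with IP_RECVERR enabled.
 * Returns the descriptor, or -1 on failure.
 */
static int udp_socket_with_recverr(void)
{
	int on = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

On such a socket the application should be prepared for send() to fail with ENOBUFS under transmit overload, as described above; sockets without the option keep seeing success.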

Many thanks to Christoph, David, and last but not least, Alexey !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/ip.h      |    2 +-
 include/net/ipv6.h    |    2 +-
 include/net/udp.h     |    2 +-
 net/ipv4/icmp.c       |    2 +-
 net/ipv4/ip_output.c  |   19 ++++++++++---------
 net/ipv4/raw.c        |   14 ++++++++++----
 net/ipv4/udp.c        |   20 +++++++++++++-------
 net/ipv6/icmp.c       |    2 +-
 net/ipv6/ip6_output.c |   18 +++++++++++-------
 net/ipv6/raw.c        |   15 ++++++++++-----
 net/ipv6/udp.c        |   14 ++++++++++----
 11 files changed, 69 insertions(+), 41 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 72c3692..9dd19a8 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -116,7 +116,7 @@ extern int		ip_append_data(struct sock *sk,
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
 				int offset, size_t size, int flags);
-extern int		ip_push_pending_frames(struct sock *sk);
+extern int		ip_push_pending_frames(struct sock *sk, int recverr);
 extern void		ip_flush_pending_frames(struct sock *sk);
 
 /* datagram.c */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ad9a511..f514257 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -498,7 +498,7 @@ extern int			ip6_append_data(struct sock *sk,
 						struct rt6_info *rt,
 						unsigned int flags);
 
-extern int			ip6_push_pending_frames(struct sock *sk);
+extern int			ip6_push_pending_frames(struct sock *sk, int recverr);
 
 extern void			ip6_flush_pending_frames(struct sock *sk);
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 5fb029f..a60ef10 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -145,7 +145,7 @@ extern int 	udp_lib_getsockopt(struct sock *sk, int level, int optname,
 			           char __user *optval, int __user *optlen);
 extern int 	udp_lib_setsockopt(struct sock *sk, int level, int optname,
 				   char __user *optval, int optlen,
-				   int (*push_pending_frames)(struct sock *));
+				   int (*push_pending_frames)(struct sock *, int));
 
 extern struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
 				    __be32 daddr, __be16 dport,
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 97c410e..f46a53c 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -345,7 +345,7 @@ static void icmp_push_reply(struct icmp_bxm *icmp_param,
 						 icmp_param->head_len, csum);
 		icmph->checksum = csum_fold(csum);
 		skb->ip_summed = CHECKSUM_NONE;
-		ip_push_pending_frames(sk);
+		ip_push_pending_frames(sk, 0);
 	}
 }
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..8f81dab 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1216,7 +1216,7 @@ static void ip_cork_release(struct inet_sock *inet)
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-int ip_push_pending_frames(struct sock *sk)
+int ip_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1301,19 +1301,20 @@ int ip_push_pending_frames(struct sock *sk)
 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
 	if (err) {
-		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
+		if (err > 0) {
+			err = net_xmit_errno(err);
+			if (err && !recverr) {
+				IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+				err = 0;
+			}
+		}
 		if (err)
-			goto error;
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 	}
 
 out:
 	ip_cork_release(inet);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 /*
@@ -1412,7 +1413,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
 			  arg->csumoffset) = csum_fold(csum_add(skb->csum,
 								arg->csum));
 		skb->ip_summed = CHECKSUM_NONE;
-		ip_push_pending_frames(sk);
+		ip_push_pending_frames(sk, 0);
 	}
 
 	bh_unlock_sock(sk);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2979f14..444c465 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -374,8 +374,13 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 
 	err = NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
-	if (err > 0)
-		err = inet->recverr ? net_xmit_errno(err) : 0;
+	if (err > 0) {
+		err = net_xmit_errno(err);
+		if (!inet->recverr && err) {
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+			err = 0;
+		}
+	}
 	if (err)
 		goto error;
 out:
@@ -576,8 +581,9 @@ back_from_confirm:
 					&ipc, &rt, msg->msg_flags);
 		if (err)
 			ip_flush_pending_frames(sk);
-		else if (!(msg->msg_flags & MSG_MORE))
-			err = ip_push_pending_frames(sk);
+		else if (!(msg->msg_flags & MSG_MORE)) {
+			err = ip_push_pending_frames(sk, inet->recverr);
+		}
 		release_sock(sk);
 	}
 done:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 29ebb0d..6a6bf1d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -513,7 +513,7 @@ static void udp4_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
 /*
  * Push out all pending data as one UDP datagram. Socket is locked.
  */
-static int udp_push_pending_frames(struct sock *sk)
+static int udp_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct udp_sock  *up = udp_sk(sk);
 	struct inet_sock *inet = inet_sk(sk);
@@ -560,7 +560,7 @@ static int udp_push_pending_frames(struct sock *sk)
 		uh->check = CSUM_MANGLED_0;
 
 send:
-	err = ip_push_pending_frames(sk);
+	err = ip_push_pending_frames(sk, recverr);
 out:
 	up->len = 0;
 	up->pending = 0;
@@ -752,8 +752,14 @@ do_append_data:
 			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
 		udp_flush_pending_frames(sk);
-	else if (!corkreq)
-		err = udp_push_pending_frames(sk);
+	else if (!corkreq) {
+		err = udp_push_pending_frames(sk, 1);
+		if (err == -ENOBUFS && !inet->recverr) {
+			UDP_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	}
 	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
 		up->pending = 0;
 	release_sock(sk);
@@ -826,7 +832,7 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 
 	up->len += size;
 	if (!(up->corkflag || (flags&MSG_MORE)))
-		ret = udp_push_pending_frames(sk);
+		ret = udp_push_pending_frames(sk, inet_sk(sk)->recverr);
 	if (!ret)
 		ret = size;
 out:
@@ -1354,7 +1360,7 @@ void udp_destroy_sock(struct sock *sk)
  */
 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		       char __user *optval, int optlen,
-		       int (*push_pending_frames)(struct sock *))
+		       int (*push_pending_frames)(struct sock *, int))
 {
 	struct udp_sock *up = udp_sk(sk);
 	int val;
@@ -1374,7 +1380,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		} else {
 			up->corkflag = 0;
 			lock_sock(sk);
-			(*push_pending_frames)(sk);
+			(*push_pending_frames)(sk, 0);
 			release_sock(sk);
 		}
 		break;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index e2325f6..a9c54c2 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -253,7 +253,7 @@ static int icmpv6_push_pending_frames(struct sock *sk, struct flowi *fl, struct
 						      len, fl->proto,
 						      tmp_csum);
 	}
-	ip6_push_pending_frames(sk);
+	ip6_push_pending_frames(sk, 0);
 out:
 	return err;
 }
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a931229..ade5707 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1440,7 +1440,7 @@ static void ip6_cork_release(struct inet_sock *inet, struct ipv6_pinfo *np)
 	memset(&inet->cork.fl, 0, sizeof(inet->cork.fl));
 }
 
-int ip6_push_pending_frames(struct sock *sk)
+int ip6_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1510,18 +1510,22 @@ int ip6_push_pending_frames(struct sock *sk)
 
 	err = ip6_local_out(skb);
 	if (err) {
-		if (err > 0)
-			err = np->recverr ? net_xmit_errno(err) : 0;
+		if (err > 0) {
+			err = net_xmit_errno(err);
+			if (err && !recverr) {
+				IP6_INC_STATS(net, rt->rt6i_idev,
+					      IPSTATS_MIB_OUTDISCARDS);
+				err = 0;
+			}
+		}
 		if (err)
-			goto error;
+			IP6_INC_STATS(net, rt->rt6i_idev,
+				      IPSTATS_MIB_OUTDISCARDS);
 	}
 
 out:
 	ip6_cork_release(inet, np);
 	return err;
-error:
-	IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 void ip6_flush_pending_frames(struct sock *sk)
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5068410..d054fa2 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -523,7 +523,7 @@ csum_copy_err:
 }
 
 static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
-				     struct raw6_sock *rp)
+				     struct raw6_sock *rp, int recverr)
 {
 	struct sk_buff *skb;
 	int err = 0;
@@ -595,7 +595,7 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
 		BUG();
 
 send:
-	err = ip6_push_pending_frames(sk);
+	err = ip6_push_pending_frames(sk, recverr);
 out:
 	return err;
 }
@@ -641,8 +641,13 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	IP6_UPD_PO_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
 	err = NF_HOOK(PF_INET6, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
-	if (err > 0)
-		err = np->recverr ? net_xmit_errno(err) : 0;
+	if (err > 0) {
+		err = net_xmit_errno(err);
+		if (!np->recverr && err) {
+			IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
+			err = 0;
+		}
+	}
 	if (err)
 		goto error;
 out:
@@ -895,7 +900,7 @@ back_from_confirm:
 		if (err)
 			ip6_flush_pending_frames(sk);
 		else if (!(msg->msg_flags & MSG_MORE))
-			err = rawv6_push_pending_frames(sk, &fl, rp);
+			err = rawv6_push_pending_frames(sk, &fl, rp, np->recverr);
 		release_sock(sk);
 	}
 done:
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 20d2ffc..963dd0a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -683,7 +683,7 @@ static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
  *	Sending
  */
 
-static int udp_v6_push_pending_frames(struct sock *sk)
+static int udp_v6_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb;
 	struct udphdr *uh;
@@ -723,7 +723,7 @@ static int udp_v6_push_pending_frames(struct sock *sk)
 		uh->check = CSUM_MANGLED_0;
 
 send:
-	err = ip6_push_pending_frames(sk);
+	err = ip6_push_pending_frames(sk, recverr);
 out:
 	up->len = 0;
 	up->pending = 0;
@@ -975,8 +975,14 @@ do_append_data:
 		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
 		udp_v6_flush_pending_frames(sk);
-	else if (!corkreq)
-		err = udp_v6_push_pending_frames(sk);
+	else if (!corkreq) {
+		err = udp_v6_push_pending_frames(sk, 1);
+		if (err == -ENOBUFS && !np->recverr) {
+			UDP6_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	}
 	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
 		up->pending = 0;
 

* [PATCH 3/3] libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
From: Hannes Eder @ 2009-09-02 14:41 UTC (permalink / raw)
  To: lvs-devel
  Cc: linux-kernel, netdev, netfilter-devel, Fabien Duchêne,
	Jan Engelhardt, Jean-Luc Fortemaison, Julian Anastasov,
	Julius Volz, Laurent Grawet, Patrick McHardy, Simon Horman,
	Wensong Zhang
In-Reply-To: <20090902101417.11561.45663.stgit@jazzy.zrh.corp.google.com>

The user-space library for the netfilter matcher xt_ipvs.

Signed-off-by: Hannes Eder <heder@google.com>
---
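
As a rough illustration of the option bookkeeping done in ipvs_mt_parse()
below, here is a minimal, self-contained sketch.  The flag values are copied
from xt_ipvs.h in this patch; struct demo_state and demo_record_option() are
hypothetical names used only for this example, and the function returns false
where the real parser would call xtables_error().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values copied from include/linux/netfilter/xt_ipvs.h in this patch. */
#define XT_IPVS_IPVS_PROPERTY	0x01
#define XT_IPVS_PROTO		0x02
#define XT_IPVS_VADDR		0x04
#define XT_IPVS_VPORT		0x08
#define XT_IPVS_DIR		0x10
#define XT_IPVS_METHOD		0x20
#define XT_IPVS_MASK		(0x40 - 1)
#define XT_IPVS_ONCE_MASK	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)

/* Hypothetical stand-in for the parser's *flags plus the bitmask/invert
 * fields of struct xt_ipvs. */
struct demo_state {
	unsigned int flags;	/* options seen so far */
	unsigned char bitmask;	/* properties the rule checks */
	unsigned char invert;	/* properties given with "!" */
};

/* Record one option; returns false where the real code reports an error. */
static bool demo_record_option(struct demo_state *s, unsigned int op,
			       bool invert)
{
	assert(s != NULL);

	/* Every option except --ipvs may be given only once. */
	if (s->flags & op & XT_IPVS_ONCE_MASK)
		return false;

	if (op & XT_IPVS_ONCE_MASK) {
		/* "! --ipvs" cannot be combined with the other options. */
		if (s->invert & XT_IPVS_IPVS_PROPERTY)
			return false;
		/* Any other option implies --ipvs. */
		s->bitmask |= XT_IPVS_IPVS_PROPERTY;
	}

	s->bitmask |= op;
	if (invert)
		s->invert |= op;
	s->flags |= op;
	return true;
}
```

Calling demo_record_option() once per command-line option, in order, mirrors
how the parse callback accumulates state across invocations.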

 configure.ac                      |   11 +
 extensions/libxt_ipvs.c           |  349 +++++++++++++++++++++++++++++++++++++
 extensions/libxt_ipvs.man         |   21 ++
 include/linux/netfilter/xt_ipvs.h |   23 ++
 4 files changed, 401 insertions(+), 3 deletions(-)
 create mode 100644 extensions/libxt_ipvs.c
 create mode 100644 extensions/libxt_ipvs.man
 create mode 100644 include/linux/netfilter/xt_ipvs.h

diff --git a/configure.ac b/configure.ac
index bc74efe..e55ab43 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1,4 +1,3 @@
-
 AC_INIT([iptables], [1.4.4])
 
 # See libtool.info "Libtool's versioning system"
@@ -47,12 +46,18 @@ AC_ARG_WITH([pkgconfigdir], AS_HELP_STRING([--with-pkgconfigdir=PATH],
 	[Path to the pkgconfig directory [[LIBDIR/pkgconfig]]]),
 	[pkgconfigdir="$withval"], [pkgconfigdir='${libdir}/pkgconfig'])
 
-AC_CHECK_HEADER([linux/dccp.h])
-
 blacklist_modules="";
+
+AC_CHECK_HEADER([linux/dccp.h])
 if test "$ac_cv_header_linux_dccp_h" != "yes"; then
 	blacklist_modules="$blacklist_modules dccp";
 fi;
+
+AC_CHECK_HEADER([linux/ip_vs.h])
+if test "$ac_cv_header_linux_ip_vs_h" != "yes"; then
+	blacklist_modules="$blacklist_modules ipvs";
+fi;
+
 AC_SUBST([blacklist_modules])
 
 AM_CONDITIONAL([ENABLE_STATIC], [test "$enable_static" = "yes"])
diff --git a/extensions/libxt_ipvs.c b/extensions/libxt_ipvs.c
new file mode 100644
index 0000000..9fd007f
--- /dev/null
+++ b/extensions/libxt_ipvs.c
@@ -0,0 +1,349 @@
+/* Shared library add-on to iptables to add IPVS matching.
+ *
+ * Detailed doc is in the kernel module source net/netfilter/xt_ipvs.c
+ *
+ * Author: Hannes Eder <heder@google.com>
+ */
+#include <sys/types.h>
+#include <assert.h>
+#include <ctype.h>
+#include <errno.h>
+#include <getopt.h>
+#include <netdb.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <xtables.h>
+#include <linux/ip_vs.h>
+#include <linux/netfilter/xt_ipvs.h>
+
+static const struct option ipvs_mt_opts[] = {
+	{ .name = "ipvs",    .has_arg = false, .val = '0' },
+	{ .name = "vproto",  .has_arg = true,  .val = '1' },
+	{ .name = "vaddr",   .has_arg = true,  .val = '2' },
+	{ .name = "vport",   .has_arg = true,  .val = '3' },
+	{ .name = "vdir",    .has_arg = true,  .val = '4' },
+	{ .name = "vmethod", .has_arg = true,  .val = '5' },
+	{ .name = NULL }
+};
+
+static void ipvs_mt_help(void)
+{
+	printf(
+"IPVS match options:\n"
+"[!] --ipvs                      packet belongs to an IPVS connection\n"
+"\n"
+"Any of the following options implies --ipvs (even negated)\n"
+"[!] --vproto protocol           VIP protocol to match; by number or name,\n"
+"                                e.g. \"tcp\"\n"
+"[!] --vaddr address[/mask]      VIP address to match\n"
+"[!] --vport port                VIP port to match; by number or name,\n"
+"                                e.g. \"http\"\n"
+"    --vdir {ORIGINAL|REPLY}     flow direction of packet\n"
+"[!] --vmethod {GATE|IPIP|MASQ}  IPVS forwarding method used\n"
+		);
+}
+
+static void ipvs_mt_parse_addr_and_mask(const char *arg,
+					union nf_inet_addr *address,
+					union nf_inet_addr *mask,
+					unsigned int family)
+{
+	struct in_addr *addr = NULL;
+	struct in6_addr *addr6 = NULL;
+	unsigned int naddrs = 0;
+
+	if (family == NFPROTO_IPV4) {
+		xtables_ipparse_any(arg, &addr, &mask->in, &naddrs);
+		if (naddrs > 1)
+			xtables_error(PARAMETER_PROBLEM,
+				      "multiple IP addresses not allowed");
+		if (naddrs == 1)
+			memcpy(&address->in, addr, sizeof(*addr));
+	} else if (family == NFPROTO_IPV6) {
+		xtables_ip6parse_any(arg, &addr6, &mask->in6, &naddrs);
+		if (naddrs > 1)
+			xtables_error(PARAMETER_PROBLEM,
+				      "multiple IP addresses not allowed");
+		if (naddrs == 1)
+			memcpy(&address->in6, addr6, sizeof(*addr6));
+	} else {
+		/* Hu? */
+		assert(false);
+	}
+}
+
+/* Function which parses command options; returns true if it ate an option */
+static int ipvs_mt_parse(int c, char **argv, int invert, unsigned int *flags,
+			 const void *entry, struct xt_entry_match **match,
+			 unsigned int family)
+{
+	struct xt_ipvs *data = (void *)(*match)->data;
+	char *p = NULL;
+	u_int8_t op = 0;
+
+	if ('0' <= c && c <= '5') {
+		int ops[] = {
+			XT_IPVS_IPVS_PROPERTY,
+			XT_IPVS_PROTO,
+			XT_IPVS_VADDR,
+			XT_IPVS_VPORT,
+			XT_IPVS_DIR,
+			XT_IPVS_METHOD
+		};
+		op = ops[c - '0'];
+	} else
+		return 0;
+
+	if (*flags & op & XT_IPVS_ONCE_MASK)
+		goto multiple_use;
+
+	switch (c) {
+	case '0': /* --ipvs */
+		/* Nothing to do here. */
+		break;
+
+	case '1': /* --vproto */
+		/* Canonicalize into lower case */
+		for (p = optarg; *p != '\0'; ++p)
+			*p = tolower(*p);
+
+		data->l4proto = xtables_parse_protocol(optarg);
+		break;
+
+	case '2': /* --vaddr */
+		ipvs_mt_parse_addr_and_mask(optarg, &data->vaddr,
+					    &data->vmask, family);
+		break;
+
+	case '3': /* --vport */
+		data->vport = htons(xtables_parse_port(optarg, "tcp"));
+		break;
+
+	case '4': /* --vdir */
+		xtables_param_act(XTF_NO_INVERT, "ipvs", "--vdir", invert);
+		if (strcasecmp(optarg, "ORIGINAL") == 0) {
+			data->bitmask |= XT_IPVS_DIR;
+			data->invert   &= ~XT_IPVS_DIR;
+		} else if (strcasecmp(optarg, "REPLY") == 0) {
+			data->bitmask |= XT_IPVS_DIR;
+			data->invert  |= XT_IPVS_DIR;
+		} else {
+			xtables_param_act(XTF_BAD_VALUE,
+					  "ipvs", "--vdir", optarg);
+		}
+		break;
+
+	case '5': /* --vmethod */
+		if (strcasecmp(optarg, "GATE") == 0)
+			data->fwd_method = IP_VS_CONN_F_DROUTE;
+		else if (strcasecmp(optarg, "IPIP") == 0)
+			data->fwd_method = IP_VS_CONN_F_TUNNEL;
+		else if (strcasecmp(optarg, "MASQ") == 0)
+			data->fwd_method = IP_VS_CONN_F_MASQ;
+		else
+			xtables_param_act(XTF_BAD_VALUE,
+					  "ipvs", "--vmethod", optarg);
+		break;
+
+	default:
+		/* Hu? How did we come here? */
+		assert(false);
+		return 0;
+	}
+
+	if (op & XT_IPVS_ONCE_MASK) {
+		if (data->invert & XT_IPVS_IPVS_PROPERTY)
+			xtables_error(PARAMETER_PROBLEM,
+				      "! --ipvs cannot be used together"
+				      " with other options");
+		data->bitmask |= XT_IPVS_IPVS_PROPERTY;
+	}
+
+	data->bitmask |= op;
+	if (invert)
+		data->invert |= op;
+	*flags |= op;
+	return 1;
+
+multiple_use:
+	xtables_error(PARAMETER_PROBLEM,
+		      "multiple use of the same IPVS option is not allowed");
+}
+
+static int ipvs_mt4_parse(int c, char **argv, int invert, unsigned int *flags,
+			  const void *entry, struct xt_entry_match **match)
+{
+	return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+			     NFPROTO_IPV4);
+}
+
+static int ipvs_mt6_parse(int c, char **argv, int invert, unsigned int *flags,
+			  const void *entry, struct xt_entry_match **match)
+{
+	return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+			     NFPROTO_IPV6);
+}
+
+static void ipvs_mt_check(unsigned int flags)
+{
+	if (flags == 0)
+		xtables_error(PARAMETER_PROBLEM,
+			      "IPVS: At least one option is required");
+}
+
+/* Shamelessly copied from libxt_conntrack.c */
+static void ipvs_mt_dump_addr(const union nf_inet_addr *addr,
+			      const union nf_inet_addr *mask,
+			      unsigned int family, bool numeric)
+{
+	char buf[BUFSIZ];
+
+	if (family == NFPROTO_IPV4) {
+		if (!numeric && addr->ip == 0) {
+			printf("anywhere ");
+			return;
+		}
+		if (numeric)
+			strcpy(buf, xtables_ipaddr_to_numeric(&addr->in));
+		else
+			strcpy(buf, xtables_ipaddr_to_anyname(&addr->in));
+		strcat(buf, xtables_ipmask_to_numeric(&mask->in));
+		printf("%s ", buf);
+	} else if (family == NFPROTO_IPV6) {
+		if (!numeric && addr->ip6[0] == 0 && addr->ip6[1] == 0 &&
+		    addr->ip6[2] == 0 && addr->ip6[3] == 0) {
+			printf("anywhere ");
+			return;
+		}
+		if (numeric)
+			strcpy(buf, xtables_ip6addr_to_numeric(&addr->in6));
+		else
+			strcpy(buf, xtables_ip6addr_to_anyname(&addr->in6));
+		strcat(buf, xtables_ip6mask_to_numeric(&mask->in6));
+		printf("%s ", buf);
+	}
+}
+
+static void ipvs_mt_dump(const void *ip, const struct xt_ipvs *data,
+			 unsigned int family, bool numeric, const char *prefix)
+{
+	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+		if (data->invert & XT_IPVS_IPVS_PROPERTY)
+			printf("! ");
+		printf("%sipvs ", prefix);
+	}
+
+	if (data->bitmask & XT_IPVS_PROTO) {
+		if (data->invert & XT_IPVS_PROTO)
+			printf("! ");
+		printf("%svproto %u ", prefix, data->l4proto);
+	}
+
+	if (data->bitmask & XT_IPVS_VADDR) {
+		if (data->invert & XT_IPVS_VADDR)
+			printf("! ");
+
+		printf("%svaddr ", prefix);
+		ipvs_mt_dump_addr(&data->vaddr, &data->vmask, family, numeric);
+	}
+
+	if (data->bitmask & XT_IPVS_VPORT) {
+		if (data->invert & XT_IPVS_VPORT)
+			printf("! ");
+
+		printf("%svport %u ", prefix, ntohs(data->vport));
+	}
+
+	if (data->bitmask & XT_IPVS_DIR) {
+		if (data->invert & XT_IPVS_DIR)
+			printf("%svdir REPLY ", prefix);
+		else
+			printf("%svdir ORIGINAL ", prefix);
+	}
+
+	if (data->bitmask & XT_IPVS_METHOD) {
+		if (data->invert & XT_IPVS_METHOD)
+			printf("! ");
+
+		printf("%svmethod ", prefix);
+		switch (data->fwd_method) {
+		case IP_VS_CONN_F_DROUTE:
+			printf("GATE ");
+			break;
+		case IP_VS_CONN_F_TUNNEL:
+			printf("IPIP ");
+			break;
+		case IP_VS_CONN_F_MASQ:
+			printf("MASQ ");
+			break;
+		default:
+			/* Hu? */
+			printf("UNKNOWN ");
+			break;
+		}
+	}
+}
+
+static void ipvs_mt4_print(const void *ip, const struct xt_entry_match *match,
+			   int numeric)
+{
+	const struct xt_ipvs *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV4, numeric, "");
+}
+
+static void ipvs_mt6_print(const void *ip, const struct xt_entry_match *match,
+			   int numeric)
+{
+	const struct xt_ipvs *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV6, numeric, "");
+}
+
+static void ipvs_mt4_save(const void *ip, const struct xt_entry_match *match)
+{
+	const struct xt_ipvs *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV4, true, "--");
+}
+
+static void ipvs_mt6_save(const void *ip, const struct xt_entry_match *match)
+{
+	const struct xt_ipvs *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV6, true, "--");
+}
+
+static struct xtables_match ipvs_matches_reg[] = {
+	{
+		.version       = XTABLES_VERSION,
+		.name          = "ipvs",
+		.revision      = 0,
+		.family        = NFPROTO_IPV4,
+		.size          = XT_ALIGN(sizeof(struct xt_ipvs)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_ipvs)),
+		.help          = ipvs_mt_help,
+		.parse         = ipvs_mt4_parse,
+		.final_check   = ipvs_mt_check,
+		.print         = ipvs_mt4_print,
+		.save          = ipvs_mt4_save,
+		.extra_opts    = ipvs_mt_opts,
+	},
+	{
+		.version       = XTABLES_VERSION,
+		.name          = "ipvs",
+		.revision      = 0,
+		.family        = NFPROTO_IPV6,
+		.size          = XT_ALIGN(sizeof(struct xt_ipvs)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_ipvs)),
+		.help          = ipvs_mt_help,
+		.parse         = ipvs_mt6_parse,
+		.final_check   = ipvs_mt_check,
+		.print         = ipvs_mt6_print,
+		.save          = ipvs_mt6_save,
+		.extra_opts    = ipvs_mt_opts,
+	},
+};
+
+void _init(void)
+{
+	xtables_register_matches(ipvs_matches_reg,
+				 ARRAY_SIZE(ipvs_matches_reg));
+}
diff --git a/extensions/libxt_ipvs.man b/extensions/libxt_ipvs.man
new file mode 100644
index 0000000..7fe915f
--- /dev/null
+++ b/extensions/libxt_ipvs.man
@@ -0,0 +1,21 @@
+Match IPVS connection properties.
+.TP
+[\fB!\fR] \fB\-\-ipvs\fP
+packet belongs to an IPVS connection
+.TP
+Any of the following options implies \-\-ipvs (even negated)
+.TP
+[\fB!\fR] \fB\-\-vproto\fP \fIprotocol\fP
+VIP protocol to match; by number or name, e.g. "tcp"
+.TP
+[\fB!\fR] \fB\-\-vaddr\fP \fIaddress\fP[\fB/\fP\fImask\fP]
+VIP address to match
+.TP
+[\fB!\fR] \fB\-\-vport\fP \fIport\fP
+VIP port to match; by number or name, e.g. "http"
+.TP
+\fB\-\-vdir\fP {\fBORIGINAL\fP|\fBREPLY\fP}
+flow direction of packet
+.TP
+[\fB!\fR] \fB\-\-vmethod\fP {\fBGATE\fP|\fBIPIP\fP|\fBMASQ\fP}
+IPVS forwarding method used
diff --git a/include/linux/netfilter/xt_ipvs.h b/include/linux/netfilter/xt_ipvs.h
new file mode 100644
index 0000000..eb09759
--- /dev/null
+++ b/include/linux/netfilter/xt_ipvs.h
@@ -0,0 +1,23 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H 1
+
+#define XT_IPVS_IPVS_PROPERTY	0x01 /* this is implied by all other options */
+#define XT_IPVS_PROTO		0x02
+#define XT_IPVS_VADDR		0x04
+#define XT_IPVS_VPORT		0x08
+#define XT_IPVS_DIR		0x10
+#define XT_IPVS_METHOD		0x20
+#define XT_IPVS_MASK		(0x40 - 1)
+#define XT_IPVS_ONCE_MASK	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
+
+struct xt_ipvs {
+	union nf_inet_addr	vaddr, vmask;
+	__be16			vport;
+	__u16			l4proto;
+	__u16			fwd_method;
+
+	__u8			invert;
+	__u8			bitmask;
+};
+
+#endif /* _XT_IPVS_H */


* Re: ipsec not forwarding (suspect SA issue)
From: Andrew Dickinson @ 2009-09-02 14:41 UTC (permalink / raw)
  To: netdev
In-Reply-To: <606676310909011157g9ce5377gabc30a63da897049@mail.gmail.com>

Just to follow up on this: it was an SA issue, but one caused by my
config, not a kernel problem. :D

Specifically, I changed "unique" to "require" in the spdadd lines and
dropped the "fwd" entry that I'd manually added.  That straightened
everything out. :D

-A

On Tue, Sep 1, 2009 at 11:57 AM, Andrew Dickinson<andrew@whydna.net> wrote:
> Howdy netdev,
>
> First, I'm not positive that this is the right list for this question,
> so feel free to steer me in the right direction.  I'm trying to work
> out an issue with ipsec not forwarding traffic from my LAN down my
> tunnel.  I've walked through the troubleshooting-doc on the lartc site
> and everything seems kosher...
>
> Here's my setup.
>
> I've got a linux-based router/firewall on the edge of my network with
> two interfaces, $WAN and $LAN.  The router is MASQUERADING to the
> internet.  My LAN is 10.0.0.0/24.  I'm trying to peer with a remote
> network 10.254.0.0/23.  The remote network does not have internet
> connectivity, so all non-10.254/23 traffic should traverse the VPN to
> get to my router to go to the internet or my local LAN.
>
> I'm using racoon and setkey to establish the VPN tunnel and BGP (via
> quagga) to advertise routes into the remote network.  The routers are
> using 169.254.255.0/30 for BGP.
>
> The problem that I'm having is that traffic from my LAN to 10.254/23
> is not going down the VPN tunnel; it just disappears.  I can see it
> come in on the LAN interface, but I don't see it leave the WAN
> interface as either unencrypted traffic or as esp traffic.  Traffic
> from the router, however, works fine.
>
> ------ BEGIN racoon.conf ------
> log info;
>
> path pre_shared_key "/etc/racoon/psk.txt";
>
>
> listen {
>    adminsock "/var/run/racoon/racoon.sock" "root" "operator" 0660;
>    isakmp <MY_IP>
> }
>
> timer {
>    counter 5;
>    interval 20 sec;
>    persend 1;
>    phase1 30 sec;
>    phase2 15 sec;
> }
>
> remote anonymous {
>    exchange_mode main,aggressive,base;
>    lifetime time 28800 sec;
>    proposal_check obey;
>    dpd_delay 10;
>    dpd_retry 10;
>    dpd_maxfail 3;
>    esp_frag 1396;
>    proposal {
>        encryption_algorithm aes;
>        hash_algorithm sha1;
>        authentication_method pre_shared_key;
>        dh_group 2;
>    }
> }
>
> sainfo anonymous {
>    authentication_algorithm hmac_sha1;
>    encryption_algorithm aes;
>    lifetime time 3600 seconds;
>    compression_algorithm deflate;
>    pfs_group 2;
> }
>
> ------- END racoon.conf -------
>
> ------ BEGIN setkey.conf -----
> flush;
> spdflush;
>
>
> spdadd 169.254.255.1 169.254.255.2 any -P in ipsec
>    esp/tunnel/REMOTE_IP-MY_IP/unique;
> spdadd 169.254.255.2 169.254.255.1 any -P out ipsec
>    esp/tunnel/MY_IP-REMOTE_IP/unique;
>
> spdadd 10.254.0.0/23 0.0.0.0/0 any -P in ipsec
>    esp/tunnel/REMOTE_IP-MY_IP/unique;
> spdadd 0.0.0.0/0 10.254.0.0/23 any -P fwd ipsec
>    esp/tunnel/MY_IP-REMOTE_IP/unique;
> spdadd 0.0.0.0/0 10.254.0.0/23 any -P out ipsec
>    esp/tunnel/MY_IP-REMOTE_IP/unique;
>
> spdadd 0.0.0.0/0 0.0.0.0/0 254 -P in ipsec
>    esp/tunnel/REMOTE_IP-MY_IP/unique;
> spdadd 0.0.0.0/0 0.0.0.0/0 254 -P out ipsec
>    esp/tunnel/MY_IP-REMOTE_IP/unique;
>
> spdadd 0.0.0.0/0 0.0.0.0/0 tcp -P in none;
> spdadd 0.0.0.0/0 0.0.0.0/0 tcp -P out none;
> spdadd 0.0.0.0/0 0.0.0.0/0 udp -P in none;
> spdadd 0.0.0.0/0 0.0.0.0/0 udp -P out none;
>
> ----- END setkey.conf ----
>
> There are two things that are potentially funky with this config that
> I'm not proud of (and which might be part of my problem).
> When racoon goes to phase2 negotiation, it looks for an SPD with 0/0 -
> 0/0 [any].  I've installed an SPD of 0/0 - 0/0 [254] in order to
> make racoon happy.  This isn't a problem for me as I don't have any
> traffic using IP protocol #254 (obviously).  The other thing is that
> I'm explicitly adding a fwd rule; that was my effort to try to fix
> my problem (it didn't help).  Beyond that, the rest of the rules seem
> fairly straightforward.
>
> Further, when I initially connect the VPN, I see racoon do an SA
> negotiation for the 0/0 rules.  When I start quagga, I see it do an SA
> for the 169.254... rules.  If I ping a remote machine from the
> routers, I see it do an SA for the 10.254.0.0/23 rules.  But if I ping
> from something on my LAN there's no negotiation (this is true whether
> I ping from the router first or not).
>
> Here's what I've double checked:
> 1) iptables nat table has rules to ACCEPT 10.254.0.0/23 destined
> traffic to prevent it from being MASQUERADE'd (which I see counters
> for when I ping from the router)
> 2) iptables (main) table has FORWARD rules to ACCEPT 10.254.0.0/23
> destined traffic (which I never see counters for)
> 3) IP forwarding is enabled (as this router is happily forwarding
> other traffic to-from the LAN to the internet)
>
> It seems like this is an issue with an SA not getting found for
> forwarding traffic and the kernel silently dropping the packet.  How
> do I debug this?
>
> -A
>

* [PATCH 2/3] IPVS: make friends with nf_conntrack
From: Hannes Eder @ 2009-09-02 14:39 UTC (permalink / raw)
  To: lvs-devel
  Cc: linux-kernel, netdev, netfilter-devel, Fabien Duchêne,
	Jan Engelhardt, Jean-Luc Fortemaison, Julian Anastasov,
	Julius Volz, Laurent Grawet, Patrick McHardy, Simon Horman,
	Wensong Zhang
In-Reply-To: <20090902101417.11561.45663.stgit@jazzy.zrh.corp.google.com>

Update the nf_conntrack tuple in reply direction, as we will see
traffic from the real server (RIP) to the client (CIP).  Once this is
done we can use netfilter's SNAT in POSTROUTING, especially with
xt_ipvs, to do source NAT, e.g.:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 8080 \
> -j SNAT --to-source 192.168.10.10

Signed-off-by: Hannes Eder <heder@google.com>
---
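
To make the reply-tuple update easier to picture, here is a minimal sketch
using simplified, hypothetical types (struct demo_endpoint, struct demo_tuple,
struct demo_conn); the kernel's nf_conn and nf_conntrack_tuple are
considerably richer.  It only models the idea behind ip_vs_update_conntrack():
for a not-yet-confirmed entry, the source of the reply tuple is changed from
the VIP to the real server, while the original direction stays CIP->VIP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Build an IPv4 address from dotted-quad parts (byte order is irrelevant
 * for this model, only equality is tested). */
#define DEMO_IP(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical, simplified IPv4-only endpoint and tuple. */
struct demo_endpoint {
	uint32_t addr;
	uint16_t port;
};

struct demo_tuple {
	struct demo_endpoint src, dst;
};

struct demo_conn {
	struct demo_tuple original;	/* CIP -> VIP */
	struct demo_tuple reply;	/* initially VIP -> CIP */
	bool confirmed;			/* already in the hash table? */
};

/* A new entry starts with the reply tuple mirroring the original one. */
static struct demo_conn demo_new_conn(struct demo_endpoint client,
				      struct demo_endpoint vip)
{
	struct demo_conn ct;

	ct.original.src = client;
	ct.original.dst = vip;
	ct.reply.src = vip;
	ct.reply.dst = client;
	ct.confirmed = false;
	return ct;
}

/* Model of the update: confirmed entries are left alone; otherwise the
 * expected reply source becomes the real server (address and port). */
static bool demo_update_reply(struct demo_conn *ct,
			      struct demo_endpoint real_server)
{
	assert(ct != NULL);

	if (ct->confirmed)
		return false;

	ct->reply.src = real_server;
	return true;
}
```

With the reply tuple rewritten this way, packets from the real server back to
the client belong to the same tracked connection, which is what allows a later
SNAT rule to apply to them.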

 net/netfilter/ipvs/Kconfig      |    2 +-
 net/netfilter/ipvs/ip_vs_core.c |   36 ------------------------------------
 net/netfilter/ipvs/ip_vs_xmit.c |   27 +++++++++++++++++++++++++++
 3 files changed, 28 insertions(+), 37 deletions(-)

diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index 79a6980..fca5379 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -3,7 +3,7 @@
 #
 menuconfig IP_VS
 	tristate "IP virtual server support"
-	depends on NET && INET && NETFILTER
+	depends on NET && INET && NETFILTER && NF_CONNTRACK
 	---help---
 	  IP Virtual Server support will let you build a high-performance
 	  virtual server based on cluster of two or more real servers. This
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b227750..27bd002 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -521,26 +521,6 @@ int ip_vs_leave(struct ip_vs_service *svc, struct sk_buff *skb,
 	return NF_DROP;
 }
 
-
-/*
- *      It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
- *      chain, and is used for VS/NAT.
- *      It detects packets for VS/NAT connections and sends the packets
- *      immediately. This can avoid that iptable_nat mangles the packets
- *      for VS/NAT.
- */
-static unsigned int ip_vs_post_routing(unsigned int hooknum,
-				       struct sk_buff *skb,
-				       const struct net_device *in,
-				       const struct net_device *out,
-				       int (*okfn)(struct sk_buff *))
-{
-	if (!skb->ipvs_property)
-		return NF_ACCEPT;
-	/* The packet was sent from IPVS, exit this chain */
-	return NF_STOP;
-}
-
 __sum16 ip_vs_checksum_complete(struct sk_buff *skb, int offset)
 {
 	return csum_fold(skb_checksum(skb, offset, skb->len - offset, 0));
@@ -1431,14 +1411,6 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP_PRI_NAT_SRC-1,
-	},
 #ifdef CONFIG_IP_VS_IPV6
 	/* After packet filtering, forward packet through VS/DR, VS/TUN,
 	 * or VS/NAT(change destination), so that filtering rules can be
@@ -1467,14 +1439,6 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET6,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP6_PRI_NAT_SRC-1,
-	},
 #endif
 };
 
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 30b3189..fc7d6a4 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -27,6 +27,7 @@
 #include <net/ip6_route.h>
 #include <linux/icmpv6.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
 #include <linux/netfilter_ipv4.h>
 
 #include <net/ip_vs.h>
@@ -347,6 +348,28 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 }
 #endif
 
+static void
+ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp)
+{
+	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
+
+	if (ct == NULL || ct == &nf_conntrack_untracked ||
+	    nf_ct_is_confirmed(ct))
+		return;
+
+	/*
+	 * The connection is not yet in the hashtable, so we update it.
+	 * CIP->VIP will remain the same, so leave the tuple in
+	 * IP_CT_DIR_ORIGINAL untouched.  When the reply comes back from the
+	 * real-server we will see RIP->CIP.
+	 */
+	ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u3 = cp->daddr;
+	/*
+	 * This will also take care of UDP and other protocols.
+	 */
+	ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u.tcp.port = cp->dport;
+}
+
 /*
  *      NAT transmitter (only for outside-to-inside nat forwarding)
  *      Not used for related ICMP
@@ -402,6 +425,8 @@ ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */
@@ -478,6 +503,8 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */

* [PATCH 1/3] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Hannes Eder @ 2009-09-02 14:39 UTC (permalink / raw)
  To: lvs-devel
  Cc: linux-kernel, netdev, netfilter-devel, Fabien Duchêne,
	Jan Engelhardt, Jean-Luc Fortemaison, Julian Anastasov,
	Julius Volz, Laurent Grawet, Patrick McHardy, Simon Horman,
	Wensong Zhang
In-Reply-To: <20090902101417.11561.45663.stgit@jazzy.zrh.corp.google.com>

This implements the kernel-space side of the netfilter matcher
xt_ipvs.

Signed-off-by: Hannes Eder <heder@google.com>
---
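
The per-property checks in ipvs_mt() all follow the same pattern: compare the
packet's value with the rule's value, then XOR the result with the negated
invert bit to decide whether the rule fails.  Below is a small sketch of that
pattern together with the masked IPv4 comparison from ipvs_mt_addrcmp().  The
names demo_is_mismatch() and demo_addr_matches() are hypothetical; the
expressions themselves are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Masked IPv4 comparison, the same expression ipvs_mt_addrcmp() uses for
 * NFPROTO_IPV4.  It works in any byte order as long as all three values
 * use the same one. */
static bool demo_addr_matches(uint32_t kaddr, uint32_t uaddr, uint32_t umask)
{
	return ((kaddr ^ uaddr) & umask) == 0;
}

/* True when the rule must NOT match on this property:
 *   equal, not inverted     -> no mismatch
 *   not equal, not inverted -> mismatch
 *   equal, inverted         -> mismatch
 *   not equal, inverted     -> no mismatch
 */
static bool demo_is_mismatch(bool equal, bool inverted)
{
	return equal ^ !inverted;
}
```

Note that the --vdir check in the patch is written slightly differently (it
XORs with `!!(invert & XT_IPVS_DIR)`), because there the invert bit encodes
the REPLY direction rather than a negation.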

 include/linux/netfilter/xt_ipvs.h |   23 +++++
 net/netfilter/Kconfig             |    9 ++
 net/netfilter/Makefile            |    1 
 net/netfilter/ipvs/ip_vs_proto.c  |    1 
 net/netfilter/xt_ipvs.c           |  183 +++++++++++++++++++++++++++++++++++++
 5 files changed, 217 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netfilter/xt_ipvs.h
 create mode 100644 net/netfilter/xt_ipvs.c

diff --git a/include/linux/netfilter/xt_ipvs.h b/include/linux/netfilter/xt_ipvs.h
new file mode 100644
index 0000000..eb09759
--- /dev/null
+++ b/include/linux/netfilter/xt_ipvs.h
@@ -0,0 +1,23 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H 1
+
+#define XT_IPVS_IPVS_PROPERTY	0x01 /* this is implied by all other options */
+#define XT_IPVS_PROTO		0x02
+#define XT_IPVS_VADDR		0x04
+#define XT_IPVS_VPORT		0x08
+#define XT_IPVS_DIR		0x10
+#define XT_IPVS_METHOD		0x20
+#define XT_IPVS_MASK		(0x40 - 1)
+#define XT_IPVS_ONCE_MASK	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
+
+struct xt_ipvs {
+	union nf_inet_addr	vaddr, vmask;
+	__be16			vport;
+	__u16			l4proto;
+	__u16			fwd_method;
+
+	__u8			invert;
+	__u8			bitmask;
+};
+
+#endif /* _XT_IPVS_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 634d14a..fc35bd6 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -678,6 +678,15 @@ config NETFILTER_XT_MATCH_IPRANGE
 
 	If unsure, say M.
 
+config NETFILTER_XT_MATCH_IPVS
+	tristate '"ipvs" match support'
+	depends on IP_VS
+	depends on NETFILTER_ADVANCED
+	help
+	  This option allows you to match against IPVS properties of a packet.
+
+	  If unsure, say N.
+
 config NETFILTER_XT_MATCH_LENGTH
 	tristate '"length" match support'
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 49f62ee..ff95372 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_HASHLIMIT) += xt_hashlimit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_HELPER) += xt_helper.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_HL) += xt_hl.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_IPRANGE) += xt_iprange.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_IPVS) += xt_ipvs.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_LENGTH) += xt_length.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_LIMIT) += xt_limit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_MAC) += xt_mac.o
diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
index 3e76716..db083c3 100644
--- a/net/netfilter/ipvs/ip_vs_proto.c
+++ b/net/netfilter/ipvs/ip_vs_proto.c
@@ -97,6 +97,7 @@ struct ip_vs_protocol * ip_vs_proto_get(unsigned short proto)
 
 	return NULL;
 }
+EXPORT_SYMBOL(ip_vs_proto_get);
 
 
 /*
diff --git a/net/netfilter/xt_ipvs.c b/net/netfilter/xt_ipvs.c
new file mode 100644
index 0000000..579b053
--- /dev/null
+++ b/net/netfilter/xt_ipvs.c
@@ -0,0 +1,183 @@
+/*
+ *	xt_ipvs - kernel module to match IPVS connection properties
+ *
+ *	Author: Hannes Eder <heder@google.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#ifdef CONFIG_IP_VS_IPV6
+#include <net/ipv6.h>
+#endif
+#include <linux/ip_vs.h>
+#include <linux/types.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_ipvs.h>
+#include <net/netfilter/nf_conntrack.h>
+
+#include <net/ip_vs.h>
+
+MODULE_AUTHOR("Hannes Eder <heder@google.com>");
+MODULE_DESCRIPTION("Xtables: match IPVS connection properties");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("ipt_ipvs");
+MODULE_ALIAS("ip6t_ipvs");
+
+/* borrowed from xt_conntrack */
+static bool ipvs_mt_addrcmp(const union nf_inet_addr *kaddr,
+			    const union nf_inet_addr *uaddr,
+			    const union nf_inet_addr *umask,
+			    unsigned int l3proto)
+{
+	if (l3proto == NFPROTO_IPV4)
+		return ((kaddr->ip ^ uaddr->ip) & umask->ip) == 0;
+#ifdef CONFIG_IP_VS_IPV6
+	else if (l3proto == NFPROTO_IPV6)
+		return ipv6_masked_addr_cmp(&kaddr->in6, &umask->in6,
+		       &uaddr->in6) == 0;
+#endif
+	else
+		return false;
+}
+
+static bool ipvs_mt(const struct sk_buff *skb, const struct xt_match_param *par)
+{
+	const struct xt_ipvs *data = par->matchinfo;
+	const u_int8_t family = par->family;
+	struct ip_vs_iphdr iph;
+	struct ip_vs_protocol *pp;
+	struct ip_vs_conn *cp;
+	int af;
+	bool match = true;
+
+	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+		match = skb->ipvs_property ^
+			!!(data->invert & XT_IPVS_IPVS_PROPERTY);
+		goto out;
+	}
+
+	/* other flags than XT_IPVS_IPVS_PROPERTY are set */
+	if (!skb->ipvs_property) {
+		match = false;
+		goto out;
+	}
+
+	switch (skb->protocol) {
+	case  htons(ETH_P_IP):
+		af = AF_INET;
+		break;
+#ifdef CONFIG_IP_VS_IPV6
+	case  htons(ETH_P_IPV6):
+		af = AF_INET6;
+		break;
+#endif
+	default:
+		match = false;
+		goto out;
+	}
+
+	ip_vs_fill_iphdr(af, skb_network_header(skb), &iph);
+
+	if (data->bitmask & XT_IPVS_PROTO)
+		if ((iph.protocol == data->l4proto) ^
+		    !(data->invert & XT_IPVS_PROTO)) {
+			match = false;
+			goto out;
+		}
+
+	pp = ip_vs_proto_get(iph.protocol);
+	if (unlikely(!pp)) {
+		match = false;
+		goto out;
+	}
+
+	/*
+	 * Check if the packet belongs to an existing entry
+	 */
+	cp = pp->conn_out_get(af, skb, pp, &iph, iph.len, 1 /* inverse */);
+	if (unlikely(cp == NULL)) {
+		match = false;
+		goto out;
+	}
+
+	/*
+	 * We found a connection, i.e. cp != NULL, make sure to call
+	 * __ip_vs_conn_put before returning.  In our case jump to out_put_cp.
+	 */
+
+	if (data->bitmask & XT_IPVS_VPORT)
+		if ((cp->vport == data->vport) ^
+		    !(data->invert & XT_IPVS_VPORT)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_DIR) {
+		enum ip_conntrack_info ctinfo;
+		struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+
+		if (ct == NULL || ct == &nf_conntrack_untracked) {
+			match = false;
+			goto out_put_cp;
+		}
+
+		if ((ctinfo >= IP_CT_IS_REPLY) ^
+		    !!(data->invert & XT_IPVS_DIR)) {
+			match = false;
+			goto out_put_cp;
+		}
+	}
+
+	if (data->bitmask & XT_IPVS_METHOD)
+		if (((cp->flags & IP_VS_CONN_F_FWD_MASK) == data->fwd_method) ^
+		    !(data->invert & XT_IPVS_METHOD)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_VADDR) {
+		if (af != family) {
+			match = false;
+			goto out_put_cp;
+		}
+
+		if (ipvs_mt_addrcmp(&cp->vaddr, &data->vaddr,
+				    &data->vmask, af) ^
+		    !(data->invert & XT_IPVS_VADDR)) {
+			match = false;
+			goto out_put_cp;
+		}
+	}
+
+out_put_cp:
+	__ip_vs_conn_put(cp);
+out:
+	pr_debug("match=%d\n", match);
+	return match;
+}
+
+static struct xt_match xt_ipvs_mt_reg __read_mostly = {
+	.name       = "ipvs",
+	.revision   = 0,
+	.family     = NFPROTO_UNSPEC,
+	.match      = ipvs_mt,
+	.matchsize  = sizeof(struct xt_ipvs),
+	.me         = THIS_MODULE,
+};
+
+static int __init ipvs_mt_init(void)
+{
+	return xt_register_match(&xt_ipvs_mt_reg);
+}
+
+static void __exit ipvs_mt_exit(void)
+{
+	xt_unregister_match(&xt_ipvs_mt_reg);
+}
+
+module_init(ipvs_mt_init);
+module_exit(ipvs_mt_exit);
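
Two idioms recur in the match function above: the masked address
comparison borrowed from xt_conntrack, and the XOR of a raw comparison
with the rule's invert bit. The following is a minimal user-space sketch
of both, not part of the patch; the helper names masked_eq_v4 and
rule_matches are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * IPv4 masked compare: true when a and b agree on every bit set in mask.
 * Mirrors the expression ((kaddr->ip ^ uaddr->ip) & umask->ip) == 0.
 * The result does not depend on byte order as long as all three values
 * use the same one.
 */
static bool masked_eq_v4(uint32_t a, uint32_t b, uint32_t mask)
{
	return ((a ^ b) & mask) == 0;
}

/*
 * Combine a raw comparison result with an optional invert flag.
 * The rule matches when exactly one of the two is true. The patch
 * tests the complementary condition, hit ^ !invert, to detect a
 * mismatch and bail out early.
 */
static bool rule_matches(bool hit, bool invert)
{
	return hit != invert;
}
```

With a /24 mask, 192.168.100.30 and 192.168.100.255 compare equal while
192.168.100.30 and 192.168.101.30 do not; a hit with the invert bit set
is reported as a non-match.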


^ permalink raw reply related

* [PATCH 0/3] IPVS full NAT support + netfilter 'ipvs' match support
From: Hannes Eder @ 2009-09-02 14:38 UTC (permalink / raw)
  To: lvs-devel
  Cc: linux-kernel, netdev, netfilter-devel, Fabien Duchêne,
	Jan Engelhardt, Jean-Luc Fortemaison, Julian Anastasov,
	Julius Volz, Laurent Grawet, Patrick McHardy, Simon Horman,
	Wensong Zhang

The following series implements full NAT support for IPVS.  The
approach is via a minimal change to IPVS (make friends with
nf_conntrack) and adding a netfilter matcher, kernel- and user-space
part, i.e. xt_ipvs and libxt_ipvs.

Example usage:

% ipvsadm -A -t 192.168.100.30:8080 -s rr
% ipvsadm -a -t 192.168.100.30:8080 -r 192.168.10.20:8080 -m
# ...

# Source NAT for VIP 192.168.100.30:8080
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 8080 \
> -j SNAT --to-source 192.168.10.10


Changes to the linux kernel (rebased to next-20090831):

Hannes Eder (2):
      netfilter: xt_ipvs (netfilter matcher for IPVS)
      IPVS: make friends with nf_conntrack


 include/linux/netfilter/xt_ipvs.h |   23 +++++
 net/netfilter/Kconfig             |    9 ++
 net/netfilter/Makefile            |    1 
 net/netfilter/ipvs/Kconfig        |    2 
 net/netfilter/ipvs/ip_vs_core.c   |   36 -------
 net/netfilter/ipvs/ip_vs_proto.c  |    1 
 net/netfilter/ipvs/ip_vs_xmit.c   |   27 +++++
 net/netfilter/xt_ipvs.c           |  183 +++++++++++++++++++++++++++++++++++++
 8 files changed, 245 insertions(+), 37 deletions(-)
 create mode 100644 include/linux/netfilter/xt_ipvs.h
 create mode 100644 net/netfilter/xt_ipvs.c


Changes to iptables:

Hannes Eder (1):
      libxt_ipvs: user space lib for netfilter matcher xt_ipvs


 configure.ac                      |   11 +
 extensions/libxt_ipvs.c           |  349 +++++++++++++++++++++++++++++++++++++
 extensions/libxt_ipvs.man         |   21 ++
 include/linux/netfilter/xt_ipvs.h |   23 ++
 4 files changed, 401 insertions(+), 3 deletions(-)
 create mode 100644 extensions/libxt_ipvs.c
 create mode 100644 extensions/libxt_ipvs.man
 create mode 100644 include/linux/netfilter/xt_ipvs.h


^ permalink raw reply

* Re: Crypto oops in async_chainiv_do_postponed
From: Brad Bosch @ 2009-09-02 14:23 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linux-crypto, netdev, offbase0
In-Reply-To: <20090901221721.GA1964@gondor.apana.org.au>

(resent due to bounce notification for vger)
Herbert Xu writes:
 > On Tue, Sep 01, 2009 at 10:42:44AM -0500, Brad Bosch wrote:
 > > 
 > > Now, ctx->err may be used by both async_chainiv_postpone_request to
 > > store the return value from skcipher_enqueue_givcrypt and by
 > > async_chainiv_givencrypt_tail to store the return value from
 > > crypto_ablkcipher_encrypt at the same time.  This can cause the
 > > calling function to think async_chainiv_givencrypt has completed its
 > > work, when in fact, the work was deferred.
 > 
 > async_chainiv_postpone_request never touches ctx->err unless
 > it can obtain the INUSE bit lock.  On the other hand, the normal
 > path async_chainiv_givencrypt_tail never relinquishes the INUSE
 > bit until it is finished with ctx->err.

But the above statements are not adequate to demonstrate that your use
of the INUSE flag always prevents a condition where both
async_chainiv_postpone_request and async_chainiv_givencrypt_tail
operate on the same ctx at the same time.  The flaw in your logic may
be that async_chainiv_schedule_work does not have solid assurance that
its thread is the one that holds the INUSE bit when it calls
clear_bit.

I seem to have trouble getting the details right in describing a path
that causes both uses of ctx->err to happen at the same time.  Let me
try again.

Assume the worker thread is executing between the dequeue in
async_chainiv_do_postponed and the clear_bit call in
async_chainiv_schedule_work.  Further assume that we are processing
the last item on the queue, so during this time ctx->queue.qlen =
0.

Meanwhile, three threads enter async_chainiv_givencrypt for the same
ctx at about the same time.

Thread one calls test_and_set_bit which returns 1 and calls
async_chainiv_postpone_request but suppose it has not yet enqueued.
Now INUSE is set and qlen=0.

Next, the worker thread calls clear_bit in async_chainiv_schedule_work
but it is interrupted before it can call test_and_set_bit.  Now INUSE
is clear and qlen=0.

The test_and_set_bit in thread two is called at this moment and
returns 0 and then calls async_chainiv_givencrypt_tail.  Now INUSE is
set and qlen=0.

Thread one now locks the ctx and calls skcipher_enqueue_givcrypt and
unlocks.  Now INUSE is set and qlen=1.

Thread three calls test_and_set_bit which returns 1 and then it clears
INUSE since qlen=1 and it calls postpone with INUSE clear and qlen=1.

Now thread three will use ctx->err to hold the return value of
skcipher_enqueue_givcrypt at the same time as thread two uses ctx->err
to hold the return value of crypto_ablkcipher_encrypt!

Did I make a mistake above?  I suspect more bad things can happen as
well in this scenario, but I'm just focusing on the use of ctx->err here.
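
The interleaving above can be replayed deterministically in a small
single-threaded model. This is only a sketch of the steps exactly as
described in this message, not of the chainiv source itself: struct
ctx_model, model_test_and_set and replay_interleaving are hypothetical
names, and plain fields stand in for the INUSE bit and ctx->queue.qlen.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two pieces of state tracked above. */
struct ctx_model {
	bool inuse;	/* models the INUSE bit */
	int qlen;	/* models ctx->queue.qlen */
};

/* State at the end of the replay. */
struct replay_result {
	bool inuse;
	int qlen;
	bool thread_two_thinks_it_owns;
};

/* Set the flag and report its previous value, like test_and_set_bit. */
static bool model_test_and_set(struct ctx_model *c)
{
	bool old = c->inuse;

	c->inuse = true;
	return old;
}

static struct replay_result replay_interleaving(void)
{
	/* The worker holds INUSE and has just dequeued the last request. */
	struct ctx_model c = { .inuse = true, .qlen = 0 };
	struct replay_result r = { false, 0, false };

	/* Thread one: the bit is already set, so it heads for postpone. */
	(void)model_test_and_set(&c);

	/* Worker: clear_bit, then interrupted before its test_and_set_bit. */
	c.inuse = false;

	/* Thread two: wins the bit and enters the givencrypt tail path. */
	r.thread_two_thinks_it_owns = !model_test_and_set(&c);

	/* Thread one: enqueues its request under the lock. */
	c.qlen++;

	/*
	 * Thread three: sees the bit set and, per the description above,
	 * clears INUSE because qlen is non-zero, then postpones.
	 */
	if (model_test_and_set(&c) && c.qlen)
		c.inuse = false;

	r.inuse = c.inuse;
	r.qlen = c.qlen;
	return r;
}
```

At the end of the replay the model has INUSE clear and qlen=1 while
thread two still believes it owns the bit, which is the window in which
two writers could touch ctx->err.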

 > 
 > Please let me know whether it actually fixes your problem though
 > so I can get this upstream.

Unfortunately, the offset problem is not easily reproduced with our
application, so testing long enough to be sure the problem is fixed
(assuming that it was indeed the cause of the oops) may not be
practical.  All I can say at the moment is that I have not seen the
crash since I introduced the two patches I sent you.

Thanks for taking the time to discuss this!

--Brad

^ permalink raw reply

* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
From: Christoph Lameter @ 2009-09-02 18:22 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, sri, dlstevens, netdev, niv, mtk.manpages
In-Reply-To: <20090901.184121.06750444.davem@davemloft.net>

On Tue, 1 Sep 2009, David Miller wrote:

> > 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
> > but still return an OK to user application, to not break them ?
>
> Sounds good.

Great. That was my initial suggestion and it would ensure that no apps
break.

> If we are to make such applications "more sophisticated" such
> converted apps can be indicated simply by their use of IP_RECVERR.

There may be a minor issue here in that IP_RECVERR sometimes sends error
packets that have to be intercepted using special code. Or can those be
simply ignored? If so then I will ask UDP app vendors to use IP_RECVERR.
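
For reference, the "special code" in question is roughly the following
on Linux. This is a minimal sketch that assumes the application only
wants to enable IP_RECVERR and then discard whatever lands on the
socket error queue; whether discarding is acceptable is exactly the
open question. The helper names enable_recverr and drain_errqueue are
made up for illustration.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Ask the kernel to queue extended error reports for this socket. */
static int enable_recverr(int fd)
{
	int on = 1;

	return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}

/*
 * Read and discard everything on the socket error queue without
 * blocking.  Returns the number of queued errors that were dropped,
 * or -1 on an unexpected recvmsg failure.
 */
static int drain_errqueue(int fd)
{
	char payload[512];
	char control[512];
	int drained = 0;

	for (;;) {
		struct iovec iov = {
			.iov_base = payload,
			.iov_len = sizeof(payload),
		};
		struct msghdr msg = { 0 };
		ssize_t n;

		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		n = recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT);
		if (n < 0) {
			/* An empty error queue shows up as EAGAIN. */
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return drained;
			return -1;
		}
		drained++;
	}
}
```

An application that does care about the cause would instead walk the
control messages returned in msg.msg_control rather than dropping them.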

> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
> accurate, and wise.  And he understood all of this 10+ years ago.

His code was just slightly buggy .... ;-)

^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Christoph Lameter @ 2009-09-02 18:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jarek Poplawski, David Miller, Patrick McHardy, netdev
In-Reply-To: <4A9E2CC7.1010103@gmail.com>

On Wed, 2 Sep 2009, Eric Dumazet wrote:

> Same name "eth0" is displayed, that might confuse parsers...
>
> What naming convention should we choose for multiqueue devices ?

eth0/tx<number> ?



^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Christoph Lameter @ 2009-09-02 18:12 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: eric.dumazet, David Miller, Patrick McHardy, netdev
In-Reply-To: <20090902081429.GB4878@ff.dom.local>


On Wed, 2 Sep 2009, Jarek Poplawski wrote:

> I think, tc should've no problem with displaying summary stats of
> multiqueue qdiscs or even all of them separately, as mentioned by
> Patrick. And, maybe I still miss something, but there should be
> nothing special with tc vs. localhost either.

Ok. Can you come up with a patch? net/sched/sch_api.c can likely be
patched with the loop logic that Eric suggested earlier.

^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Christoph Lameter @ 2009-09-02 18:11 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: eric.dumazet, netdev, David Miller, Patrick McHardy
In-Reply-To: <20090902080921.GA4878@ff.dom.local>

On Wed, 2 Sep 2009, Jarek Poplawski wrote:

> Then my humble suggestions would be to reserve more space for most of
> the columns to make it readable not only for scripts when more TX#,
> bytes, packets etc. Users of non-default qdiscs would also miss things
> like: q->ops->id, q->handle, and q->parent at least. Plus, as I
> mentioned earlier, q->qstats.qlen update with q->q.qlen (or using it
> directly) is needed.

Which of those are needed if we just want to focus on statistics? Next rev
will have q->q.qlen.


^ permalink raw reply

* Re: [PATCH net-next-2.6] tc: report informations for multiqueue devices
From: Patrick McHardy @ 2009-09-02 14:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jarkao2, cl, netdev
In-Reply-To: <4A9E7807.2080901@gmail.com>

Eric Dumazet wrote:
> Patrick McHardy wrote:
>> Eric Dumazet wrote:
>>> [PATCH net-next-2.6] tc: report informations for multiqueue devices
>>>
>>> qdisc and classes are not yet displayed by "tc -s -d {qdisc|class} show"
>>> for multiqueue devices.
>>>
>>> We use a new TCA_QINDEX attribute, to report queue index to user space.
>>> iproute2 tc should be changed to eventually display this queue index as in :
>>>
>>> $ tc -s -d qdisc
>>> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>>  Sent 52498 bytes 465 pkt (dropped 0, overlimits 0 requeues 0)
>>>  rate 0bit 0pps backlog 0b 0p requeues 0
>>> qdisc pfifo_fast 0: dev eth0 qindex 1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>>  rate 0bit 0pps backlog 0b 0p requeues 0
>> This might confuse existing userspace since the handle is not unique
>> anymore. libnl f.i. will treat all but the first root qdisc as an
>> update and use it to update the state of the first one. There's also
>> no combined view for applications unaware of multiqueue.
>>
>> Please have a look at the mail I just wrote for some possible ways
>> around this.
> 
> Hum, how can we combine infos on qdisc/class if in the future we allow each queue index
> to have its own qdisc/classes ?
> 
> htb on queue index 0
> cbq on queue index 1

My suggestion was to only dump the statistics in the combined
view and use a virtual qdisc, something like:

qdisc multiqueue 0: dev eth0 root queues 8
  Sent ...
  rate ...

and show each real qdisc as child of this qdisc:

qdisc pfifo_fast <unique handle> dev eth0 parent 0: bands 3 ...
qdisc pfifo_fast <unique handle> dev eth0 parent 0: bands 3 ...

Configuration would be symmetrical to this:

tc qdisc add dev eth0 handle 0: root multiqueue
tc qdisc add dev eth0 handle x: parent 0: pfifo_fast
...

Without the virtual multiqueue qdisc, the root qdisc would simply
be shared among all queues as today.
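
The combined view suggested above amounts to summing the per-queue
counters into the statistics of the virtual qdisc. A minimal user-space
sketch of that aggregation follows; struct queue_stats and
combine_queue_stats are hypothetical names and do not correspond to
kernel structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue counters, loosely modelled on "tc -s" output. */
struct queue_stats {
	uint64_t bytes;
	uint64_t packets;
	uint32_t drops;
};

/* Sum the counters of n per-queue qdiscs into one combined record. */
static struct queue_stats combine_queue_stats(const struct queue_stats *q,
					      size_t n)
{
	struct queue_stats total = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		total.bytes += q[i].bytes;
		total.packets += q[i].packets;
		total.drops += q[i].drops;
	}
	return total;
}
```

Fed the two queues from the example output quoted earlier (52498 bytes
and 465 packets on the first, nothing on the second), the combined
record carries the same totals that a multiqueue-unaware application
would expect to see on the root qdisc.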

> Combining info would lock us and not allow for special configurations.
> Say 
>    macvlan device 0 mapped to queue index 0
>    macvlan device 1 mapped to queue index 1...

Why not?

> For old apps, just give information for queue 0 as we do now, and
> allow kernel to give more information only if new application provided a TCA_INDEX attribute
> in its request ?
> 
> (-1 : all queue indexes,  >=0 for a given queue index)

If we don't combine the information, existing multiqueue unaware
applications will get incorrect information. There's also the
problem of non-unique handles, I think we should encode the queue
index in the handle instead of using a new attribute. This needs
a bit more thought though to avoid clashes with user-defined handles.

^ permalink raw reply

* Re: [PATCH 5/5] net: file_operations should be const
From: John W. Linville @ 2009-09-02 13:53 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, Samuel Ortiz, netdev, linux-wireless
In-Reply-To: <20090902052538.601908379@vyatta.com>

On Tue, Sep 01, 2009 at 10:25:05PM -0700, Stephen Hemminger wrote:
> All instances of file_operations should be const.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

ACK

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: [PATCH 1/5] netdev: drivers should make ethtool_ops const
From: John W. Linville @ 2009-09-02 13:53 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, Roland Dreier, netdev, linux-wireless,
	Dhananjay Phadke, David Brownell
In-Reply-To: <20090902052538.276324751@vyatta.com>

On Tue, Sep 01, 2009 at 10:25:01PM -0700, Stephen Hemminger wrote:
> No need to put ethtool_ops in data, they should be const.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

ACK

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: [PATCH net-next-2.6] tc: report informations for multiqueue devices
From: Eric Dumazet @ 2009-09-02 13:49 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, jarkao2, cl, netdev
In-Reply-To: <4A9E708D.5040806@trash.net>

Patrick McHardy wrote:
> Eric Dumazet wrote:
>> [PATCH net-next-2.6] tc: report informations for multiqueue devices
>>
>> qdisc and classes are not yet displayed by "tc -s -d {qdisc|class} show"
>> for multiqueue devices.
>>
>> We use a new TCA_QINDEX attribute, to report queue index to user space.
>> iproute2 tc should be changed to eventually display this queue index as in :
>>
>> $ tc -s -d qdisc
>> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>  Sent 52498 bytes 465 pkt (dropped 0, overlimits 0 requeues 0)
>>  rate 0bit 0pps backlog 0b 0p requeues 0
>> qdisc pfifo_fast 0: dev eth0 qindex 1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>  rate 0bit 0pps backlog 0b 0p requeues 0
> 
> This might confuse existing userspace since the handle is not unique
> anymore. libnl f.i. will treat all but the first root qdisc as an
> update and use it to update the state of the first one. There's also
> no combined view for applications unaware of multiqueue.
> 
> Please have a look at the mail I just wrote for some possible ways
> around this.

Hum, how can we combine infos on qdisc/class if in the future we allow each queue index
to have its own qdisc/classes ?

htb on queue index 0
cbq on queue index 1

Combining info would lock us and not allow for special configurations.
Say 
   macvlan device 0 mapped to queue index 0
   macvlan device 1 mapped to queue index 1...

For old apps, just give information for queue 0 as we do now, and
allow kernel to give more information only if new application provided a TCA_INDEX attribute
in its request ?

(-1 : all queue indexes,  >=0 for a given queue index)

^ permalink raw reply

