netdev.vger.kernel.org archive mirror
* SIOCADDMULTI for unicast broken
@ 2003-01-03 21:46 jamal
  2003-01-04  0:07 ` Donald Becker
  2003-01-04  1:18 ` Jeff Garzik
  0 siblings, 2 replies; 21+ messages in thread
From: jamal @ 2003-01-03 21:46 UTC
  To: Jeff Garzik, Donald Becker; +Cc: netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 441 bytes --]


Some programs require ability to accept packets destined to certain
MAC addresses (in addition to their own).
Example: Jerome Etienne's vrrpd (http://w3.arobas.net/~jetienne/vrrpd/)

The trick is to add unicast addresses via SIOCADDMULTI and accept those
packets when they make their way up the stack.
I think this used to work, no? Donald, any history/comments behind
this?
Patch attached; not very well tested, but it looks safe.

cheers,
jamal
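
For reference, the userspace side of the trick can be sketched as below. This
is a minimal sketch, not vrrpd's actual code: the helper names
fill_hwaddr_req() and add_rx_mac() are made up for illustration, and the
interface name and MAC are whatever the caller passes in. It only shows how
the address is handed to the device via SIOCADDMULTI; whether the stack then
treats matching frames as PACKET_HOST is the open question of this thread.

```c
#define _DEFAULT_SOURCE
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: prepare an ifreq carrying a 6-byte link-layer
 * address for the interface named ifname. */
void fill_hwaddr_req(struct ifreq *ifr, const char *ifname,
                     const unsigned char mac[6])
{
    memset(ifr, 0, sizeof(*ifr));
    strncpy(ifr->ifr_name, ifname, IFNAMSIZ - 1);
    ifr->ifr_hwaddr.sa_family = AF_UNSPEC;
    memcpy(ifr->ifr_hwaddr.sa_data, mac, 6);
}

/* Hypothetical helper: ask the device to also receive frames sent to mac.
 * Returns 0 on success, -1 on error (the ioctl needs network admin rights). */
int add_rx_mac(const char *ifname, const unsigned char mac[6])
{
    struct ifreq ifr;
    int fd, rc;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    fill_hwaddr_req(&ifr, ifname, mac);
    rc = ioctl(fd, SIOCADDMULTI, &ifr);
    close(fd);
    return rc;
}
```

A caller would use something like add_rx_mac("eth0", mac) with the extra
unicast address, and SIOCDELMULTI with the same ifreq to remove it again.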

[-- Attachment #2: Type: TEXT/PLAIN, Size: 1329 bytes --]

--- net/ethernet/eth.c	2003/01/03 18:20:49	1.1
+++ net/ethernet/eth.c	2003/01/03 18:22:05
@@ -148,6 +148,30 @@
 	return 0;
 }
 
+void check_mcast_list(struct sk_buff *skb, struct net_device *dev)
+{
+	struct dev_mc_list *dmi;
+	struct ethhdr *eth;
+			        
+	if (skb->pkt_type != PACKET_OTHERHOST)
+		return;
+
+	eth = skb->mac.ethernet;
+
+	/* may not be necessary to bh_lock - fix later - JHS */
+	spin_lock_bh(&dev->xmit_lock);
+
+	for (dmi = dev->mc_list; dmi != NULL; dmi = dmi->next) {
+		if (dmi->dmi_addrlen == dev->addr_len
+		    && memcmp(dmi->dmi_addr, eth->h_dest, dev->addr_len) == 0) {
+			skb->pkt_type = PACKET_HOST;
+			break;
+		}
+	}
+
+	spin_unlock_bh(&dev->xmit_lock);
+}
+
 
 /*
  *	Determine the packet's protocol ID. The rule here is that we 
@@ -182,8 +206,14 @@
 	 
 	else if(1 /*dev->flags&IFF_PROMISC*/)
 	{
-		if(memcmp(eth->h_dest,dev->dev_addr, ETH_ALEN))
-			skb->pkt_type=PACKET_OTHERHOST;
+		if(memcmp(eth->h_dest,dev->dev_addr, ETH_ALEN)) {
+			skb->pkt_type = PACKET_OTHERHOST;
+			/* Override PACKET_OTHERHOST if the destination MAC
+			 * appears in our mcast list. This allows several
+			 * additional receive MACs, added via SIOCADDMULTI,
+			 * on the device. */
+			check_mcast_list(skb,dev);
+		}
 	}
 	
 	if (ntohs(eth->h_proto) >= 1536)
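
The receive-side decision the patch makes can be modelled in userspace as
follows. This is only a sketch under stated assumptions: struct rx_dev,
enum pkt_type and classify_unicast() are invented stand-ins for
struct net_device, skb->pkt_type and the eth_type_trans() path, and the
multicast/broadcast cases and the locking are left out entirely.

```c
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6

/* Stand-ins for PACKET_HOST / PACKET_OTHERHOST. */
enum pkt_type { PKT_HOST, PKT_OTHERHOST };

/* Hypothetical model of a device: its own address plus a list of
 * extra addresses registered via SIOCADDMULTI. */
struct rx_dev {
    unsigned char dev_addr[MAC_LEN];
    const unsigned char (*extra)[MAC_LEN];
    size_t n_extra;
};

/* Classify a unicast destination: our own address first, then the
 * extra list, mirroring the order used in the patch. */
enum pkt_type classify_unicast(const struct rx_dev *dev,
                               const unsigned char *dest)
{
    size_t i;

    if (memcmp(dest, dev->dev_addr, MAC_LEN) == 0)
        return PKT_HOST;
    for (i = 0; i < dev->n_extra; i++) {
        if (memcmp(dest, dev->extra[i], MAC_LEN) == 0)
            return PKT_HOST;
    }
    return PKT_OTHERHOST;
}
```

The linear scan runs only for frames that did not match the device address,
which is the per-packet cost the patch adds.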

* Re: SIOCADDMULTI for unicast broken
  2003-01-03 21:46 SIOCADDMULTI for unicast broken jamal
@ 2003-01-04  0:07 ` Donald Becker
  2003-01-04  1:18 ` Jeff Garzik
  1 sibling, 0 replies; 21+ messages in thread
From: Donald Becker @ 2003-01-04  0:07 UTC
  To: jamal; +Cc: Jeff Garzik, netdev

On Fri, 3 Jan 2003, jamal wrote:

> Subject: SIOCADDMULTI for unicast broken
>
> Some programs require ability to accept packets destined to certain
> MAC addresses (in addition to their own).
> Example: Jerome Etienne's vrrpd (http://w3.arobas.net/~jetienne/vrrpd/)
> 
> The trick is to add unicast addresses via SIOCADDMULTI and accept those
> packets when they make their way up the stack.
> I think this used to work, no? Donald, any history/comments behind
> this?

This is a very specialized requirement, so specialized that it should
not be added as general-purpose requirement for drivers or the network
stack.

This capability was supported as a special case for the Tulip driver,
and then only for the real 21*4* chips that had the hardware CAM capable
of matching up to 16 destination addresses.

A few other chips support matching unicast addresses with the multicast
filter, but there is the general problem of false-accepts and
chip-specific quirks that must be dealt with.

Once again: this is a very specialized thing.  Of the few people that
think they need the capability, most are wrong.
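
The false-accept problem with hash filters can be illustrated with a small
userspace sketch. It assumes a hypothetical chip with a 64-bit hash filter
indexed by the top 6 bits of the Ethernet CRC-32; real chips differ in table
size and bit selection, and the names ether_crc32(), hash_bucket(),
filter_add(), filter_accepts() and count_false_accepts() are made up here.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320. */
uint32_t ether_crc32(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Assumption: the bucket is the top 6 bits of the CRC (64 buckets). */
unsigned hash_bucket(const unsigned char mac[6])
{
    return (unsigned)(ether_crc32(mac, 6) >> 26);
}

void filter_add(uint64_t *filter, const unsigned char mac[6])
{
    *filter |= (uint64_t)1 << hash_bucket(mac);
}

int filter_accepts(uint64_t filter, const unsigned char mac[6])
{
    return (int)((filter >> hash_bucket(mac)) & 1u);
}

/* Count addresses that differ from mac only in the last byte and
 * still pass the filter. */
int count_false_accepts(uint64_t filter, const unsigned char mac[6])
{
    unsigned char probe[6];
    int i, b, hits = 0;

    for (i = 0; i < 6; i++)
        probe[i] = mac[i];
    for (b = 0; b < 256; b++) {
        probe[5] = (unsigned char)b;
        if (b != mac[5] && filter_accepts(filter, probe))
            hits++;
    }
    return hits;
}
```

With only one address programmed, some of the 255 neighbouring addresses
still land in the same bucket, so software has to re-check the destination
of every accepted frame anyway.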

-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993

* Re: SIOCADDMULTI for unicast broken
  2003-01-03 21:46 SIOCADDMULTI for unicast broken jamal
  2003-01-04  0:07 ` Donald Becker
@ 2003-01-04  1:18 ` Jeff Garzik
  2003-01-04  1:39   ` Donald Becker
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Garzik @ 2003-01-04  1:18 UTC
  To: jamal; +Cc: Donald Becker, netdev

jamal wrote:
> Some programs require ability to accept packets destined to certain
> MAC addresses (in addition to their own).
> Example: Jerome Etienne's vrrpd (http://w3.arobas.net/~jetienne/vrrpd/)
> 
> The trick is to add unicast addresses via SIOCADDMULTI and accept those
> packets when they make their way up the stack.
> I think this used to work, no? Donald, any history/comments behind
> this?


Over and above Donald's comments, from an interface perspective I think 
this is a bit of a hack, don't you?  :)  Calling an "add-multi" ioctl 
should do precisely that... and only that :)

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  1:18 ` Jeff Garzik
@ 2003-01-04  1:39   ` Donald Becker
  2003-01-04  1:45     ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Donald Becker @ 2003-01-04  1:39 UTC
  To: Jeff Garzik; +Cc: jamal, netdev

On Fri, 3 Jan 2003, Jeff Garzik wrote:
> jamal wrote:
> > Some programs require ability to accept packets destined to certain
> > MAC addresses (in addition to their own).
> > Example: Jerome Etienne's vrrpd (http://w3.arobas.net/~jetienne/vrrpd/)
> > 
> > The trick is to add unicast addresses via SIOCADDMULTI and accept those
> > packets when they make their way up the stack.
> > I think this used to work, no? Donald, any history/comments behind
> > this?
> 
> Over and above Donald's comments, from an interface perspective I think 
> this is a bit of a hack, don't you?  :)  Calling an "add-multi" ioctl 
> should do precisely that... and only that :)

Yes, it is totally a hack, not an interface.

It was
  "If you need this capability for a RESEARCH PROJECT, you can buy this
  specific board and thus not need to modify the kernel or device
  driver. "

You can also find a few people that want to receive specific corrupted
packets, change the meaning of LEDs on a NIC, and do many other strange
things.  But we don't need a defined kernel interface for each one.

Improvement is what you can eliminate or simplify, not adding complexity.


-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  1:39   ` Donald Becker
@ 2003-01-04  1:45     ` Ben Greear
  2003-01-04  1:52       ` Jeff Garzik
  2003-01-04  2:18       ` Donald Becker
  0 siblings, 2 replies; 21+ messages in thread
From: Ben Greear @ 2003-01-04  1:45 UTC
  To: Donald Becker; +Cc: Jeff Garzik, jamal, netdev

Donald Becker wrote:

> It was
>   "If you need this capability for a RESEARCH PROJECT, you can buy this
>   specific board and thus not need to modify the kernel or device
>   driver. "
> 
> You can also find a few people that want to receive specific corrupted
> packets, change the meaning of LEDs on a NIC, and do many other strange
> things.  But we don't need a defined kernel interface for each one.

Just out of curiosity, what is the suggested manner for adding such
back-door hacks as this?  Maybe in a proc file system that the driver
implements?  It would be neat to see various driver-specific features
like this be implemented, and it would be even nicer if they followed
at least some general guideline for how to interface with the rest of
the world...

> 
> Improvement is what you can eliminate or simplify, not adding complexity.

Exposing new features can also be an improvement, though one does
not imply the other ;)

Ben

-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  1:45     ` Ben Greear
@ 2003-01-04  1:52       ` Jeff Garzik
  2003-01-06 15:00         ` Krzysztof Halasa
  2003-01-04  2:18       ` Donald Becker
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Garzik @ 2003-01-04  1:52 UTC
  To: Ben Greear; +Cc: Donald Becker, jamal, netdev

Ben Greear wrote:
> Donald Becker wrote:
> 
>> It was
>>   "If you need this capability for a RESEARCH PROJECT, you can buy this
>>   specific board and thus not need to modify the kernel or device
>>   driver. "
>>
>> You can also find a few people that want to receive specific corrupted
>> packets, change the meaning of LEDs on a NIC, and do many other strange
>> things.  But we don't need a defined kernel interface for each one.
> 
> 
> Just out of curiosity, what is the suggested manner for adding such
> back-door hacks as this?

SIOCDEVPRIVATE is staying around


> Maybe in a proc file system that the driver
> implements?

No!  procfs additions are discouraged.  sysfs in 2.5.x if you _must_ do 
this, but SIOCDEVPRIVATE or just flat out maintaining a kernel patch 
against a stable kernel tree would be much preferred, I think.


> It would be neat to see various driver-specific features
> like this be implemented, and it would be even nicer if they followed
> at least some general guideline for how to interface with the rest of
> the world...


Driver-specific features are by definition just that :)  If you want a 
general guideline, you'll also want a header or helper lib quite often 
to eliminate duplication of code and standardize the interface.

	Jeff

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  1:45     ` Ben Greear
  2003-01-04  1:52       ` Jeff Garzik
@ 2003-01-04  2:18       ` Donald Becker
  2003-01-04  4:11         ` jamal
  1 sibling, 1 reply; 21+ messages in thread
From: Donald Becker @ 2003-01-04  2:18 UTC
  To: Ben Greear; +Cc: Jeff Garzik, jamal, netdev

On Fri, 3 Jan 2003, Ben Greear wrote:
> Donald Becker wrote:
> 
> > It was
> >   "If you need this capability for a RESEARCH PROJECT, you can buy this
> >   specific board and thus not need to modify the kernel or device
> >   driver. "
> > 
> > You can also find a few people that want to receive specific corrupted
> > packets, change the meaning of LEDs on a NIC, and do many other strange
> > things.  But we don't need a defined kernel interface for each one.
> 
> Just out of curiosity, what is the suggested manner for adding such
> back-door hacks as this?  Maybe in a proc file system that the driver
> implements?  It would be neat to see various driver-specific features
> like this be implemented, and it would be even nicer if they followed
> at least some general guideline for how to interface with the rest of
> the world...

The problems are
 - The extensions people (very few people) want are completely unpredictable.
 - Unique features are, well, unique

You might think it's Really Very Important to change the LED meanings on
your NIC.  For instance, if one LED means all Rx traffic and another Rx
accepted you can estimate how much traffic is for you.  And most NICs
provide a way to do this.  But no two are the same, and there is no
general way to describe the semantics.  So it's a capability best
ignored.  (*)

Un

* You can access this and many other capabilities through the MII
  ioctl() interface to vendor specific register, but not as an
  abstracted, hardware-independent feature.
  Why an ioctl() and not /proc?  Exporting MII registers via /proc is
  problematic because of Sticky Bits and clear-on-read semantics.

-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  2:18       ` Donald Becker
@ 2003-01-04  4:11         ` jamal
  2003-01-04  6:33           ` Donald Becker
  2003-01-04  7:32           ` Jeff Garzik
  0 siblings, 2 replies; 21+ messages in thread
From: jamal @ 2003-01-04  4:11 UTC
  To: Donald Becker; +Cc: Ben Greear, Jeff Garzik, netdev



Too many emails to respond to at once.

Q: Is this a hack?
A: Yes, indeed it is. wrong API is the main culprit.

Q: Is this a feature needed by only a few people?
A: No, absolutely not. RFC2338 is one example that needs such a feature
for "aliasing" MAC addresses. RFC2338 is very popular these days for
some reason. I would think any scheme that does HA takeover of another
host would need such a feature.

How common are NICS such as the 21x4x that can be programmed to do
perfect hashing and accept multiple MAC addresses in hardware?
And if this was a commodity feature, what happens to the PACKET_HOST
setting? A netdevice can only have one unicast MAC address.
SIOCDEVPRIVATE does not seem to be the right place to do this.
You still want the ability to do proper RFC2338 and related protocols
even when the h/ware is incapable.
And btw, i didnt even open up the whole can of worms - we also need to
respond back with proper MAC addresses to ARPs and packets sourced with
specific virtual router IPs. This is a separate problem.

cheers,
jamal

PS: the hack credit (for using SIOCADDMULTI/DELMULTI) goes to Jerome
Etienne (this is a guy who never responds to email, probably too busy
unicycling somewhere, so no point in ccing him) - Except it doesnt work
without the patch i posted.
On takeover, he sets the original MAC address as a receiving MAC
via SIOCADDMULTI and the allocated 00-00-5E-00-01-xx MAC to be
the main MAC address.
The weakness is when you want to run multiple virtual routers; each
one requires its own 00-00-5E-00-01-xx MAC.
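
The 00-00-5E-00-01-xx scheme is the RFC2338 virtual router MAC, where the
last octet is the VRID. A minimal sketch of building it (the function name
vrrp_virtual_mac() is made up for illustration):

```c
#include <stdint.h>

/* Build the RFC2338 virtual router MAC 00-00-5E-00-01-{VRID}. */
void vrrp_virtual_mac(uint8_t vrid, unsigned char mac[6])
{
    mac[0] = 0x00;
    mac[1] = 0x00;
    mac[2] = 0x5E;
    mac[3] = 0x00;
    mac[4] = 0x01;
    mac[5] = vrid;   /* one distinct MAC per virtual router */
}
```

Running several virtual routers on one interface therefore means several
such addresses that the NIC and the stack must all accept.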

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  4:11         ` jamal
@ 2003-01-04  6:33           ` Donald Becker
  2003-01-04 17:41             ` jamal
  2003-01-04  7:32           ` Jeff Garzik
  1 sibling, 1 reply; 21+ messages in thread
From: Donald Becker @ 2003-01-04  6:33 UTC
  To: jamal; +Cc: Ben Greear, Jeff Garzik, netdev

On Fri, 3 Jan 2003, jamal wrote:

> Too many emails to respond to at once.
> 
> Q: Is this a hack?
> A: Yes, indeed it is. wrong API is the main culprit.
> 
> Q: Is this a feature needed by only a few people?
> A: No, Absolutely not. RFC2338 is one example that needs

The common way of handling this is unsolicited ARP.

> How common are NICS such as the 21x4x that can be programmed to do
> perfect hashing and accept multiple MAC addresses in hardware?

Not very common.  I mentioned the 21*4* explicitly because few other
common chips implement this feature.
The Digital design implemented it because of DECnet, which is long dead.

> And if this was a commodity feature - what happens to PACKET_HOST
> setting? an netdevice can only have one unicast MAC address.
> SIOCDEVPRIVATE does not seem to be the right place to do this.
> You still wanna have ability to do proper RFC2338 and related protocols
> even when the h/ware is incapable.
> And btw, i didnt even open up the whole can of worms - we also need to
> respond back with proper MAC addresses to ARPs and packets sourced with
> specific virtual router IPs. This is a separate problem.

Yup, a whole can of worms if you want it to be a general feature handled
by the kernel...

> PS: the hack credit (for using SIOCADDMULTI/DELMULTI) goes to Jerome
> Etienne (this is a guy who never responds to email, probably too busy
> unicycling somewhere, so no point in ccing him) - Except it doesnt work
> without the patch i posted.

This has worked with the Tulip driver for many years.  I've pointed it
out to a number of people that requested this as a new feature.


-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  4:11         ` jamal
  2003-01-04  6:33           ` Donald Becker
@ 2003-01-04  7:32           ` Jeff Garzik
  2003-01-04 17:43             ` jamal
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Garzik @ 2003-01-04  7:32 UTC
  To: jamal; +Cc: Donald Becker, Ben Greear, netdev

I wonder if there are any good uses for more advanced RX filtering that 
is beginning to appear.  I could certainly imagine an interface that was 
a more generic RX filtering interface, and [just by accident] happened 
to support existing unicast and multicast rx-mode-related controls.

As vendors stuff features onto cards and try to figure out where is the 
best dividing line between TCP stack acceleration and TCP stack offload, 
it seems to me that recent cards more often than not have nice RX 
filtering capabilities.  If you look at the world through GigE-colored 
glasses, the RX filtering picture gets even better.  There are some fun 
SMP implications with flexible enough RX filtering, for example.

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  6:33           ` Donald Becker
@ 2003-01-04 17:41             ` jamal
  2003-01-04 18:24               ` Donald Becker
  2003-01-04 18:36               ` Julian Anastasov
  0 siblings, 2 replies; 21+ messages in thread
From: jamal @ 2003-01-04 17:41 UTC
  To: Donald Becker; +Cc: Ben Greear, Jeff Garzik, netdev



On Sat, 4 Jan 2003, Donald Becker wrote:

> On Fri, 3 Jan 2003, jamal wrote:
>
> > Too many emails to respond to at once.
> >
> > Q: Is this a hack?
> > A: Yes, indeed it is. wrong API is the main culprit.
> >
> > Q: Is this a feature needed by only a few people?
> > A: No, Absolutely not. RFC2338 is one example that needs
>
> The common way of handling this is unsolicited ARP.
>

unsolicited ARPs on failover are good. You send the arp with
one of the allocated MAC addresses as the source. The hosts
sending data use that MAC address as the dst MAC. Did i miss something
or how do you see the packet making its way up the stack?

> > How common are NICS such as the 21x4x that can be programmed to do
> > perfect hashing and accept multiple MAC addresses in hardware?
>
> Not very common.  I mentioned the 21*4* explicitly because few other
> common chips implement this feature.
> The Digital design implemented it because of DECnet, which is long dead.
>

Side, unrelated question:
for multicast addresses filtered in h/ware: what is the common
practice for handling the corner case where the hardware multicast
filter fills up? Take the example of the tulip using perfect hashing when
all the 16 entries are exhausted and the host still wants to subscribe
to 100 other multicast groups... should we put the nic into promisc
multicast?

> > And if this was a commodity feature - what happens to PACKET_HOST
> > setting? an netdevice can only have one unicast MAC address.
> > SIOCDEVPRIVATE does not seem to be the right place to do this.
> > You still wanna have ability to do proper RFC2338 and related protocols
> > even when the h/ware is incapable.
> > And btw, i didnt even open up the whole can of worms - we also need to
> > respond back with proper MAC addresses to ARPs and packets sourced with
> > specific virtual router IPs. This is a separate problem.
>
> Yup, a whole can of worms if you want it to be a general feature handled
> by the kernel...
>

I think hacking the ARP code to do this would be horrible - one way to do
it is to write a tc module that mangles outgoing ARPs to substitute the src
MAC address based on src IP.

> > PS: the hack credit (for using SIOCADDMULTI/DELMULTI) goes to Jerome
> > Etienne (this is a guy who never responds to email, probably too busy
> > unicycling somewhere, so no point in ccing him) - Except it doesnt work
> > without the patch i posted.
>
> This has worked with the Tulip driver for many years.  I've pointed it
> out to a number of people that requested this as a new feature.
>

didnt know that. I guess i was lucky not to have used the tulip when i
got bitten by this. So, Donald, other than tulip what other NICs can do
this?

cheers,
jamal

* Re: SIOCADDMULTI for unicast broken
  2003-01-04  7:32           ` Jeff Garzik
@ 2003-01-04 17:43             ` jamal
  0 siblings, 0 replies; 21+ messages in thread
From: jamal @ 2003-01-04 17:43 UTC
  To: Jeff Garzik; +Cc: Donald Becker, Ben Greear, netdev



On Sat, 4 Jan 2003, Jeff Garzik wrote:

> I wonder if there are any good uses for more advanced RX filtering that
> is beginning to appear.  I could certainly imagine an interface that was
> a more generic RX filtering interface, and [just by accident] happened
> to support existing unicast and multicast rx-mode-related controls.
>
> As vendors stuff features onto cards and try to figure out where is the
> best dividing line between TCP stack acceleration and TCP stack offload,
> it seems to me that recent cards more often than not have nice RX
> filtering capabilities.  If you look at the world through GigE-colored
> glasses, the RX filtering picture gets even better.  There are some fun
> SMP implications with flexible enough RX filtering, for example.
>

Davem had some nifty ideas on this at least for TOE. I think we ought to
start looking at that direction. Anyone who is working on this please
involve me, i have some ideas (and it would avoid me stressing you
when you output code).

cheers,
jamal

* Re: SIOCADDMULTI for unicast broken
  2003-01-04 17:41             ` jamal
@ 2003-01-04 18:24               ` Donald Becker
  2003-01-04 18:55                 ` jamal
  2003-01-04 18:36               ` Julian Anastasov
  1 sibling, 1 reply; 21+ messages in thread
From: Donald Becker @ 2003-01-04 18:24 UTC
  To: jamal; +Cc: Ben Greear, Jeff Garzik, netdev

On Sat, 4 Jan 2003, jamal wrote:
> On Sat, 4 Jan 2003, Donald Becker wrote:
> > On Fri, 3 Jan 2003, jamal wrote:
> > > Q: Is this a hack?
> > > A: Yes, indeed it is. wrong API is the main culprit.

I disagree: it's a hack usable by the very few people that need it.
Defining an API would add little-used complexity.

> > > How common are NICS such as the 21x4x that can be programmed to do
> > > perfect hashing and accept multiple MAC addresses in hardware?
> >
> > Not very common.  I mentioned the 21*4* explicitly because few other
> > common chips implement this feature.
> > The Digital design implemented it because of DECnet, which is long dead.
> 
> Side, unrelated question:
> for h/ware multicast filtered multicast addresses: What is the common
> practise to handle the corner case where a hardware multicast address
> is filled up.

This is a good example of why an API is a bad idea: there is no general
case.  The Tulip has 16 filter addresses, but...
  Oh, not all chips handled by the Tulip driver.  Macronix does.  Asix
    doesn't. ADMtek doesn't.  PNIC chips do.
  You need one slot for the broadcast address on a few chip versions
    with a bug.  (I do this unconditionally.)
  You cannot use the multicast hash filter.

"Only in the month after the full moon falls on a Tuesday."

> Take an example of the tulip using perfect hashing when
> all the 16 entries are exhausted and the host still wants to subscribe
> to 100 other multicast groups... should we put the nic into promisc
> multicast?

Yes.  This is the DECnet case.
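
The fallback described here can be sketched as a simple mode selection. This
is an illustration only, not the Tulip driver's logic: the enum, the function
name choose_rx_mode() and the choice to reserve two of the 16 slots (own
address plus broadcast) are assumptions.

```c
/* Hypothetical receive modes for a NIC with a small perfect filter. */
enum rx_mode { RX_PERFECT, RX_ALLMULTI, RX_PROMISC };

#define PERFECT_SLOTS  16  /* filter entries, as mentioned in the thread */
#define RESERVED_SLOTS 2   /* assumption: own address + broadcast */

/* Pick a receive mode from the promiscuous flag and the number of
 * extra addresses the host wants to receive. */
enum rx_mode choose_rx_mode(int promisc, int addr_count)
{
    if (promisc)
        return RX_PROMISC;
    if (addr_count > PERFECT_SLOTS - RESERVED_SLOTS)
        return RX_ALLMULTI;   /* filter overflow: accept all multicast */
    return RX_PERFECT;
}
```

Once the filter overflows, the final accept/reject decision moves from the
hardware to the stack.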

> didnt know that. I guess i was lucky not to have used the tulip when i
> got bitten by this. So, Donald, other than tulip what other NICs can do
> this?

Mostly gigabit Ethernet chips.  There are a few FE chips that allow
setting the multicast hash address to also filter physical addresses,
but imperfect filters result in the same extra code as using promiscuous
mode.

-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993

* Re: SIOCADDMULTI for unicast broken
  2003-01-04 17:41             ` jamal
  2003-01-04 18:24               ` Donald Becker
@ 2003-01-04 18:36               ` Julian Anastasov
  2003-01-04 19:04                 ` jamal
  1 sibling, 1 reply; 21+ messages in thread
From: Julian Anastasov @ 2003-01-04 18:36 UTC
  To: jamal; +Cc: Donald Becker, Ben Greear, Jeff Garzik, Alexandre Cassen, netdev


	Hello,

On Sat, 4 Jan 2003, jamal wrote:

> > > And btw, i didnt even open up the whole can of worms - we also need to
> > > respond back with proper MAC addresses to ARPs and packets sourced with
> > > specific virtual router IPs. This is a separate problem.
> >
> > Yup, a whole can of worms if you want it to be a general feature handled
> > by the kernel...
> >
>
> I think hacking the ARP code to do this would be horrible - one way to do
> it is write a tc module that mungles outgoing ARPs to substitute the src
> MAC address based on src IP.

	You can do it with arptables (still not sure how) or with 
arprules+iparp:

# send all our requests from VRIP with VMAC
ip arp add table output from 1.2.3.4 llsrc 00:00:5E:00:01:10

http://www.ssi.bg/~ja/#iparp

	But this is not enough for VRRP. For Linux we need a
way to bind VRIPs to source VMACs, or something of the sort. I'm cc-ing Alexandre
Cassen as he is working on VRRP (http://keepalived.sourceforge.net/).
I hope he can shed some light on the VRRP needs.

> cheers,
> jamal

Regards

--
Julian Anastasov <ja@ssi.bg>

* Re: SIOCADDMULTI for unicast broken
  2003-01-04 18:24               ` Donald Becker
@ 2003-01-04 18:55                 ` jamal
  0 siblings, 0 replies; 21+ messages in thread
From: jamal @ 2003-01-04 18:55 UTC
  To: Donald Becker; +Cc: Ben Greear, Jeff Garzik, netdev



On Sat, 4 Jan 2003, Donald Becker wrote:

> > > On Fri, 3 Jan 2003, jamal wrote:
> > > > Q: Is this a hack?
> > > > A: Yes, indeed it is. wrong API is the main culprit.
>
> I disagree: it's a hack usable by the very few people that need it.
> Defining an API would add little-used complexity.

Somehow i (and people using VRRP or trying to write HSRP-like apps)
need to have multiple MAC addresses accepted by a single NIC. It is
not really a science project requirement; rather, it is needed by
real-world apps which are becoming more common. The way i see it, the
options are:
a) Introduce a new API
b) Reuse an API - considered a hack really
c) Maintain it as a separate patch; needs to be cleaned up a little for
optimization (for example, not all NICs may need this extra per-packet check).

Which one do you see as being reasonable?
You cant say none of the above because people need this feature ;->
You have to present an alternative at least.

>
> > Side, unrelated question:
> > for h/ware multicast filtered multicast addresses: What is the common
> > practise to handle the corner case where a hardware multicast address
> > is filled up.
>
> This is a good example of why an API is a bad idea: there is no general
> case.  The Tulip has 16 filter addresses, but...
>   Oh, not all chips handled by the Tulip driver.  Macronix does.  Asix
>     doesn't. ADMtek doesn't.  PNIC chips do.

ok.

>   You need one slot for the broadcast address on a few chip versions
>     with a bug.  (I do this unconditionally.)

Make sense.

>   You cannot use the multicast hash filter.

Why not?

>
> "Only in the month after the full moon falls on a Tuesday."
>
> > Take an example of the tulip using perfect hashing when
> > all the 16 entries are exhausted and the host still wants to subscribe
> > to 100 other multicast groups... should we put the nic into promisc
> > multicast?
>
> Yes.  This is the DECnet case.

I dont think this is what we do today in general; does the tulip do this
really?

>
> > didnt know that. I guess i was lucky not to have used the tulip when i
> > got bitten by this. So, Donald, other than tulip what other NICs can do
> > this?
>
> Mostly gigabit Ethernet chips.  There are a few FE chips that allow
> setting the multicast hash address to also filter physical addresses,
> but imperfect filters result in the same extra code as using promiscuous
> mode.
>

OK, i guess this explains the imperfect filters; so since GigE cards are
becoming such a commodity item, are you suggesting that to do VRRP
one should get a GigE card or an appropriate FE card?

cheers,
jamal

* Re: SIOCADDMULTI for unicast broken
  2003-01-04 18:36               ` Julian Anastasov
@ 2003-01-04 19:04                 ` jamal
  2003-01-05 11:45                   ` Julian Anastasov
  0 siblings, 1 reply; 21+ messages in thread
From: jamal @ 2003-01-04 19:04 UTC
  To: Julian Anastasov
  Cc: Donald Becker, Ben Greear, Jeff Garzik, Alexandre Cassen, netdev



On Sat, 4 Jan 2003, Julian Anastasov wrote:

>
> 	Hello,
>
> On Sat, 4 Jan 2003, jamal wrote:
>
> >
> > I think hacking the ARP code to do this would be horrible - one way to do
> > it is write a tc module that mungles outgoing ARPs to substitute the src
> > MAC address based on src IP.
>
> 	You can do it with arptables (still not sure how) or with

I havent seen user-space arptables around.

> arprules+iparp:
>
> # send all our requests from VRIP with VMAC
> ip arp add table output from 1.2.3.4 llsrc 00:00:5E:00:01:10
>
> http://www.ssi.bg/~ja/#iparp

I like this concept. This + the patch i posted should resolve the problem
of getting multiple VRIDs on a single interface.
[Although you could do it in a lot less code, maybe 50%, using
some of the tc filter extensions i am working on; also a lot less code
than arptables]

>
> 	But this is not enough for VRRP. For Linux we need a
> way to bind VRIPs to source VMACs or sort of this. I'm cc-ing to Alexandre
> Cassen as he is working on VRRP (http://keepalived.sourceforge.net/).
> I hope he can shed some light on the VRRP needs.
>

I guess there is more than one VRRP implementation on Linux; the one i
played with is from some crazy Frenchman named Jerome Etienne ;->
With the two concepts being addressed, i.e. a patch like the one you have +
the patch i posted, i dont see any other reason for VRRP to be hindered.

cheers,
jamal

* Re: SIOCADDMULTI for unicast broken
  2003-01-04 19:04                 ` jamal
@ 2003-01-05 11:45                   ` Julian Anastasov
  2003-01-06 13:44                     ` jamal
  0 siblings, 1 reply; 21+ messages in thread
From: Julian Anastasov @ 2003-01-05 11:45 UTC
  To: jamal; +Cc: Donald Becker, Ben Greear, Jeff Garzik, Alexandre Cassen, netdev


	Hello,

On Sat, 4 Jan 2003, jamal wrote:

> > 	You can do it with arptables (still not sure how) or with
>
> I havent seen user-space arptables around.

	yes, that is what I mean

> > http://www.ssi.bg/~ja/#iparp
>
> I like this concept. This + the patch i posted should resolve the problem
> of getting multiple VRIDs on a single interface.
> [Although you could do it in a lot less code, maybe 50%, using
> some of the tc filter extensions i am working on; also a lot less code
> than arptables]

	I hope there will be support for altering any bit
in the skb->head - skb->end area, even by using negative offsets
based on skb->nh.raw - this is needed for eth header manipulations.
Maybe something like: ... alter andmask 0xFF00 xormask 0x0023 at -4 ...
i.e. syntax similar to ipchains TOS and u32 match.
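
The semantics of such an "alter" action could look like the sketch below. It
is purely hypothetical: alter16() is an invented name, the buffer stands in
for the skb->head .. skb->end area, nh_off stands in for the position of
skb->nh.raw, and the 16-bit big-endian word size is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Apply value = (value & andmask) ^ xormask to the 16-bit big-endian
 * word at offset rel from the network header. rel may be negative to
 * reach back into the link-layer header.
 * Returns 0 on success, -1 if the word would fall outside the buffer. */
int alter16(unsigned char *head, size_t len, size_t nh_off, long rel,
            uint16_t andmask, uint16_t xormask)
{
    long pos = (long)nh_off + rel;
    uint16_t v;

    if (pos < 0 || (size_t)pos + 2 > len)
        return -1;
    v = (uint16_t)((head[pos] << 8) | head[pos + 1]);
    v = (uint16_t)((v & andmask) ^ xormask);
    head[pos] = (unsigned char)(v >> 8);
    head[pos + 1] = (unsigned char)(v & 0xFF);
    return 0;
}
```

With a 14-byte Ethernet header, offset -4 addresses the last two bytes of
the source MAC, so andmask 0xFF00 xormask 0x0023 would force its final byte
to 0x23.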

	As for VRRP I see it in this way. Note that I'm not a VRRP
fan, I prefer the ARP methods for takeover. Of course, sometimes they
cannot work due to bad non-Linux ARP stack implementations.
As Alexandre noted once, gratuitous ARP should not be slower
than the VRRP talks. It is only that there are bad ARP cache implementations.

1. If a remote host asks for the lladdr of VRIP, tc should modify our
ARP reply: the SMAC in the eth header (using a negative offset) and the
SMAC in the ARP header. This is analogous to:
ip arp add to VRIP llsrc VMAC

2. If our IP stack sends a packet with saddr=VRIP that leads to an ARP
probe being sent from our host, then we should modify the packet in
the same way as in (1). This is analogous to:
ip arp add table output from VRIP llsrc VMAC

3. Replace the src MAC with the proper VMAC for all IP packets with
saddr=VRIP. This could be a job for the neighbour code, but it is
difficult to implement there.

4. Last but not least: the NIC should accept traffic for all VMACs (is
promisc enough when attached to switched hubs?) and eth_type_trans should
maintain a list of MAC aliases. I'm not sure that such a list/hashtable of
MACs should be attached per device - maybe VRRP needs to announce one
MAC through different interfaces? Also think about the bridging
code, which calls eth_type_trans too.

5. Enough from one who doesn't like VRRP :)
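
The MAC alias lookup from point 4 could look roughly like the following userspace sketch. It only illustrates the idea: `classify_unicast`, `PKT_HOST`/`PKT_OTHERHOST` and the flat alias array are hypothetical names, not kernel code, and whether the list lives per device or globally is left open, as in the text above.

```c
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6

enum pkt_class { PKT_HOST, PKT_OTHERHOST };

/*
 * Decide whether a unicast frame is for us. aliases points at n_aliases
 * MAC addresses stored back to back (MAC_LEN bytes each); these stand in
 * for the VMACs a VRRP daemon would have registered.
 */
enum pkt_class classify_unicast(const unsigned char *dest,
				const unsigned char *dev_addr,
				const unsigned char *aliases,
				size_t n_aliases)
{
	size_t i;

	/* The device's own address always matches. */
	if (memcmp(dest, dev_addr, MAC_LEN) == 0)
		return PKT_HOST;

	/* Otherwise accept the frame if it is addressed to any alias. */
	for (i = 0; i < n_aliases; i++) {
		if (memcmp(dest, aliases + i * MAC_LEN, MAC_LEN) == 0)
			return PKT_HOST;
	}
	return PKT_OTHERHOST;
}
```

A linear scan is fine for a handful of VRIDs; a hashtable only starts to matter with many aliases.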

> With the two concepts being addressed, i.e. a patch like the one you have +
> the patch I posted, I don't see any other reason for VRRP to be hindered.

	Not sure, maybe the only remaining one is (3).

> cheers,
> jamal

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: SIOCADDMULTI for unicast broken
  2003-01-05 11:45                   ` Julian Anastasov
@ 2003-01-06 13:44                     ` jamal
  2003-01-06 15:00                       ` Julian Anastasov
  0 siblings, 1 reply; 21+ messages in thread
From: jamal @ 2003-01-06 13:44 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Donald Becker, Ben Greear, Jeff Garzik, Alexandre Cassen, netdev



On Sun, 5 Jan 2003, Julian Anastasov wrote:

>
> 	Hello,
>
> On Sat, 4 Jan 2003, jamal wrote:
>
> > > 	You can do it with arptables (still not sure how) or with
> >
> > I havent seen user-space arptables around.
>
> 	yes, that is what I mean
>
> > > http://www.ssi.bg/~ja/#iparp
> >
> > I like this concept. This + the patch i posted should resolve the problem
> > of getting multiple VRIDs on a single interface.
> > [Although you could do it in a lot less code, maybe 50%, using
> > some of the tc filter extensions i am working on; also a lot less code
> > than arptables]
>
> 	I hope there will be support for altering any bit
> in the skb->head - skb->end area, even by using negative offsets
> based on skb->nh.raw - this is needed for eth header manipulations.
> May be sort of: ... alter andmask 0xFF00 xormask 0x0023 at -4 ...
> i.e. syntax similar to ipchains TOS and u32 match.

I wanted to use u32 as the basis, which means u32-type matching is needed.
Then use vi/sed-type substitution s/OL/V where:
O = offset (from skb->data, could be -ve),
L = length (can't go beyond head or end),
V is a statically configured value (its size can't exceed L). V can also
be computed off something, for example the data at offset O. I am trying to
keep away from situations where L is larger or smaller than sizeof V
so there's no mucking with any of the skb pointers or reallocing etc. In
the next iteration things could change. Note I haven't written this but
will in the near future (so anyone is welcome to hack on it).
I didn't understand your andmask and xormask idea...
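
The fixed-length substitution described above could be sketched in C roughly as follows. This is a userspace illustration only: `struct pkt_view` and `pkt_edit` are hypothetical, with array indices standing in for skb->head, skb->data and skb->end, and the sketch assumes the stricter rule that V must be exactly L bytes long.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb: buf[0] plays the role of skb->head,
 * buf[len] of skb->end, and buf[data] of skb->data. */
struct pkt_view {
	unsigned char *buf;
	size_t len;
	size_t data;
};

/*
 * Overwrite l bytes at offset off (may be negative) from the data
 * position with the vlen bytes at v. The edit is refused unless
 * vlen == l, so the packet never grows or shrinks, and unless the
 * whole range stays inside the buffer.
 * Returns 0 on success, -1 otherwise.
 */
int pkt_edit(struct pkt_view *p, long off, size_t l,
	     const unsigned char *v, size_t vlen)
{
	long start = (long)p->data + off;

	if (vlen != l)
		return -1;
	if (start < 0 || (size_t)start > p->len)
		return -1;		/* would start before head or past end */
	if (l > p->len - (size_t)start)
		return -1;		/* would run past end */
	memcpy(p->buf + start, v, l);
	return 0;
}
```

With the data position just after a 14-byte Ethernet header, the source MAC sits at offset -8 with length 6, so `pkt_edit(&p, -8, 6, vmac, 6)` would be the equivalent of an "edit smac" action.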

>
> 	As for VRRP I see it in this way. Note that I'm not a VRRP
> fan, I prefer the ARP methods for takeover, Of course, sometimes they
> can not work due to the bad non-Linux ARP stack implementations.
> As Alexandre noted once, the gratuitous ARP should not be slower
> than VRRP talks. Only that there are bad ARP cache implementations.
>

Yes, this is a big problem. But also, with some complex multi-VLAN switches,
gratuitous ARPs are not sufficient.

> 1. if remote hosts asks for lladdr of VRIP tc should modify our
> ARP reply: the SMAC in the eth header (using negative offset) and the
> SMAC in the ARP header. This is analog to:
> ip arp add to VRIP llsrc VMAC
>

I really like the brevity of the above;
the equivalent for me would be (my long-term plan to move ingress to below
IP has finally found an excuse)
tc filter add <DEV x> parent x:y protocol arp prio 10 u32 flowid x:z \
match sip VRIP action edit s/smac/VMAC action edit s/SMAC/VMAC

u32 needs to be taught about ARP so it can understand the different
ARP header fields like sip (shouldn't be that difficult)
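
To make the intended effect of that filter concrete, here is a rough userspace sketch of the two edits on an Ethernet/IPv4 ARP frame. The function name `arp_set_vmac` and the macros are hypothetical; the offsets follow the standard ARP layout (sender hardware address 8 bytes into the ARP header, sender protocol address 14 bytes in).

```c
#include <stddef.h>
#include <string.h>

#define ETH_HDR_LEN	14			/* dst MAC, src MAC, ethertype */
#define ETH_SRC_OFF	6
#define ARP_SHA_OFF	(ETH_HDR_LEN + 8)	/* sender hardware address */
#define ARP_SPA_OFF	(ETH_HDR_LEN + 14)	/* sender protocol (IP) address */
#define ARP_FRAME_LEN	(ETH_HDR_LEN + 28)

/*
 * If frame is an Ethernet/IPv4 ARP packet whose sender IP is vrip,
 * set both the Ethernet source MAC and the ARP sender hardware
 * address to vmac. Returns 1 if the frame was edited, 0 otherwise.
 */
int arp_set_vmac(unsigned char *frame, size_t len,
		 const unsigned char vrip[4], const unsigned char vmac[6])
{
	/* htype = Ethernet, ptype = IPv4, hlen = 6, plen = 4 */
	static const unsigned char arp_eth_ipv4[6] =
		{ 0x00, 0x01, 0x08, 0x00, 0x06, 0x04 };

	if (len < ARP_FRAME_LEN)
		return 0;
	if (frame[12] != 0x08 || frame[13] != 0x06)	/* ethertype ARP */
		return 0;
	if (memcmp(frame + ETH_HDR_LEN, arp_eth_ipv4, 6) != 0)
		return 0;
	if (memcmp(frame + ARP_SPA_OFF, vrip, 4) != 0)	/* "match sip VRIP" */
		return 0;

	memcpy(frame + ETH_SRC_OFF, vmac, 6);	/* smac in the eth header */
	memcpy(frame + ARP_SHA_OFF, vmac, 6);	/* SMAC in the ARP header */
	return 1;
}
```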

>
> 2. if our IP stack sends packet with saddr=VRIP that leads to ARP
> probe sent from our host then we should modify the packet in
> the same way as (1). This is analog to:
> ip arp add table output from VRIP llsrc VMAC
>

I don't see the difference between 1) and 2).

> 3. Replace the src MAC with proper VMAC for all IP packets with
> saddr=VRIP. This can be a neighbouring code job but difficult to
> implement there.

tc filter add <DEV x> parent x:y protocol ip prio 10 u32 flowid x:z \
match ip src VRIP action edit s/smac/VMAC

Did i understand this correctly?
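
For reference, what such a filter would have to do to each outgoing frame can be sketched in userspace C as below. `ip_set_smac` is a hypothetical helper, not an existing tc action; it assumes an untagged Ethernet frame carrying IPv4.

```c
#include <stddef.h>
#include <string.h>

/*
 * If frame is an IPv4 packet whose source address is vrip, replace the
 * Ethernet source MAC with vmac. Only the link-layer header is touched,
 * so no IP or transport checksum changes.
 * Returns 1 if the frame was edited, 0 otherwise.
 */
int ip_set_smac(unsigned char *frame, size_t len,
		const unsigned char vrip[4], const unsigned char vmac[6])
{
	if (len < 14 + 20)			/* eth header + minimal IPv4 header */
		return 0;
	if (frame[12] != 0x08 || frame[13] != 0x00)	/* ethertype IPv4 */
		return 0;
	if ((frame[14] >> 4) != 4)		/* IP version field */
		return 0;
	if (memcmp(frame + 14 + 12, vrip, 4) != 0)	/* "match ip src VRIP" */
		return 0;

	memcpy(frame + 6, vmac, 6);		/* source MAC in the eth header */
	return 1;
}
```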

>
> 4. Not last: NIC should accept traffic for all VMACs (promisc
> when attached to switched hubs is enough?) and eth_type_trans to maintain
> list of MAC aliases. I'm not sure that such list/hashtable with MACs
> should be attached per device - may be VRRP needs to announce one
> MAC through different interfaces? Also think for the Bridging
> code which calls eth_type_trans too.

I plan to move ingress to below IP, just before the bridging and tap
code; experiments show this works just fine.
So all the filters + edits going there should work fine. Thoughts?


cheers,
jamal


* Re: SIOCADDMULTI for unicast broken
  2003-01-06 13:44                     ` jamal
@ 2003-01-06 15:00                       ` Julian Anastasov
  2003-01-06 17:23                         ` jamal
  0 siblings, 1 reply; 21+ messages in thread
From: Julian Anastasov @ 2003-01-06 15:00 UTC (permalink / raw)
  To: jamal; +Cc: Donald Becker, Ben Greear, Jeff Garzik, Alexandre Cassen, netdev


	Hello,

On Mon, 6 Jan 2003, jamal wrote:

> > May be sort of: ... alter andmask 0xFF00 xormask 0x0023 at -4 ...
> > i.e. syntax similar to ipchains TOS and u32 match.
>
> I wanted to use u32 as the basis; which means u32 type matching is needed.
> then use vi/sed type substitution s/OL/V where:
> O =  offset (from skb->data, could be -ve),

	IMO, using skb->nh.raw as the basis is preferable. That way
the filters can be used from different places in the net stack.

> L = length (cant go beyond head or end),
> V is a static value configured (its size cant exceed L). V can also
> be computed off something example the data at offset O. I am trying to
> keep away from situations where L is larger or smaller than sizeof V
> so theres no mucking with any of the skb pointers or reallocing etc. In

	Yes, changing skb len is problematic mostly for TCP. As
for L and V: I assume they are HEX digits or there will be a way
to encode alphanumeric chars?

> the next iteration things could change. Note i havent written this but
> will in the near future (so anyone is welcome to hack on it)
> I didnt understand your andmask and xormask idea...

The above example:

- goto word at offset -4
- AND the 2 bytes with FF00
- XOR the 2 bytes with 0023

AND+XOR allow any operations for bits:

1) preserving (AND 1 XOR 0)
2) inverting (AND 1 XOR 1)
3) setting (AND 0 XOR 1)
4) clearing (AND 0 XOR 0)
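
A minimal C sketch of this AND+XOR edit, assuming a 16-bit field stored in network byte order and offsets measured from a base index into the buffer (standing in for skb->nh.raw). The name `alter16` and the index-based bounds check are illustrative only:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Apply new = (old & andmask) ^ xormask to the 16-bit big-endian word
 * located at base + offset inside buf (offset may be negative).
 * Returns 0 on success, -1 if the word would fall outside the buffer.
 */
int alter16(uint8_t *buf, size_t buf_len, size_t base, long offset,
	    uint16_t andmask, uint16_t xormask)
{
	long pos = (long)base + offset;
	uint16_t v;

	if (pos < 0 || (size_t)pos + 2 > buf_len)
		return -1;

	v = (uint16_t)((buf[pos] << 8) | buf[pos + 1]);
	v = (uint16_t)((v & andmask) ^ xormask);
	buf[pos] = (uint8_t)(v >> 8);
	buf[pos + 1] = (uint8_t)(v & 0xff);
	return 0;
}
```

With andmask 0xFF00 and xormask 0x0023, the high byte is preserved and the low byte is forced to 0x23, so a word holding 0x1234 becomes 0x1223.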

> equivalent for me would be (my longterm plan to move ingress to below
> IP has finally found an excuse)
> tc filter add <DEV x> parent x:y protocol arp prio 10 u32 flowid x:z \
> match sip VRIP action edit s/smac/VMAC action edit s/SMAC/VMAC
>
> u32 needs to be taught about ARP so it can understand different
> ARP header bits like sip (shouldnt be that difficult)

	Yes, we can teach u32 to know about ARP offsets,
ethhdr offsets...

> > ip arp add table output from VRIP llsrc VMAC
> >
>
> Dont see the difference between 1) and 2)

	No difference for tc; only iparp has this difference, because
it follows the routing.

> > 3. Replace the src MAC with proper VMAC for all IP packets with
> > saddr=VRIP. This can be a neighbouring code job but difficult to
> > implement there.
>
> tc filter add <DEV x> parent x:y protocol ip prio 10 u32 flowid x:z \
> match ip src VRIP action edit s/smac/VMAC
>
> Did i understand this correctly?

	Yes

> > MAC through different interfaces? Also think for the Bridging
> > code which calls eth_type_trans too.
>
> I plan to move ingress to below IP just before the bridging and tap
> code; experiments shows this works just fine.
> So all the filters + edits going there should work fine. Thoughts?

	I assume just after skb->nh.raw = skb->data;
Also, before or after deliver to ptype_all?

	I see one problem: egress is called after all csum calcs,
bad for IP (if tc is going to damage the payload), good for ethhdr, ARP.

> cheers,
> jamal

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: SIOCADDMULTI for unicast broken
  2003-01-04  1:52       ` Jeff Garzik
@ 2003-01-06 15:00         ` Krzysztof Halasa
  0 siblings, 0 replies; 21+ messages in thread
From: Krzysztof Halasa @ 2003-01-06 15:00 UTC (permalink / raw)
  To: netdev

Jeff Garzik <jgarzik@pobox.com> writes:

> No!  procfs additions are discouraged.  sysfs in 2.5.x if you _must_
> do this, but SIOCDEVPRIVATE or just flat out maintaining a kernel
> patch against a stable kernel tree would be much preferred, I think.

Still, SIOCDEVPRIVATE should _not_, in my opinion, be used for anything
but hacks.
For example, we should stop using it for configuring ethernet bridges:

net/bridge/br_device.c: if (cmd != SIOCDEVPRIVATE)
-- 
Krzysztof Halasa
Network Administrator


* Re: SIOCADDMULTI for unicast broken
  2003-01-06 15:00                       ` Julian Anastasov
@ 2003-01-06 17:23                         ` jamal
  0 siblings, 0 replies; 21+ messages in thread
From: jamal @ 2003-01-06 17:23 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Donald Becker, Ben Greear, Jeff Garzik, Alexandre Cassen, netdev



Julian,

On Mon, 6 Jan 2003, Julian Anastasov wrote:

>
> 	Hello,
>
> On Mon, 6 Jan 2003, jamal wrote:
>
> > > May be sort of: ... alter andmask 0xFF00 xormask 0x0023 at -4 ...
> > > i.e. syntax similar to ipchains TOS and u32 match.
> >
> > I wanted to use u32 as the basis; which means u32 type matching is needed.
> > then use vi/sed type substitution s/OL/V where:
> > O =  offset (from skb->data, could be -ve),
>
> 	IMO, using skb->nh.raw as basis is preferred. By this way
> the filters can be used from different places in the net stack.

Makes sense, let me think about it more and maybe experiment.
"different places in the stack"? I was thinking only ingress or egress.

>
> > L = length (cant go beyond head or end),
> > V is a static value configured (its size cant exceed L). V can also
> > be computed off something example the data at offset O. I am trying to
> > keep away from situations where L is larger or smaller than sizeof V
> > so theres no mucking with any of the skb pointers or reallocing etc. In
>
> 	Yes, changing skb len is problematic mostly for TCP. As
> for L and V: I assume they are HEX digits or there will be a way
> to encode alphanumeric chars?
>

We leave that to user space. User space can translate from strings for
example into hex. But yes, hex should be the default as is now.

> > the next iteration things could change. Note i havent written this but
> > will in the near future (so anyone is welcome to hack on it)
> > I didnt understand your andmask and xormask idea...
>
> The above example:
>
> - goto word at offset -4
> - AND the 2 bytes with FF00
> - XOR the 2 bytes with 0023
>
> AND+XOR allow any operations for bits:
>
> 1) preserving (AND 1 XOR 0)
> 2) inverting (AND 1 XOR 1)
> 3) setting (AND 0 XOR 1)
> 4) clearing (AND 0 XOR 0)
>

Ok, makes sense and oughta be used.

> > u32 needs to be taught about ARP so it can understand different
> > ARP header bits like sip (shouldnt be that difficult)
>
> 	Yes, we can teach u32 to know about ARP offsets,
> ethhdr offsets...

Ethernet headers are, I think, generic enough to be done first.

>
> > I plan to move ingress to below IP just before the bridging and tap
> > code; experiments shows this works just fine.
> > So all the filters + edits going there should work fine. Thoughts?
>
> 	I assume just after skb->nh.raw = skb->data;
> Also, before or after deliver to ptype_all?

Yes, before ptype_all, i.e. the first thing. But it could be a
programmability thing where you at least allow ptype_all to see things.

>
> 	I see one problem: egress is called after all csum calcs,
> bad for IP (if tc is going to damage the payload), good for ethhdr, ARP.
>

Well, unless you pass options to recalculate csums or make csum an
action by itself. I am thinking of "auxiliary" actions which could
be done by all actions - this could be one of them; logging is another;
"change classid" is another.
Let's take this discussion offline; I think we have gone beyond the topic
that started it all and people are hitting "D"s.
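
As a rough idea of what a standalone "csum" action would have to do after an edit to the IP header, here is a userspace sketch that recomputes the IPv4 header checksum (the standard ones' complement sum over 16-bit words). The function names are hypothetical, and it covers only the IP header; TCP/UDP checksums over the payload would need separate handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement checksum over len bytes, read as big-endian
 * 16-bit words (an odd trailing byte is padded with zero). */
uint16_t inet_csum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((data[i] << 8) | data[i + 1]);
	if (i < len)
		sum += (uint32_t)(data[i] << 8);

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Recompute and store the checksum of the IPv4 header at hdr.
 * The caller must ensure the full header (IHL * 4 bytes) is present. */
void ip_fix_csum(uint8_t *hdr)
{
	size_t ihl = (size_t)(hdr[0] & 0x0f) * 4;
	uint16_t c;

	hdr[10] = 0;
	hdr[11] = 0;
	c = inet_csum(hdr, ihl);
	hdr[10] = (uint8_t)(c >> 8);
	hdr[11] = (uint8_t)(c & 0xff);
}
```

Running `inet_csum` over a header that already carries a correct checksum yields 0, which makes for a cheap sanity check after an edit.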

cheers,
jamal


Thread overview: 21+ messages
2003-01-03 21:46 SIOCADDMULTI for unicast broken jamal
2003-01-04  0:07 ` Donald Becker
2003-01-04  1:18 ` Jeff Garzik
2003-01-04  1:39   ` Donald Becker
2003-01-04  1:45     ` Ben Greear
2003-01-04  1:52       ` Jeff Garzik
2003-01-06 15:00         ` Krzysztof Halasa
2003-01-04  2:18       ` Donald Becker
2003-01-04  4:11         ` jamal
2003-01-04  6:33           ` Donald Becker
2003-01-04 17:41             ` jamal
2003-01-04 18:24               ` Donald Becker
2003-01-04 18:55                 ` jamal
2003-01-04 18:36               ` Julian Anastasov
2003-01-04 19:04                 ` jamal
2003-01-05 11:45                   ` Julian Anastasov
2003-01-06 13:44                     ` jamal
2003-01-06 15:00                       ` Julian Anastasov
2003-01-06 17:23                         ` jamal
2003-01-04  7:32           ` Jeff Garzik
2003-01-04 17:43             ` jamal
