From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christian Borntraeger
Subject: Re: [PATCH] ipv6: alternative version of S/390 shared NIC support
Date: Wed, 19 Jan 2005 21:52:23 +0100
Message-ID: <200501192152.23865.cborntra@de.ibm.com>
References: <20050116115431.GA13617@lst.de> <200501181653.59041.christian@borntraeger.net> <1106142567.1049.970.camel@jzny.localdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Cc: Christoph Hellwig, "David S. Miller", pavlic@de.ibm.com, waldi@debian.org, netdev@oss.sgi.com
Return-path:
To: hadi@cyberus.ca
In-Reply-To: <1106142567.1049.970.camel@jzny.localdomain>
Content-Disposition: inline
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
List-Id: netdev.vger.kernel.org

jamal wrote:
[...]
> > > Can you provision multiple of these cards per VM? if yes, is there
> > > some ID that will break it down to OSInstance:cardid?
>
> You did not answer this question.
> Let me draw a diagram to show what i think the hierachy is:
>
> Physical Card: MAC address X
>
>
> +--- OSInstance A
>
> |    +-- "CARD" with IP A
> |    +-- "CARD" with IP B
> |    +-- "CARD" with IP C
> |    +-- "CARD" with IP D
>
> .
> .
> .
>
> +--- OSInstance N
>
>      +-- "CARD" with IP Z
>
>
> Is the above reflective of what happens?

No. In the simplest case it is a one-level hierarchy, but always outside
Linux. Just for demonstration purposes, IMAGINE that a card has only one
device address (devno):

  -----net-------\
                 ||
       |hardware          |
       \------------------/
         |       |       |             |
         |       |       |  /--devno---/
         |       |       |  |
         |       |       |  |
         .       .       .  .
         |       |       |  |
       Linux1  Linux2 .......  Linuxn

> In other words, packet comes from the wire (with MAC address X); somehow
> the hypervisor(?) or firmware figures based on IP address A (assuming no
> other instance has that IP) it has to send packet to OSInstanceA.

Right. Basically, the card has a table:

DEVNO   | IP address | ROUTER?
0-ffff  | ....       | primary, secondary

When a packet arrives, it has an IP address.
The card looks this address up in this table:

           |
           |
           V
  is it registered?  ----yes----> forward to the device number
           |
           |
           V
  is there a primary router?  ----yes----> forward to device number of
           |                               the primary router
           |
           |
           V
  is there a secondary router?  ----yes----> forward to device number of
           |                                 the secondary router
           |
           V
      drop packet.

> OSInstanceA then selects further the CARD based on something probably in
> a descriptor?

No. The packet arrives on one "CARD", Linux gets an "interrupt" and
fetches the packet from that card. You can of course have more cards per
Linux by providing more device numbers in the profile of the virtual
machine. But that makes no sense unless you are doing some testing or the
guest operating system is itself a hypervisor and can dispatch the device
addresses to its virtual machines.

>
> Let me get to the point:
> I think it would make sense for the "CARD" to be just another netdevice
> (call it "card" netdevice for this discussion).
> The representation of the physical card in the OSInstance is also a
> netdevice(call it physical netdevice for this discussion) as it is now
> (excpet it has no IP address ever).
> The "card netdevices" are stacked on top of the physical netdevice. This
> would be like an upside down bridge stacking relationship of
> netdevices....
> It actually is no different from a few tunnel netdevices that sit on top
> of say eth0 or multiple PPP devices on top of ethx in a PPPOE
> relationship.
> The demuxing for incoming packets is done at physical card netdevice
> to select the "card" netdevice whose receive method is then called.
> Reverse direction for transmit (we could go into details later, just
> wanna make sure this is sensible to begin with).

If I understood you right, you think that Linux has control of the real
hardware device and that the card itself is created in the qeth Linux
driver as some obscure thing. No, for Linux the device backed by the
3 device addresses is as real as you and me.
The qeth driver calls alloc_etherdev for the device represented by the
3 device addresses and thereby creates a netdevice for this card. That's
all we have: a card that only speaks IPv4 and IPv6 with us and that
handles ARP for us magically. To make it worse, the card only cooperates
if we tell it which IP addresses we want to use. (By the way, the card
recognizes duplicates.)

> Does this sound reasonable? If yes, then if you do this you wont need to
> hack anything like IPV6 etc in your driver - they become merely
> netdevices. It should also allow for all standard features like ifconfig
> up/down etc of the "card" and setting IP addresses, VLANS etc to work as
> is. And you wont need to put any speacilized code in the driver.
> If its off tangent, then i just wasted 1/2 a cup of coffee energy typing
> away ;->

Well, I think you haven't been fully aware of the way the card works. I
hope my explanation above makes it clearer. In one sentence: our Linux
qeth "CARDS" _are_ already netdevices. (We are speaking about struct
net_device, right?)
Furthermore, as long as there is only one MAC address we need one hack
for IPv6, and Christoph's version looks very nice and elegant.

>
> > Right, without registering the IP address, you can not receive any
> > packet.
>
> If this is firmware issue, it would be wise to fix it. You should be
> able to register multiple MAC addresses hidden in the firmware (not at
> the Linux level) and have your "cards" netdevice use them. i.e the
> "card" netdevices would own those.

Well, newer cards support something like this (see the hardware
announcement
http://www-306.ibm.com/common/ssi/fcgi-bin/ssialias?infotype=an&subtype=ca&supplier=897&appname=IBMLinkRedirect&letternum=ENUS104-346
). It's called Layer 2 support. Unfortunately this feature is only
available on a subset of cards and on newer zSeries machines.

Just for explanation, there is a good reason that you only see the
packets for your IP address: scalability of virtualization.
Imagine you have a gigabit card shared among 80 Linuxes (quite realistic
numbers). If you don't filter early, you have to forward all traffic to
every guest system (which can then discard unneeded packets). Then you
have to provide an internal bandwidth of 80 Gbit/sec. Not a good idea.
Therefore you have to multiplex very, very early and, for performance
reasons, cannot put every Linux into promiscuous mode.

Furthermore, it is not that easy to change the way the card works. There
are other operating systems like z/OS, z/VSE or z/VM running on the same
hardware. You don't hastily change the behaviour of hardware, firmware
and operating systems which run in lots of banks and insurance companies
with 99.999% availability without making sure that everything works
flawlessly.

> > As the logical network interface has no own MAC address you actually
> > speak IP to the card. That also means, that without some additional
> > effort, tools like tcpdump fail and you need some patches in the dhcp
> > tools.
>
> Refer to above. If you actually have your virtual/"card" netdevice on
> top of the physical netdevice have a MAC address, then all tools should
> work as is with zero changes IMO.

Right. But that's just not the way the hardware works. Pick almost any
random Linux driver to see ugly code that circumvents a hardware
limitation. Have a look at the number of traps in the kernel to fix CPU
"errata". There is no perfect hardware.

> > You can define options for routers to get more than your own packages,
> > but IIRC you can only define a primary and secondary router per port.
>
> Ok, this is new information - what does a "router" mean in relation to a
> port?

See above. You can have one primary and one secondary router.

I hope that helps to understand the problem. If not, please don't
hesitate to ask.

Christian
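
PS: The lookup the card does for each incoming packet (registered
address first, then primary router, then secondary router, else drop)
can be sketched roughly as below. This is only an illustration of the
steps described above, not firmware or qeth code. All names (Entry,
select_devno, the example device numbers and addresses) are made up for
the example, and Python is used just for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical role markers; the mail only states that a port can have
# one primary and one secondary router.
PRIMARY = "primary"
SECONDARY = "secondary"


@dataclass
class Entry:
    devno: int                    # device number, 0x0000-0xffff
    ip: str                       # IP address registered by the guest
    router: Optional[str] = None  # None, PRIMARY or SECONDARY


def select_devno(table: List[Entry], dst_ip: str) -> Optional[int]:
    """Return the device number a packet for dst_ip is forwarded to,
    or None if the packet is dropped."""
    # 1. Is the destination address registered? Forward to that devno.
    for entry in table:
        if entry.ip == dst_ip:
            return entry.devno
    # 2. Otherwise try the primary router, then the secondary router.
    for role in (PRIMARY, SECONDARY):
        for entry in table:
            if entry.router == role:
                return entry.devno
    # 3. Nobody registered the address and there is no router: drop.
    return None


# Made-up example table: two guests, the second one also acting as
# secondary router.
table = [
    Entry(devno=0x0600, ip="10.0.0.1"),
    Entry(devno=0x0700, ip="10.0.0.2", router=SECONDARY),
]
print(hex(select_devno(table, "10.0.0.1")))   # registered address
print(hex(select_devno(table, "192.0.2.9")))  # unknown, goes to the router
```

With no router entry in the table, select_devno returns None for an
unregistered address, which corresponds to the "drop packet" case.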