netdev.vger.kernel.org archive mirror

* anyone ever done multicast AF_UNIX sockets?
@ 2003-02-27 20:09 Chris Friesen
  2003-02-27 22:21 ` Greg Daley
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Chris Friesen @ 2003-02-27 20:09 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-net

It is fairly common to want to distribute information between a single 
sender and multiple receivers on a single box.

Multicast IP sockets are one possibility, but then you have additional 
overhead in the IP stack.

Unix sockets are more efficient and give notification if the listener is 
not present, but the problem then becomes that you must do one syscall 
for each listener.
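The per-listener cost described above can be sketched as follows. This is a minimal illustration rather than anything from the original post: the socket paths, the helper name send_to_all() and the payload are made up for the example, and it assumes a Linux-style AF_UNIX datagram implementation where sending to an address with no bound socket fails with an error.

```python
import os
import socket
import tempfile


def send_to_all(sender, paths, payload):
    """Send payload to every listener path; return the paths that failed."""
    failed = []
    for path in paths:
        try:
            sender.sendto(payload, path)  # one syscall per listener
        except OSError:
            # e.g. ECONNREFUSED or ENOENT when nothing is bound to the path
            failed.append(path)
    return failed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"listener{i}.sock") for i in range(3)]
        listeners = []
        for path in paths:
            s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
            s.bind(path)
            listeners.append(s)
        sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        missing = os.path.join(tmp, "absent.sock")  # nobody listens here
        failed = send_to_all(sender, paths + [missing], b"link down")
        received = [s.recv(64) for s in listeners]
        for s in listeners + [sender]:
            s.close()
        return received, failed == [missing]
```

The loop makes the trade-off visible: the sender learns about an absent listener, but pays one sendto() call per destination.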

So, here's my main point--has anyone ever considered the concept of 
multicast AF_UNIX sockets?

The main features would be:
--ability to associate/disassociate a socket with a multicast address
--ability to associate/disassociate with all multicast addresses 
(possibly through some kind of raw socket thing, or maybe a simple 
wildcard multicast address)
--on process death all sockets owned by that process are disassociated 
from any multicast addresses that they were associated with
--on sending a packet to a multicast address and there are no sockets 
associated with it, return -1 with errno=ECONNREFUSED

The association/disassociation could be done using the setsockopt() 
calls the same as with udp sockets, everything else would be the same 
from a userspace perspective.
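For reference, the existing UDP multicast calls that the proposal wants to mirror look roughly like this. The sketch only shows the IP side, which exists today; no AF_UNIX equivalent is implied. The group address 239.1.2.3 used in the usage note is an arbitrary example, and the helper names are hypothetical.

```python
import socket


def build_mreq(group, interface="0.0.0.0"):
    """Pack a struct ip_mreq: group address followed by local interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def join_group(sock, group, interface="0.0.0.0"):
    """Associate a UDP socket with an IP multicast group."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, interface))


def leave_group(sock, group, interface="0.0.0.0"):
    """Disassociate a UDP socket from an IP multicast group."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP,
                    build_mreq(group, interface))
```

A receiver would bind() a UDP socket as usual and then call join_group(sock, "239.1.2.3"); closing the socket drops the membership.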

Any thoughts?  How hard would this be to put in?

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: anyone ever done multicast AF_UNIX sockets?
  2003-02-27 20:09 anyone ever done multicast AF_UNIX sockets? Chris Friesen
@ 2003-02-27 22:21 ` Greg Daley
  2003-02-28 13:33 ` jamal
  2003-03-03 12:51 ` Terje Eggestad
  2 siblings, 0 replies; 24+ messages in thread
From: Greg Daley @ 2003-02-27 22:21 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, netdev, linux-net

Hi Chris,

Please check out the uml_switch
written by Jeff Dike for User Mode Linux.

It is a user-space program which emulates
an Ethernet switch (or hub).  It emulates
link-layer multicast on UNIX domain sockets.

Greg Daley

Chris Friesen wrote:
> 
> It is fairly common to want to distribute information between a single 
> sender and multiple receivers on a single box.
> 
> Multicast IP sockets are one possibility, but then you have additional 
> overhead in the IP stack.
> 
> Unix sockets are more efficient and give notification if the listener is 
> not present, but the problem then becomes that you must do one syscall 
> for each listener.
> 
> So, here's my main point--has anyone ever considered the concept of 
> multicast AF_UNIX sockets?
> 
> The main features would be:
> --ability to associate/disassociate a socket with a multicast address
> --ability to associate/disassociate with all multicast addresses 
> (possibly through some kind of raw socket thing, or maybe a simple 
> wildcard multicast address)
> --on process death all sockets owned by that process are disassociated 
> from any multicast addresses that they were associated with
> --on sending a packet to a multicast address and there are no sockets 
> associated with it, return -1 with errno=ECONNREFUSED
> 
> The association/disassociation could be done using the setsockopt() 
> calls the same as with udp sockets, everything else would be the same 
> from a userspace perspective.
> 
> Any thoughts?  How hard would this be to put in?
> 
> Chris
> 
> 




* Re: anyone ever done multicast AF_UNIX sockets?
  2003-02-27 20:09 anyone ever done multicast AF_UNIX sockets? Chris Friesen
  2003-02-27 22:21 ` Greg Daley
@ 2003-02-28 13:33 ` jamal
  2003-02-28 14:39   ` Chris Friesen
  2003-03-03 12:51 ` Terje Eggestad
  2 siblings, 1 reply; 24+ messages in thread
From: jamal @ 2003-02-28 13:33 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, netdev, linux-net

On Thu, 27 Feb 2003, Chris Friesen wrote:

>
> It is fairly common to want to distribute information between a single
> sender and multiple receivers on a single box.
>
> Multicast IP sockets are one possibility, but then you have additional
> overhead in the IP stack.
>

I think this is a _very weak_ reason.
Without addressing any of your other arguments, can you describe what
painful overhead you are talking about? Did you do any measurements,
and under what circumstances do unix sockets differ from, say,
localhost-bound UDP sockets? I am not looking for a hand-waving reason of
"but there's an IP stack".

cheers,
jamal


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-02-28 13:33 ` jamal
@ 2003-02-28 14:39   ` Chris Friesen
  2003-03-01  3:18     ` jamal
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Friesen @ 2003-02-28 14:39 UTC (permalink / raw)
  To: jamal; +Cc: linux-kernel, netdev, linux-net

jamal wrote:
> 
> On Thu, 27 Feb 2003, Chris Friesen wrote:

>>It is fairly common to want to distribute information between a single
>>sender and multiple receivers on a single box.

>>Multicast IP sockets are one possibility, but then you have additional
>>overhead in the IP stack.

> I think this is a _very weak_ reason.
> Without addressing any of your other arguments, can you describe what
> painful overhead you are talking about? Did you do any measurements,
> and under what circumstances do unix sockets differ from, say,
> localhost-bound UDP sockets? I am not looking for a hand-waving reason of
> "but there's an IP stack".

From lmbench local communication tests:

This is a multiproc 1GHz G4
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                         ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
pcary0z0. Linux 2.4.18- 0.600 3.756 6.58  10.2  26.4  13.8  36.9 599K
pcary0z0. Linux 2.4.18- 0.590 3.766 6.43  10.1  26.7  13.9  37.2 59.1


This is a 400MHz uniproc G4
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                         ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
zcarm0pd. Linux 2.2.17- 1.710 9.888 21.3  26.4  59.4  43.0 105.4 146.
zcarm0pd. Linux 2.2.17- 1.740 9.866 22.2  26.3  60.4  43.1 106.7 147.

This is a 1.8GHz P4
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                         ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
pcard0ks. Linux 2.4.18- 1.740  10.4 15.9  20.1  33.1  23.5  44.3 72.7
pcard0ks. Linux 2.4.18-        10.3 16.1  19.8  36.3  22.8  43.6 74.1
pcard0ks. Linux 2.4.18- 1.560  10.6 16.0  23.4  38.1  36.1  44.6 77.4


From these numbers, UDP has 18%-44% higher latency than AF_UNIX, with 
the difference going up as the machine speed goes up.
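The relative difference can be recomputed per row from the "AF UNIX" and "UDP" columns of the tables as posted. The sketch below is only a convenience for doing that; the post does not say which rows the quoted range was derived from, so the per-row figures it produces need not coincide with that range. The host labels are shorthand invented for the example.

```python
# (AF_UNIX, UDP) latencies in microseconds, copied from the rows above
ROWS = {
    "1GHz G4, 2.4.18": [(6.58, 10.2), (6.43, 10.1)],
    "400MHz G4, 2.2.17": [(21.3, 26.4), (22.2, 26.3)],
    "1.8GHz P4, 2.4.18": [(15.9, 20.1), (16.1, 19.8), (16.0, 23.4)],
}


def percent_higher(af_unix, udp):
    """How much higher the UDP latency is than AF_UNIX, in percent."""
    return (udp - af_unix) / af_unix * 100.0


def summarize(rows=ROWS):
    """Map each host label to its per-row percentage differences."""
    return {
        host: [percent_higher(a, u) for a, u in pairs]
        for host, pairs in rows.items()
    }


if __name__ == "__main__":
    for host, values in summarize().items():
        print(host, ", ".join(f"{v:.1f}%" for v in values))
```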

Aside from that, IP multicast doesn't seem to work properly.  I enabled 
multicast on lo and disabled it on eth0, and a ping to 224.0.0.1 still 
got responses from all the multicast-capable hosts on the network.  From 
userspace, multicast unix would be *simple* to use, as in totally 
transparent.

The other reason why I would like to see this happen is that it just 
makes *sense*, at least to me.  We've got multicast IP, so multicast 
unix for local machine access is a logical extension in my books.

Do we agree at least that some form of multicast is the logical solution 
to the case of one sender/many listeners?

Thanks for your thoughts,

Chris







* Re: anyone ever done multicast AF_UNIX sockets?
  2003-02-28 14:39   ` Chris Friesen
@ 2003-03-01  3:18     ` jamal
  2003-03-02  6:03       ` Chris Friesen
  0 siblings, 1 reply; 24+ messages in thread
From: jamal @ 2003-03-01  3:18 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, netdev, linux-net



On Fri, 28 Feb 2003, Chris Friesen wrote:

>  From lmbench local communication tests:
>
> This is a multiproc 1GHz G4
> Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
>                          ctxsw       UNIX         UDP         TCP conn
> --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
> pcary0z0. Linux 2.4.18- 0.600 3.756 6.58  10.2  26.4  13.8  36.9 599K
> pcary0z0. Linux 2.4.18- 0.590 3.766 6.43  10.1  26.7  13.9  37.2 59.1
>
>
> This is a 400MHz uniproc G4
> Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
>                          ctxsw       UNIX         UDP         TCP conn
> --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
> zcarm0pd. Linux 2.2.17- 1.710 9.888 21.3  26.4  59.4  43.0 105.4 146.
> zcarm0pd. Linux 2.2.17- 1.740 9.866 22.2  26.3  60.4  43.1 106.7 147.
>
> This is a 1.8GHz P4
> Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
>                          ctxsw       UNIX         UDP         TCP conn
> --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
> pcard0ks. Linux 2.4.18- 1.740  10.4 15.9  20.1  33.1  23.5  44.3 72.7
> pcard0ks. Linux 2.4.18-        10.3 16.1  19.8  36.3  22.8  43.6 74.1
> pcard0ks. Linux 2.4.18- 1.560  10.6 16.0  23.4  38.1  36.1  44.6 77.4
>
>
>  From these numbers, UDP has 18%-44% higher latency than AF_UNIX, with
> the difference going up as the machine speed goes up.
>

Did you also measure throughput?
You are overlooking the flexibility that already exists in IP based
transports as an advantage; the fact that you can make them
distributed instead of localized with a simple addressing change
is a very powerful abstraction.


> Aside from that, IP multicast doesn't seem to work properly.  I enabled
> multicast on lo and disabled it on eth0, and a ping to 224.0.0.1 still
> got responses from all the multicast-capable hosts on the network.

I think you may have something misconfigured.

> From
> userspace, multicast unix would be *simple* to use, as in totally
> transparent.
>

You could implement the abstraction in user space as a library today by
having some server that muxes to several registered clients.

> The other reason why I would like to see this happen is that it just
> makes *sense*, at least to me. We've got multicast IP, so multicast
> unix for local machine access is a logical extension in my books.
>

So what's the addressing scheme for multicast unix? Would it be a
reserved path?
I am actually indifferent: you could do this in user space for starters.
See if it buys you anything. Maybe you could do something clever with
passing unix file descriptors around to avoid a single server point of
failure etc.

> Do we agree at least that some form of multicast is the logical solution
> to the case of one sender/many listeners?
>

That's what the definition of mcast is. You need to weigh your options; the
cost is probably worth the flexibility you get with sockets.

cheers,
jamal


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-01  3:18     ` jamal
@ 2003-03-02  6:03       ` Chris Friesen
  2003-03-02 14:11         ` jamal
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Friesen @ 2003-03-02  6:03 UTC (permalink / raw)
  To: jamal; +Cc: linux-kernel, netdev, linux-net

jamal wrote:
> Did you also measure throughput?

No.  lmbench doesn't appear to test UDP socket local throughput.

> You are overlooking the flexibility that already exists in IP based
> transports as an advantage; the fact that you can make them
> distributed instead of localized with a simple addressing change
> is a very powerful abstraction.

True.  On the other hand, the same could be said about unicast IP 
sockets vs unix sockets.  Unix sockets exist for a reason, and I'm 
simply proposing to extend them.

>>From
>>userspace, multicast unix would be *simple* to use, as in totally
>>transparent.

> You could implement the abstraction in user space as a library today by
> having some server that muxes to several registered clients.

This is what we have now, though with a suboptimal solution (we 
inherited it from another group).  The disadvantage with this is that it 
adds a send/schedule/receive iteration.  If you have a small number of 
listeners this can have a large effect percentage-wise on your messaging 
cost.  The kernel approach also cuts the number of syscalls required by 
a factor of two compared to the server-based approach.
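The factor of two can be checked with a simple count. This is one reading of the claim, under the assumption that only send and receive calls are counted (no select()/poll() wakeups) and that every listener does one receive per message; the function names are illustrative.

```python
def syscalls_via_server(listeners):
    """Sender sends once; the server receives once and sends N times;
    each of the N listeners receives once."""
    return 1 + 1 + listeners + listeners


def syscalls_via_kernel_multicast(listeners):
    """Sender sends once; each of the N listeners receives once."""
    return 1 + listeners


if __name__ == "__main__":
    for n in (1, 3, 10, 100):
        print(n, syscalls_via_server(n), syscalls_via_kernel_multicast(n))
```

Under these assumptions the server-based path costs 2N+2 calls against N+1 for a single multicast send, which is exactly twice as many for any N.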

> So what's the addressing scheme for multicast unix? Would it be a
> reserved path?

Actually I was thinking it could be arbitrary, with a flag in the unix 
part of struct sock saying that it was actually a multicast address. 
The api would be something like the IP multicast one, where you get and 
bind a normal socket and then use setsockopt to attach yourself to one 
or more of multicast addresses.  A given address could be multicast or 
not, but they would reside in the same namespace and would collide as 
currently happens.  The only way to create a multicast address would be 
the setsockopt call--if the address doesn't already exist a socket would 
be created by the kernel and bound to the desired address.

To see if it's feasible I've actually coded up a proof-of-concept that 
seems to do fairly well. I tested it with a process sending an 8-byte 
packet containing a timestamp to three listeners, who checked the time 
on receipt and printed out the difference.
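A measurement of this kind can be sketched in a few lines. This is not the proof-of-concept mentioned above, which was not posted; it is a stand-in that uses ordinary AF_UNIX datagram socket pairs (one per listener) in a single process, packs the send time into an 8-byte payload, and computes the delay on receipt. The function names and the choice of time.monotonic() are assumptions.

```python
import socket
import struct
import time

FMT = "!d"  # one 8-byte double: the send time in seconds


def send_stamp(sock):
    """Send an 8-byte packet containing the current monotonic time."""
    sock.send(struct.pack(FMT, time.monotonic()))


def recv_delay_usec(sock):
    """Receive a stamp and return the elapsed time in microseconds."""
    (sent,) = struct.unpack(FMT, sock.recv(8))
    return (time.monotonic() - sent) * 1e6


def demo(n_listeners=3):
    pairs = [socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
             for _ in range(n_listeners)]
    try:
        for tx, _ in pairs:
            send_stamp(tx)  # one send per listener in this stand-in
        return [recv_delay_usec(rx) for _, rx in pairs]
    finally:
        for tx, rx in pairs:
            tx.close()
            rx.close()
```

Because everything runs in one process here, the numbers say nothing about scheduling cost; a real comparison would put each listener in its own process.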

For comparison I have two different userspace implementations, one with 
a server process (very simple for test purposes) and the other using an 
mmap'd file to store which process is listening to what messages.

The timings (in usec) for the delays to each of the listeners were as 
follows on my duron 750:

userspace server:     104 133 153
userspace no server:   72 111 138
kernelspace:           60  91 113

As you can see, the kernelspace code is the fastest, and since it's in the 
kernel it can be written to avoid being scheduled out while holding 
locks, which is hard to avoid with the no-server userspace option.

If this sounds at all interesting I would be glad to post a patch so you 
could shoot holes in it, otherwise I'll continue working on it privately.

Chris



* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-02  6:03       ` Chris Friesen
@ 2003-03-02 14:11         ` jamal
  2003-03-03 18:02           ` Chris Friesen
  0 siblings, 1 reply; 24+ messages in thread
From: jamal @ 2003-03-02 14:11 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, netdev, linux-net



On Sun, 2 Mar 2003, Chris Friesen wrote:

> jamal wrote:
> > Did you also measure throughput?
>
> No.  lmbench doesn't appear to test UDP socket local throughput.

I think you need to collect all data if you are trying to show
improvements.

>
> > You are overlooking the flexibility that already exists in IP based
> > transports as an advantage; the fact that you can make them
> > distributed instead of localized with a simple addressing change
> > is a very powerful abstraction.
>
> True.  On the other hand, the same could be said about unicast IP
> sockets vs unix sockets.  Unix sockets exist for a reason, and I'm
> simply proposing to extend them.
>

You are treading into areas where unix sockets make less sense compared to
sockets. Good design rules (should actually read "lazy design
rules"): sometimes you gotta move to a round peg instead of trying to make
the square one round.

> > You could implement the abstraction in user space as a library today by
> > having some server that muxes to several registered clients.
>
> This is what we have now, though with a suboptimal solution (we
> inherited it from another group).  The disadvantage with this is that it
> adds a send/schedule/receive iteration.  If you have a small number of
> listeners this can have a large effect percentage-wise on your messaging
> cost.  The kernel approach also cuts the number of syscalls required by
> a factor of two compared to the server-based approach.
>

Ok, so it's only a problem when you have a few listeners, i.e. the user space
scheme scales just fine as you keep adding listeners.
In your tests what was the break-even point?

> > So what's the addressing scheme for multicast unix? Would it be a
> > reserved path?
>
> Actually I was thinking it could be arbitrary, with a flag in the unix
> part of struct sock saying that it was actually a multicast address.
> The api would be something like the IP multicast one, where you get and
> bind a normal socket and then use setsockopt to attach yourself to one
> or more of multicast addresses.  A given address could be multicast or
> not, but they would reside in the same namespace and would collide as
> currently happens.  The only way to create a multicast address would be
> the setsockopt call--if the address doesn't already exist a socket would
> be created by the kernel and bound to the desired address.
>

Addressing has to be backward compatible, i.e. not affecting any other
program.

> To see if it's feasible I've actually coded up a proof-of-concept that
> seems to do fairly well. I tested it with a process sending an 8-byte
> packet containing a timestamp to three listeners, who checked the time
> on receipt and printed out the difference.
>
> For comparison I have two different userspace implementations, one with
> a server process (very simple for test purposes) and the other using an
> mmap'd file to store which process is listening to what messages.
>
> The timings (in usec) for the delays to each of the listeners were as
> follows on my duron 750:
>
> userspace server:     104 133 153
> userspace no server:   72 111 138
> kernelspace:           60  91 113
>
> As you can see, the kernelspace code is the fastest, and since it's in the
> kernel it can be written to avoid being scheduled out while holding
> locks, which is hard to avoid with the no-server userspace option.
>

Actually, the difference between user space server and kernel doesn't
appear that big. What you need to do is collect more data.
Repeat with an incrementing number of listeners.

> If this sounds at all interesting I would be glad to post a patch so you
> could shoot holes in it, otherwise I'll continue working on it privately.
>

No rush, let's see your test data first and then you gotta do a better
sales job on the cost/benefit/flexibility ratios.

cheers,
jamal


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-02 14:11         ` jamal
@ 2003-03-03 18:02           ` Chris Friesen
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Friesen @ 2003-03-03 18:02 UTC (permalink / raw)
  To: jamal; +Cc: linux-kernel, netdev, linux-net

jamal wrote:
> On Sun, 2 Mar 2003, Chris Friesen wrote:
>>jamal wrote:
>>>Did you also measure throughput?

>>No.  lmbench doesn't appear to test UDP socket local throughput.

> I think you need to collect all data if you are trying to show
> improvements.

I'll look at how they were measuring unix socket throughput and try 
implementing something similar for UDP.  It's not clear to me how to 
really measure throughput in a multicast environment though since it 
depends very much on your application messaging patterns.

> Ok, so it's only a problem when you have a few listeners, i.e. the user space
> scheme scales just fine as you keep adding listeners.
> In your tests what was the break-even point?

See below for more detailed test results.

> Addressing has to be backward compatible, i.e. not affecting any other
> program.

Of course.  The way I've designed it is that you get and bind() a socket 
as normal, and then use setsockopt() to register interest in a multicast 
address (same as IP multicast).  If the address already exists but is 
not a multicast address, then you get an error.  If a socket tries to 
bind() or connect() to an existing multicast address, you get an error.  
The different types of addresses exist in the same address space, but 
the only way to register interest in multicast addresses is through 
setsockopt().
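The address rules just described can be modelled in a few lines of plain Python. This is a toy model for reasoning about the semantics, not the kernel patch: the class and method names are invented, sockets are represented by arbitrary ids, and the choice of EADDRINUSE for collisions is an assumption (the message only says "you get an error"). ECONNREFUSED for a send with no listeners is taken from the original proposal.

```python
import errno


class AddressSpace:
    """Toy model: unicast and multicast names share one namespace."""

    def __init__(self):
        self.unicast = {}    # address -> owning socket id
        self.multicast = {}  # address -> set of member socket ids

    def bind(self, sock, addr):
        """Ordinary bind(); fails on any existing address."""
        if addr in self.unicast or addr in self.multicast:
            raise OSError(errno.EADDRINUSE, "address already in use")
        self.unicast[addr] = sock

    def join(self, sock, addr):
        """setsockopt()-style join; creates the group if it is new."""
        if addr in self.unicast:
            raise OSError(errno.EADDRINUSE, "not a multicast address")
        self.multicast.setdefault(addr, set()).add(sock)

    def close(self, sock):
        """On close or process death, drop the socket everywhere."""
        for members in self.multicast.values():
            members.discard(sock)
        self.unicast = {a: s for a, s in self.unicast.items() if s != sock}

    def send_multicast(self, addr, payload):
        """Deliver to every member; refuse if nobody is listening."""
        members = self.multicast.get(addr)
        if not members:
            raise OSError(errno.ECONNREFUSED, "no listeners")
        return {member: payload for member in members}
```

Whether an emptied group keeps its name reserved is not specified in the thread; this model keeps it, which is one possible choice.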

>>The timings (in usec) for the delays to each of the listeners were as
>>follows on my duron 750:
>>
>>userspace server:     104 133 153
>>userspace no server:   72 111 138
>>kernelspace:           60  91 113

> Actually, the difference between user space server and kernel doesn't
> appear that big. What you need to do is collect more data.
> Repeat with an incrementing number of listeners.

What would you consider a "big" difference?  Here the userspace server 
is 35% slower than the kernelspace version.

You wanted more data, so here's results comparing the no-server 
userspace method vs the kernel method.  The server-based one would be 
slightly more expensive than the no-server version.  The results below 
are the smallest and largest latencies (in usecs) for the message to 
reach the listeners in userspace.  I've used three different sizes, the 
two extremes and a roughly average sized message in my particular domain.

44-byte message (min,max latency in usec)
# listeners    userspace         kernelspace
10              73,335             103,252
20              72,610             106,429
50              74,1482            205,1301
100             76,3000            362,3425
200                                737,9917

236-byte message (min,max latency in usec)
# listeners    userspace         kernelspace
10              70,346              81,265
20              74,639             122,468
50              75,1557            230,1421
100             80,3107            408,3743

40036-byte message (min,max latency in usec)
# listeners    userspace         kernelspace
10             302,4181            322,1692
20             303,7491            347,3450
50             306,10451           483,8394
100            309,23107           697,17061
200            313,45528           997,39810

As one would expect, the initial latencies are somewhat higher for the 
kernel space solution since all the skb header duplication is done 
before anyone is woken up.  One thing that I did not expect was the 
increased max latency in the kernel space solution when the number of 
listeners grew large.  On reflection, however, I suspect that this is 
due to scheduler load since all of the listening processes have become 
runnable while in the userspace version they become runnable one at a 
time.  It would be interesting to run this on 2.5 with the O(1) 
scheduler and see if it makes a difference.

With larger message sizes, the cost of the additional copies in the 
userspace solution starts to outweigh the overhead of the additional 
runnable processes and the kernel space solution stays faster in all 
runs tested.

Chris



* Re: anyone ever done multicast AF_UNIX sockets?
  2003-02-27 20:09 anyone ever done multicast AF_UNIX sockets? Chris Friesen
  2003-02-27 22:21 ` Greg Daley
  2003-02-28 13:33 ` jamal
@ 2003-03-03 12:51 ` Terje Eggestad
  2003-03-03 12:35   ` David S. Miller
  2003-03-03 17:09   ` Chris Friesen
  2 siblings, 2 replies; 24+ messages in thread
From: Terje Eggestad @ 2003-03-03 12:51 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, netdev, linux-net


On a single box you would use a shared memory segment to do this. It has
the following advantages:
- no syscalls at all
- whenever the recipients need to use the info, they access the shm
directly (you may need to use a semaphore to enforce consistency, or if
you're really pressed for time, spin lock a shm location). There is no
need for the recipients to copy the info to private data structs. 
- there is no need for the recipients to waste cycles on processing an
update
- you KNOW that all the recipients have "updated" at the same time.
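The shared-memory approach listed above can be sketched with Python's multiprocessing.shared_memory module (Python 3.8 or later). The segment name, helper names and payload are invented for the example, and the locking that the list mentions (semaphore or spin lock) is deliberately left out, so this only shows the publish/attach mechanics.

```python
import os
from multiprocessing import shared_memory


def publish(name, payload):
    """Writer side: create a named segment and copy payload into it."""
    shm = shared_memory.SharedMemory(name=name, create=True,
                                     size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm


def read(name, length):
    """Reader side: attach to the segment and look at its contents."""
    shm = shared_memory.SharedMemory(name=name)
    # Copied to bytes only so the result outlives the mapping; a real
    # reader could work on shm.buf in place without copying.
    data = bytes(shm.buf[:length])
    shm.close()
    return data


def demo():
    name = f"mcast_demo_{os.getpid()}"  # hypothetical segment name
    segment = publish(name, b"link down")
    try:
        return read(name, 9)
    finally:
        segment.close()
        segment.unlink()
```

Note that nothing here tells a reader when the contents changed; that notification has to come from some other mechanism.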



That aside, your idea of being notified when the listener (peer) is not
there is pretty hopeless when it comes to multicasts. 

Why does it help you to know that there are no recipients, as opposed to the
wrong number of recipients? Or, asked differently: if you don't have a
notion of who the recipients are or should be, why would you care if there
are none?
There are practically no real applications for this feature. 


If you really want to know that a recipient disappeared, use 
a stream socket to each recipient, and to keep the # of syscalls down,
get the aio patch, and do the send to all with a single lio_listio()
call. 


Also: keep in mind that whether you do multicast or an explicit send to
all, the data you're sending is copied from your buffer to the destination
sockets' receive buffers anyway. If you're sending 1k you need somewhere
between 250 and 1000 cycles to do the copy, depending on alignment. I've
measured the syscall overhead for a write(len=0) to be about 800 cycles
on a P3 or Athlon, and about 2000 on a P4. If you really have enough
possible recipients, you should use a shm segment instead. If you have
only a few (~10) the overhead is worst case 20000 cycles, or on a 2 GHz P4,
10 microsecs to do a syscall for each. Who cares... 
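The arithmetic behind the 10 microsecond figure, using only the numbers given in the paragraph above (about 2000 cycles per syscall on a P4, roughly 10 recipients, a 2 GHz clock), can be written out as a small helper. The function name is illustrative.

```python
def syscall_overhead_usec(recipients, cycles_per_syscall, clock_hz):
    """Total per-recipient syscall overhead, in microseconds."""
    total_cycles = recipients * cycles_per_syscall
    cycles_per_usec = clock_hz / 1e6
    return total_cycles / cycles_per_usec


if __name__ == "__main__":
    # 10 recipients * 2000 cycles = 20000 cycles at 2000 cycles per usec
    print(syscall_overhead_usec(10, 2000, 2e9))
```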


TJ


On Thu, 2003-02-27 at 21:09, Chris Friesen wrote:
> It is fairly common to want to distribute information between a single 
> sender and multiple receivers on a single box.
> 
> Multicast IP sockets are one possibility, but then you have additional 
> overhead in the IP stack.
> 
> Unix sockets are more efficient and give notification if the listener is 
> not present, but the problem then becomes that you must do one syscall 
> for each listener.
> 
> So, here's my main point--has anyone ever considered the concept of 
> multicast AF_UNIX sockets?
> 
> The main features would be:
> --ability to associate/disassociate a socket with a multicast address
> --ability to associate/disassociate with all multicast addresses 
> (possibly through some kind of raw socket thing, or maybe a simple 
> wildcard multicast address)
> --on process death all sockets owned by that process are disassociated 
> from any multicast addresses that they were associated with
> --on sending a packet to a multicast address and there are no sockets 
> associated with it, return -1 with errno=ECONNREFUSED
> 
> The association/disassociation could be done using the setsockopt() 
> calls the same as with udp sockets, everything else would be the same 
> from a userspace perspective.
> 
> Any thoughts?  How hard would this be to put in?
> 
> Chris
-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 12:51 ` Terje Eggestad
@ 2003-03-03 12:35   ` David S. Miller
  2003-03-03 17:09   ` Chris Friesen
  1 sibling, 0 replies; 24+ messages in thread
From: David S. Miller @ 2003-03-03 12:35 UTC (permalink / raw)
  To: terje.eggestad; +Cc: cfriesen, linux-kernel, netdev, linux-net

   From: Terje Eggestad <terje.eggestad@scali.com>
   Date: 03 Mar 2003 13:51:17 +0100

   On a single box you would use a shared memory segment to do this.

Thank you for applying real brains to this problem :)


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 12:51 ` Terje Eggestad
  2003-03-03 12:35   ` David S. Miller
@ 2003-03-03 17:09   ` Chris Friesen
  2003-03-03 16:55     ` David S. Miller
  2003-03-03 19:39     ` Terje Eggestad
  1 sibling, 2 replies; 24+ messages in thread
From: Chris Friesen @ 2003-03-03 17:09 UTC (permalink / raw)
  To: Terje Eggestad; +Cc: linux-kernel, netdev, linux-net, davem

Terje Eggestad wrote:
> On a single box you would use a shared memory segment to do this. It has
> the following advantages:
> - no syscalls at all

Unless you poll for messages on the receiving side, how do you trigger 
the receiver to look for a message?  Shared memory doesn't have file 
descriptors.

> - whenever the recipients need to use the info, they access the shm
> directly (you may need to use a semaphore to enforce consistency, or if
> you're really pressed for time, spin lock a shm location). There is no
> need for the recipients to copy the info to private data structs.

How do they know the information has changed?  Suppose one process 
detects that the ethernet link has dropped.  How does it alert other 
processes which need to do something?

> Why does it help you to know that there are no recipients, as opposed to the
> wrong number of recipients? Or, asked differently: if you don't have a
> notion of who the recipients are or should be, why would you care if there
> are none?
> There are practically no real applications for this feature. 

It's true that if I have a nonzero number of listeners it doesn't tell 
me anything since I don't know if the right one is included.  However, 
if I send a message and there were *no* listeners but I know that there 
should be at least one, then I can log the anomaly, raise an alarm, or 
take whatever action is appropriate.

> Also: keep in mind that whether you do multicast or an explicit send to
> all, the data you're sending is copied from your buffer to the destination
> sockets' receive buffers anyway. If you're sending 1k you need somewhere
> between 250 and 1000 cycles to do the copy, depending on alignment. I've
> measured the syscall overhead for a write(len=0) to be about 800 cycles
> on a P3 or Athlon, and about 2000 on a P4. If you really have enough
> possible recipients, you should use a shm segment instead. If you have
> only a few (~10) the overhead is worst case 20000 cycles, or on a 2 GHz P4,
> 10 microsecs to do a syscall for each. Who cares... 

Granted, shared memory and SysV message queues are the fastest ways to 
transfer data between processes.  However, you still have to implement 
some way to alert the receiver that there is a message waiting for it.

For large packet sizes it may be sufficient to send a small unix socket 
message to alert it that there is a message waiting, but for small 
messages the cost of the copying is small compared to the cost of the 
context switch, and the unix multicast cuts the number of context 
switches in half.

Chris



* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 17:09   ` Chris Friesen
@ 2003-03-03 16:55     ` David S. Miller
  2003-03-03 18:07       ` Chris Friesen
  2003-03-03 19:39     ` Terje Eggestad
  1 sibling, 1 reply; 24+ messages in thread
From: David S. Miller @ 2003-03-03 16:55 UTC (permalink / raw)
  To: cfriesen; +Cc: terje.eggestad, linux-kernel, netdev, linux-net

   From: Chris Friesen <cfriesen@nortelnetworks.com>
   Date: Mon, 03 Mar 2003 12:09:37 -0500

   Unless you poll for messages on the receiving side, how do you trigger 
   the receiver to look for a message?

Send signals.  Use a FUTEX, be creative...


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 16:55     ` David S. Miller
@ 2003-03-03 18:07       ` Chris Friesen
  2003-03-03 17:56         ` David S. Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Friesen @ 2003-03-03 18:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: terje.eggestad, linux-kernel, netdev, linux-net

David S. Miller wrote:
>    From: Chris Friesen <cfriesen@nortelnetworks.com>
>    Date: Mon, 03 Mar 2003 12:09:37 -0500
> 
>    Unless you poll for messages on the receiving side, how do you trigger 
>    the receiver to look for a message?
> 
> Send signals.  Use a FUTEX, be creative...

Suppose I have a process that waits on UDP packets, the unified local 
IPC that we're discussing, other unix sockets, and stdin.  It's awfully 
nice if the local IPC can be handled using the same select/poll 
mechanism as all the other messaging.


Chris




-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 18:07       ` Chris Friesen
@ 2003-03-03 17:56         ` David S. Miller
  2003-03-03 19:11           ` Chris Friesen
  0 siblings, 1 reply; 24+ messages in thread
From: David S. Miller @ 2003-03-03 17:56 UTC (permalink / raw)
  To: cfriesen; +Cc: terje.eggestad, linux-kernel, netdev, linux-net

   From: Chris Friesen <cfriesen@nortelnetworks.com>
   Date: Mon, 03 Mar 2003 13:07:45 -0500

   Suppose I have a process that waits on UDP packets, the unified local 
   IPC that we're discussing, other unix sockets, and stdin.  It's awfully 
   nice if the local IPC can be handled using the same select/poll 
   mechanism as all the other messaging.

So use UDP, you still haven't backed up your performance
claims.  Experiment, set the SO_NO_CHECK socket option to
"1" and see if that makes a difference performance wise
for local clients.

But if performance is "so important", then you shouldn't really be
shying away from the shared memory suggestion and nothing is going to
top that (it eliminates all the copies, using flat out AF_UNIX over
UDP only truly eliminates some header processing, nothing more, the
copies are still there with AF_UNIX).


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 17:56         ` David S. Miller
@ 2003-03-03 19:11           ` Chris Friesen
  2003-03-03 18:56             ` David S. Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Friesen @ 2003-03-03 19:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: terje.eggestad, linux-kernel, netdev, linux-net

David S. Miller wrote:
>    From: Chris Friesen <cfriesen@nortelnetworks.com>
>    Date: Mon, 03 Mar 2003 13:07:45 -0500
>    
>    Suppose I have a process that waits on UDP packets, the unified local 
>    IPC that we're discussing, other unix sockets, and stdin.  It's awfully 
>    nice if the local IPC can be handled using the same select/poll 
>    mechanism as all the other messaging.
> 
> So use UDP, you still haven't backed up your performance
> claims.  Experiment, set the SO_NO_CHECK socket option to
> "1" and see if that makes a difference performance wise
> for local clients.

I did provide numbers for UDP latency, which is more critical for my own 
application since most messages fit within a single packet.  I haven't 
done UDP bandwidth testing--I need to check how lmbench did it for the 
unix socket and do the same for UDP.  Local TCP was far slower than unix 
sockets though.

> But if performance is "so important", then you shouldn't really be
> shying away from the shared memory suggestion and nothing is going to
> top that (it eliminates all the copies, using flat out AF_UNIX over
> UDP only truly eliminates some header processing, nothing more, the
> copies are still there with AF_UNIX).

Yes, I realize that the receiver still has to do a copy.  With large 
messages this could be an issue.  With small messages, I had assumed 
that the cost of a recv() wouldn't be that much worse than the cost of 
the sender doing a kill() to alert the receiver that a message is 
waiting.  Maybe I was wrong.

It might be interesting to try a combination of sysV msg queue and 
signals to see how it stacks up.  Project for tonight.

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 19:11           ` Chris Friesen
@ 2003-03-03 18:56             ` David S. Miller
  2003-03-03 19:42               ` Terje Eggestad
  0 siblings, 1 reply; 24+ messages in thread
From: David S. Miller @ 2003-03-03 18:56 UTC (permalink / raw)
  To: cfriesen; +Cc: terje.eggestad, linux-kernel, netdev, linux-net

   From: Chris Friesen <cfriesen@nortelnetworks.com>
   Date: Mon, 03 Mar 2003 14:11:07 -0500

   I haven't done UDP bandwidth testing--I need to check how lmbench
   did it for the unix socket and do the same for UDP.  Local TCP was
   far slower than unix sockets though.

That result is system specific and depends upon how the data and
datastructures hit the cpu cachelines in the kernel.

TCP bandwidth is slightly faster than AF_UNIX bandwidth on my
sparc64 boxes for example.


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 18:56             ` David S. Miller
@ 2003-03-03 19:42               ` Terje Eggestad
  2003-03-03 21:32                 ` Chris Friesen
  0 siblings, 1 reply; 24+ messages in thread
From: Terje Eggestad @ 2003-03-03 19:42 UTC (permalink / raw)
  To: David S. Miller; +Cc: cfriesen, linux-kernel, netdev, linux-net

On Mon, 2003-03-03 at 19:56, David S. Miller wrote:
       From: Chris Friesen <cfriesen@nortelnetworks.com>
       Date: Mon, 03 Mar 2003 14:11:07 -0500
       
       I haven't done UDP bandwidth testing--I need to check how lmbench
       did it for the unix socket and do the same for UDP.  Local TCP was
       far slower than unix sockets though.
    
    That result is system specific and depends upon how the data and
    datastructures hit the cpu cachelines in the kernel.
    
    TCP bandwidth is slightly faster than AF_UNIX bandwidth on my
    sparc64 boxes for example.

I've seen that they are the same on Linux. I tried to do AF_UNIX
instead of AF_INET internally to boost perf, but to no avail. Makes you
suspect that the loopback device actually creates an AF_UNIX connection
under the hood ;-)


-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 19:42               ` Terje Eggestad
@ 2003-03-03 21:32                 ` Chris Friesen
  2003-03-03 23:38                   ` Terje Eggestad
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Friesen @ 2003-03-03 21:32 UTC (permalink / raw)
  To: Terje Eggestad; +Cc: David S. Miller, linux-kernel, netdev, linux-net

Terje Eggestad wrote:
> On Mon, 2003-03-03 at 19:56, David S. Miller wrote:

>     TCP bandwidth is slightly faster than AF_UNIX bandwidth on my
>     sparc64 boxes for example.
> 
> I've seen that they are the same on Linux. I tried to do AF_UNIX
> instead of AF_INET internally to boost perf, but to no avail. Makes you
> suspect that the loopback device actually creates an AF_UNIX connection
> under the hood ;-)

On my P4 1.8GHz, AF_INET vs AF_UNIX looks like this:


*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------
Host           OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                   ctxsw       UNIX         UDP         TCP conn
--------- ------- ----- ----- ---- ----- ----- ----- ----- ----
pcard0ks. 2.4.18- 1.740  10.4 15.9  20.1  33.1  23.5  44.3 72.7
pcard0ks. 2.4.18- 1.560  10.6 16.0  23.4  38.1  36.1  44.6 77.4


*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host          OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                        UNIX      reread reread (libc) (hand) read write
--------- ------- ---- ---- ---- ------ ------ ------ ------ ---- -----
pcard0ks. 2.4.18- 650. 677. 151.  721.9  958.0  290.8  288.8 955. 418.4
pcard0ks. 2.4.18- 379. 701. 163.  714.8  949.5  289.5  288.5 956. 420.5


On this machine at least, UDP latency is 25% worse than AF_UNIX, and TCP 
bandwidth is about 22% that of AF_UNIX.
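
The percentages can be recomputed from the tables with the small sketch below (the helper names are made up). Using the first row, UDP latency of 20.1 against 15.9 for AF_UNIX is about 26% worse, and TCP bandwidth of 151 against 677 is about 22%. The second row gives roughly 46% and 23%, so the latency gap varies noticeably between runs:

```c
/* How much worse `other` is than `base`, in percent of `base`. */
static double percent_worse(double base, double other)
{
	return (other - base) / base * 100.0;
}

/* `other` expressed as a percentage of `base`. */
static double percent_of(double base, double other)
{
	return other / base * 100.0;
}
```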

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com



* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 21:32                 ` Chris Friesen
@ 2003-03-03 23:38                   ` Terje Eggestad
  0 siblings, 0 replies; 24+ messages in thread
From: Terje Eggestad @ 2003-03-03 23:38 UTC (permalink / raw)
  To: Chris Friesen; +Cc: David S. Miller, linux-kernel, netdev, linux-net

As for the latency, I believe a 25% increase doesn't matter all that much
(routinely sending messages in under a microsecond).

That TCP BW is ridiculously low; make sure that you run with good-sized
socket buffers, and that TCP windowing is enabled.

But then again, if you want to send much data fast between processes, a
stream socket is a pretty bad idea anyway. Use instead:
a) shm
b) mmap a file, write into it, and send the filename to the other side,
then mmap it there.
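
Option b) could be sketched as follows. publish() and attach() are hypothetical names; how the file name reaches the other process (a socket message, a command-line argument) is left out, and the payload must be at least one byte because mmap() rejects zero-length mappings:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Producer: create `path`, size it, and fill it through a shared mapping. */
static int publish(const char *path, const void *data, size_t len)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	void *p;

	if (fd == -1)
		return -1;
	if (ftruncate(fd, (off_t)len) == -1) {
		close(fd);
		return -1;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping stays valid after close() */
	if (p == MAP_FAILED)
		return -1;
	memcpy(p, data, len);
	munmap(p, len);
	return 0;
}

/* Consumer: given the file name, map the file read-only.
 * Returns NULL on failure; on success *len receives the size. */
static const void *attach(const char *path, size_t *len)
{
	struct stat st;
	int fd = open(path, O_RDONLY);
	void *p;

	if (fd == -1)
		return NULL;
	if (fstat(fd, &st) == -1 || st.st_size == 0) {
		close(fd);
		return NULL;
	}
	p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return NULL;
	*len = (size_t)st.st_size;
	return p;
}
```

Only the file name crosses the socket; the data itself is shared through the page cache instead of being copied through a stream.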

Don't underestimate the BW of a FedEx'ed tape.

TJ

On Mon, 2003-03-03 at 22:32, Chris Friesen wrote:
    Terje Eggestad wrote:
    > On Mon, 2003-03-03 at 19:56, David S. Miller wrote:
    
    >     TCP bandwidth is slightly faster than AF_UNIX bandwidth on my
    >     sparc64 boxes for example.
    > 
    > I've seen that they are the same on Linux. I tried to do AF_UNIX
    > instead of AF_INET internally to boost perf, but to no avail. Makes you
    > suspect that the loopback device actually creates an AF_UNIX connection
    > under the hood ;-)
    
    On my P4 1.8GHz, AF_INET vs AF_UNIX looks like this:
    
    
    *Local* Communication latencies in microseconds - smaller is better
    -------------------------------------------------------------
    Host           OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                       ctxsw       UNIX         UDP         TCP conn
    --------- ------- ----- ----- ---- ----- ----- ----- ----- ----
    pcard0ks. 2.4.18- 1.740  10.4 15.9  20.1  33.1  23.5  44.3 72.7
    pcard0ks. 2.4.18- 1.560  10.6 16.0  23.4  38.1  36.1  44.6 77.4
    
    
    *Local* Communication bandwidths in MB/s - bigger is better
    -----------------------------------------------------------
    Host          OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                            UNIX      reread reread (libc) (hand) read write
    --------- ------- ---- ---- ---- ------ ------ ------ ------ ---- -----
    pcard0ks. 2.4.18- 650. 677. 151.  721.9  958.0  290.8  288.8 955. 418.4
    pcard0ks. 2.4.18- 379. 701. 163.  714.8  949.5  289.5  288.5 956. 420.5
    
    
    On this machine at least, UDP latency is 25% worse than AF_UNIX, and TCP 
    bandwidth is about 22% that of AF_UNIX.
    
    Chris
    
-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 17:09   ` Chris Friesen
  2003-03-03 16:55     ` David S. Miller
@ 2003-03-03 19:39     ` Terje Eggestad
  2003-03-03 22:29       ` Chris Friesen
  1 sibling, 1 reply; 24+ messages in thread
From: Terje Eggestad @ 2003-03-03 19:39 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, netdev, linux-net, davem

On Mon, 2003-03-03 at 18:09, Chris Friesen wrote:
    Terje Eggestad wrote:
    > On a single box you would use a shared memory segment to do this. It has
    > the following advantages:
    > - no syscalls at all
    
    Unless you poll for messages on the receiving side, how do you trigger 
    the receiver to look for a message?  Shared memory doesn't have file 
    descriptors.
    
OK, you want multicast to send the *same* info to all peers. One of only
two sane reasons to do that is to update the peers with some info they
need to do real work. So when there is real work to be done, the info is
available in the shm. 

The other reason is to tell the others to die. Then you either a) have a
socket/pipe connected that you get an end-of-file event on, or b) have
a timeout on the select() (in any real-life app you should anyway) so
that when select/poll times out, or returns -1 with errno=EINTR, you
check some flags in shm. 

If you *had* multicast, you don't know *when* a peer processed it. 
What if the peer is suspended? You don't get an error on the send,
and you apparently never get an answer, then what? The peer may also
have gone haywire in a while(1);
   
I have an OSS project (http://midway.sourceforge.net/) where I
have a gateway daemon that polls on a large set of sockets (TCP/IP
clients) and passes the requests to IPC servers, and back. The way I'm
doing that is to have two threads, one in a blocking wait on the
select/poll, the other on msgrcv. Works quite well. 

  
    > - whenever the recipients need to use the info, they access the shm
    > directly (you may need to use a semaphore to enforce consistency, or if
    > you're really pressed on time, spin lock a shm location) There is no
    > need for the recipients to copy the info to private data structs.
    
    How do they know the information has changed?  Suppose one process 
    detects that the ethernet link has dropped.  How does it alert other 
    processes which need to do something?
    
Again, if you want someone to do something, they must ack the request
before you can safely assume that they are going to do something.
 
    > Why does it help you to know that there are no recipients contra the
    > wrong number recipients ???? OR asked differently, if you don't have a
    > notion of who the recipients are/should be, why would you care if there
    > are none??????
    > There are practically no real applications for this feature. 
    
    It's true that if I have a nonzero number of listeners it doesn't tell 
    me anything since I don't know if the right one is included.  However, 
    if I send a message and there were *no* listeners but I know that there 
    should be at least one, then I can log the anomaly, raise an alarm, or 
    take whatever action is appropriate.
    
    > Also: Keep in mind that whether you do multicast or an explicit send to
    > all, the data you're sending are copied from your buffer to the dest
    > sockets' recv buffers anyway. If you're sending 1k you need somewhere
    > between 250 to 1000 cycles to do the copy, depending on alignment. I've
    > measured the syscall overhead for a write(len=0) to be about 800 cycles
    > on a P3 or athlon, and about 2000 on P4. If you really have enough
    > possible recipients, you should use a shm segment instead. If you have
    > only a few (~10) the overhead is worst case 20000 cycles, or on a 2G P4,
    > 10 microsecs to do a syscall for each. Who cares... 
    
    Granted, shared memory (or sysV message queues) are the fastest way to 
    transfer data between processes.  However, you still have to implement 
    some way to alert the receiver that there is a message waiting for it.
    
    For large packet sizes it may be sufficient to send a small unix socket 
    message to alert it that there is a message waiting, but for small 
    messages the cost of the copying is small compared to the cost of the 
    context switch, and the unix multicast cuts the number of context 
    switches in half.
    
    Chris
    



-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________


* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 19:39     ` Terje Eggestad
@ 2003-03-03 22:29       ` Chris Friesen
  2003-03-03 23:29         ` Terje Eggestad
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Friesen @ 2003-03-03 22:29 UTC (permalink / raw)
  To: Terje Eggestad; +Cc: linux-kernel, netdev, linux-net, davem

Terje Eggestad wrote:
> On Mon, 2003-03-03 at 18:09, Chris Friesen wrote:
>     Terje Eggestad wrote:
>     > On a single box you would use a shared memory segment to do this. It has
>     > the following advantages:
>     > - no syscalls at all
>     
>     Unless you poll for messages on the receiving side, how do you trigger 
>     the receiver to look for a message?  Shared memory doesn't have file 
>     descriptors.
>     
> OK, you want multicast to send the *same* info to all peers. One of only
> two sane reasons to do that is to update the peers with some info they
> need to do real work. So when there is real work to be done, the info is
> available in the shm. 

Okay, but how do they know there is work to be done?  They're waiting in 
select() monitoring sockets, fds, being hit with signals, etc.  How do 
you tell them to check their messages?  You have to hit them over the 
head with a signal or something and tell them to check the shared memory 
messages.

> If you *had* multicast, you don't know *when* a peer processed it. 
> What if the peer is suspended? You don't get an error on the send,
> and you apparently never get an answer, then what? The peer may also
> have gone haywire in a while(1);

Exactly.  So if the message got delivered you have no way of knowing for 
sure that it was processed and you have application-level timers and 
stuff. But if the message wasn't delivered to anyone and you know it 
should have been, then you don't have to wait for the timer to expire to 
know that they didn't get it.

>     How do they know the information has changed?  Suppose one process 
>     detects that the ethernet link has dropped.  How does it alert other 
>     processes which need to do something?
>     
> Again, if you want someone to do something, they must ack the request
> before you can safely assume that they are going to do something.

Certainly.  My point was that if you're trying to handle all events in a 
single thread, you need some way to tell the message recipient that it 
needs to check the shared memory buffer.  Otherwise you need multiple 
threads like you mentioned in your project description.


Chris


-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com



* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 22:29       ` Chris Friesen
@ 2003-03-03 23:29         ` Terje Eggestad
  2003-03-04  2:38           ` jamal
  0 siblings, 1 reply; 24+ messages in thread
From: Terje Eggestad @ 2003-03-03 23:29 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, netdev, linux-net, davem

My point is that you can't send a request for real work with either shm or
multicast. You don't know who or how many recipients there are. You just
use it to update someone that does real work. Then they tend not to need
it until they get a request for real work, which almost always arrives on
a TCP connection or as a UDP (unicast) message.

How do you design a protocol that uses multicast to send a request to do
work?

All uses I can think of right now of multicast/broadcast are:
* Discovery, like in NIS.
* Announcements, like in OSPF.
* Updates, like in NTP broadcast.

DHCP is actually a nice example of the very bad things that happen if
you lose control of how many servers are running.

On Mon, 2003-03-03 at 23:29, Chris Friesen wrote:
    Terje Eggestad wrote:
    > On Mon, 2003-03-03 at 18:09, Chris Friesen wrote:

    > If you *had* multicast, you don't know *when* a peer processed it. 
    > What if the peer is suspended? You don't get an error on the send,
    > and you apparently never get an answer, then what? The peer may also
    > have gone haywire in a while(1);
    
    Exactly.  So if the message got delivered you have no way of knowing for 
    sure that it was processed and you have application-level timers and 
    stuff. But if the message wasn't delivered to anyone and you know it 
    should have been, then you don't have to wait for the timer to expire to 
    know that they didn't get it.
    

Nice to know, but how does it help you?

What if there is a subscriber out there that is hung? You need that timer
*anyway*. Why the special case? 

All I see you're trying to do is something like this (just the
nonblocking version):


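#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Sketch only: fd_unixmulticast and timeout are set up elsewhere, and a
 * SIGALRM handler must be installed so that alarm() interrupts select(). */
void do_unix_mcast(const void *message, size_t mlen)
{
	fd_set rfds;
	int rc;

	alarm(timeout);

	rc = write(fd_unixmulticast, message, mlen);
	/* ECONNREFUSED is the "no subscribers" error proposed in this thread */
	if (rc == -1 && errno == ECONNREFUSED)
		goto they_are_all_dead;

	/* wait for a reply; the alarm turns a hung peer into EINTR */
	FD_ZERO(&rfds);
	FD_SET(fd_unixmulticast, &rfds);
	rc = select(fd_unixmulticast + 1, &rfds, NULL, NULL, NULL);
	if (rc == -1 && errno == EINTR)
		goto they_are_all_dead;
	alarm(0);
	process_reply();
	return;

they_are_all_dead:
	alarm(0);
	handle_all_dead_peers();
}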
    
    Chris
    
    
-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: anyone ever done multicast AF_UNIX sockets?
  2003-03-03 23:29         ` Terje Eggestad
@ 2003-03-04  2:38           ` jamal
  0 siblings, 0 replies; 24+ messages in thread
From: jamal @ 2003-03-04  2:38 UTC (permalink / raw)
  To: Terje Eggestad; +Cc: Chris Friesen, linux-kernel, netdev, linux-net, davem

Hi Terje,

On Mon, 4 Mar 2003, Terje Eggestad wrote:

> How do you design a protocol that uses multicast to send a request to do
> work?
>
> All uses I can think of right now of multicast/broadcast are:
> * Discovery, like in NIS.
> * Announcements, like in OSPF.
> * Updates, like in NTP broadcast.
>

I know we are digressing away from main discussion ...

The concept of reliable multicast is known to be useful.
Look at (for some sample apps):
http://www.ietf.org/html.charters/rmt-charter.html

But we are talking about a distributed system in that context.

Agreed, reliability and multicast do not always make sense.

cheers,
jamal


* Re: anyone ever done multicast AF_UNIX sockets?
       [not found]           ` <3E6398C4.2020605@nortelnetworks.com.suse.lists.linux.kernel>
@ 2003-03-03 18:18             ` Andi Kleen
  0 siblings, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2003-03-03 18:18 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, netdev, linux-net, hadi

Chris Friesen <cfriesen@nortelnetworks.com> writes:

> I'll look at how they were measuring unix socket throughput and try 
> implementing something similar for UDP.  It's not clear to me how to 
> really measure throughput in a multicast environment though since it 
> depends very much on your application messaging patterns.

Unix sockets are often slower than TCP over loopback because they use
much smaller socket buffer sizes by default. This causes many more
context switches.

Just run a vmstat 1 in parallel and watch the context switch rates.

You can fix it by increasing the send and receive buffers of the unix
socket.
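
A minimal sketch of that fix, using the standard SO_SNDBUF and SO_RCVBUF options. The helper names are made up. On Linux the kernel caps the request at net.core.wmem_max / net.core.rmem_max and reports back roughly double the value set, so it is worth reading the result back:

```c
#include <sys/socket.h>

/* Request `bytes` for both the send and the receive buffer of `fd`.
 * Returns 0 on success, -1 on failure. */
static int set_socket_buffers(int fd, int bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
		       &bytes, sizeof(bytes)) == -1)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
		       &bytes, sizeof(bytes)) == -1)
		return -1;
	return 0;
}

/* Read back the effective send buffer size, or -1 on failure. */
static int get_send_buffer(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) == -1)
		return -1;
	return val;
}
```

Applying this to both ends of the unix socket before running the bandwidth test, while watching the context switch rate in vmstat, would show whether buffer size explains the difference.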

-Andi
