* Sockets inside the kernel or userspace?
From: Daniel Bonekeeper @ 2006-06-30 7:32 UTC
To: netdev
Let's suppose that I'm writing an experimental distributed filesystem
that needs to open TCP sockets to other machines on the LAN, keep a
pool of connections, and always be aware of new data arriving (like a
userspace select()). What's the best approach to implementing this? Is
it better to keep all the TCP socket handling in userspace and use an
interface like netlink to talk to it? Or, since we're talking about a
filesystem (where performance is a must), is it better to keep it in
kernel mode?
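To make the userspace option concrete, here is a minimal sketch of the
select() pool I have in mind (the addresses, port and pool size are
invented):

#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/select.h>
#include <sys/socket.h>

#define NODES 4                             /* invented pool size */

int main(void)
{
    int fds[NODES], i;
    struct sockaddr_in sa;
    char buf[4096];

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(7000);              /* hypothetical dfsd port */

    /* Open one TCP connection per node. */
    for (i = 0; i < NODES; i++) {
        fds[i] = socket(AF_INET, SOCK_STREAM, 0);
        sa.sin_addr.s_addr = htonl(0x0a000001 + i); /* 10.0.0.1, ... */
        if (connect(fds[i], (struct sockaddr *)&sa, sizeof(sa)) < 0)
            return 1;
    }

    /* Wait for data on any connection in the pool. */
    for (;;) {
        fd_set rfds;
        int maxfd = -1;

        FD_ZERO(&rfds);
        for (i = 0; i < NODES; i++) {
            FD_SET(fds[i], &rfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            break;
        for (i = 0; i < NODES; i++)
            if (FD_ISSET(fds[i], &rfds))
                read(fds[i], buf, sizeof(buf)); /* parse data here */
    }
    return 0;
}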
Thanks!
--
"Quanto mais conheço os homens, mais gosto dos meus cavalos"
- João Figueiredo, former Brazilian president.
* Re: Sockets inside the kernel or userspace?
From: Evgeniy Polyakov @ 2006-06-30 7:57 UTC
To: Daniel Bonekeeper; +Cc: netdev
On Fri, Jun 30, 2006 at 03:32:28AM -0400, Daniel Bonekeeper (thehazard@gmail.com) wrote:
> Let's suppose that I'm writing an experimental distributed filesystem
> that needs to open TCP sockets to other machines on the LAN, keep a
> pool of connections, and always be aware of new data arriving (like a
> userspace select()). What's the best approach to implementing this? Is
> it better to keep all the TCP socket handling in userspace and use an
> interface like netlink to talk to it? Or, since we're talking about a
> filesystem (where performance is a must), is it better to keep it in
> kernel mode?
It depends on your design.
NFS uses in-kernel sockets, but userspace can easily fill a 1Gbit link too.
A filesystem must eliminate as much copying as possible, but without
digging deep into the socket code you will get one copy either way: in
kernelspace it is a copy from the socket queue into your buffer, and in
userspace it is the same copy done with the slower copy_to_user()
(depending on the size of each copy this can make a noticeable
difference). With a kernel socket your data ends up in the fs/vfs cache
already, while with a userspace socket you must copy it back into the
kernel using the slow copy_from_user(). However, if the data is
supposed to be (heavily) processed before reaching the hard drive (for
example compressed or encrypted), the cost of that processing can
completely hide the cost of the copy itself, and then userspace is much
preferable thanks to its far more convenient development process.
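For comparison, the in-kernel receive path is roughly this (only a
sketch - the exact signatures have changed between kernel versions,
newer kernels pass a struct net to sock_create_kern(), and error
handling is omitted):

#include <linux/net.h>
#include <linux/in.h>
#include <linux/uio.h>
#include <net/sock.h>

static int dfs_socket_create(struct socket **sockp)
{
    /* In-kernel socket: no file descriptor, no userspace involved. */
    return sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, sockp);
    /* ...then connect via sock->ops->connect() to the remote node. */
}

static int dfs_recv(struct socket *sock, void *buf, size_t len)
{
    struct msghdr msg = { .msg_flags = MSG_WAITALL };
    struct kvec vec = { .iov_base = buf, .iov_len = len };

    /* Copies from the socket queue straight into a kernel buffer
     * (which could already be a page-cache page), so copy_to_user()
     * never enters the picture. */
    return kernel_recvmsg(sock, &msg, &vec, 1, len, msg.msg_flags);
}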
--
Evgeniy Polyakov
* Re: Sockets inside the kernel or userspace?
From: Daniel Bonekeeper @ 2006-06-30 8:45 UTC
To: Evgeniy Polyakov; +Cc: netdev
Thanks for the thoughts, Evgeniy!
Well... I was thinking of developing something like this (nothing
actually very useful, just a little "something" to get me more
comfortable with fs and net development):
1) Inside a gigabit LAN there will be, let's say, 10 machines that are
meant to be used as filesystem nodes. Those machines run a daemon in
userspace ("dfsd") and have one or more partitions of physical HDs
dedicated to the "filesystem cluster". So, let's suppose that on every
node we have a /dev/hdb5 with 20GB unused, dedicated to the cluster
("/usr/bin/dfsd -p /dev/hdb5"). This keeps things simple (since we get
raw access to the partition), but we could use files on the local
filesystem too.
2) On the master machine, the DFS kernel module (which declares a
block device like /dev/dfs1) uses broadcast packets (something like
DHCP) to retrieve the list of active nodes on the LAN. So, with 10
machines with 20GB each, we have 200GB of distributed storage over the
network. To keep things simple, let's say that they are addressed in a
serial fashion (requests for 0-20GB go to node1, 20-40GB to node2,
etc). The module is responsible for keeping a pool of TCP connections
with the nodes' daemons, for sending, receiving and parsing the data,
and so on. At this point, no security measures are taken (encryption,
etc).
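Roughly, the serial addressing in 2) is just integer division - a toy
sketch, with illustrative sizes and a hypothetical dfs_map() helper:

#include <stdint.h>

#define NODE_SIZE (20ULL << 30)   /* 20GB (GiB, really) per node */

struct dfs_target {
    unsigned int node;            /* which node's dfsd to talk to */
    uint64_t offset;              /* offset inside its raw partition */
};

/* Map a linear offset in /dev/dfs1 to (node, local offset); the dfsd
 * on that node would then just pread()/pwrite() its partition at the
 * local offset. */
struct dfs_target dfs_map(uint64_t dev_offset)
{
    struct dfs_target t = {
        .node   = (unsigned int)(dev_offset / NODE_SIZE), /* 0-20GB -> node 0 */
        .offset = dev_offset % NODE_SIZE,
    };
    return t;
}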
At this point, I think that we should be able to create a reiserfs fs
on the device and get it running (even if far slower than a local
disk). The second part of the project, which would involve more
serious stuff, could be:
3) Redundancy - losing as little storage capacity as possible while
still being able to safely continue working if one of the nodes is
down. Actually I don't have any clue how to achieve this without
drastically diminishing the storage capacity, but probably there is
some clever way out there =]
4) No masters - each node can have access to the filesystem (the block
device) as if it were an NFS mountpoint (this could be useful somehow
to actual clusters, where you could share not only the processor but
also the HDs of the nodes as a single huge / mountpoint). In this
model, there would be no userspace stuff at all.
What do you think?
Daniel
--
What this world needs is a good five-dollar plasma weapon.
* Re: Sockets inside the kernel or userspace?
From: Evgeniy Polyakov @ 2006-06-30 9:12 UTC
To: Daniel Bonekeeper; +Cc: netdev
On Fri, Jun 30, 2006 at 04:45:54AM -0400, Daniel Bonekeeper (thehazard@gmail.com) wrote:
> 1) Inside a gigabit LAN there will be, let's say, 10 machines that are
> meant to be used as filesystem nodes. Those machines run a daemon in
> userspace ("dfsd") and have one or more partitions of physical HDs
> dedicated to the "filesystem cluster". So, let's suppose that on every
> node we have a /dev/hdb5 with 20GB unused, dedicated to the cluster
> ("/usr/bin/dfsd -p /dev/hdb5"). This keeps things simple (since we get
> raw access to the partition), but we could use files on the local
> filesystem too.
>
> 2) On the master machine, the DFS kernel module (which declares a
> block device like /dev/dfs1) uses broadcast packets (something like
> DHCP) to retrieve the list of active nodes on the LAN. So, with 10
> machines with 20GB each, we have 200GB of distributed storage over the
> network. To keep things simple, let's say that they are addressed in a
> serial fashion (requests for 0-20GB go to node1, 20-40GB to node2,
> etc). The module is responsible for keeping a pool of TCP connections
> with the nodes' daemons, for sending, receiving and parsing the data,
> and so on. At this point, no security measures are taken (encryption,
> etc).
At this point you could simply mount all the remote nodes on one
master and export them over NFS. That is not a distributed FS.
> At this point, I think that we should be able to create a reiserfs fs
> on the device and get it running (even if far slower than a local
> disk). The second part of the project, which would involve more
> serious stuff, could be:
>
> 3) Redundancy - losing as little storage capacity as possible while
> still being able to safely continue working if one of the nodes is
> down. Actually I don't have any clue how to achieve this without
> drastically diminishing the storage capacity, but probably there is
> some clever way out there =]
Several nodes hold the same data, so if one of them fails, data
processing can continue. That means either a tree-like structure where
a local master replicates data between the nodes, or a fully
distributed FS (below).
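As a toy sketch of that idea - the placement policy and the node_*()
transport stubs are invented, the real thing would talk to dfsd over
TCP:

#include <stdint.h>
#include <stdio.h>

#define NODES    10
#define REPLICAS 3

/* Stubs standing in for the per-node transport. */
static int node_write(int node, uint64_t off, const void *buf, uint32_t len)
{
    (void)off; (void)buf; (void)len;
    printf("write -> node %d\n", node);
    return 0;
}

static int node_read(int node, uint64_t off, void *buf, uint32_t len)
{
    (void)off; (void)buf; (void)len;
    return node == 4 ? -1 : 0;          /* pretend node 4 has failed */
}

/* Write each block to REPLICAS consecutive nodes on a ring. */
static int dfs_write(uint64_t off, const void *buf, uint32_t len)
{
    int first = (int)(off % NODES), i, err = 0;

    for (i = 0; i < REPLICAS; i++)
        err |= node_write((first + i) % NODES, off, buf, len);
    return err;
}

/* Any surviving replica can serve the read. */
static int dfs_read(uint64_t off, void *buf, uint32_t len)
{
    int first = (int)(off % NODES), i;

    for (i = 0; i < REPLICAS; i++)
        if (node_read((first + i) % NODES, off, buf, len) == 0)
            return 0;
    return -1;                          /* all replicas are down */
}

int main(void)
{
    char blk[512] = { 0 };

    dfs_write(4, blk, sizeof(blk));     /* lands on nodes 4, 5, 6 */
    return dfs_read(4, blk, sizeof(blk)); /* node 4 is down, node 5 answers */
}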
> 4) No masters - each node can have access to the filesystem (the block
> device) as if it were an NFS mountpoint (this could be useful somehow
> to actual clusters, where you could share not only the processor but
> also the HDs of the nodes as a single huge / mountpoint). In this
> model, there would be no userspace stuff at all.
A fully distributed mode does not even assume the existence of a
"master node", since such a node would quickly become a bottleneck.
Each node might have a list of nodes it synchronizes with, so if one
of the nodes is turned off, the others still hold valid data and the
machine that requested the data can "reconnect" to another node and
get its data. This involves interesting CS questions about
interconnects (trees, rings, multidimensional tori and so on) and
other components of the system.
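As a toy illustration of how the topology decides who synchronizes
with whom (the numbering scheme is invented):

#include <stdio.h>

#define NODES 10

/* Ring: every node mirrors its data to its clockwise successor. */
static int ring_partner(int node)
{
    return (node + 1) % NODES;
}

/* Binary tree: a node replicates down to its children, if any. */
static void tree_partners(int node, int *left, int *right)
{
    *left  = 2 * node + 1 < NODES ? 2 * node + 1 : -1;
    *right = 2 * node + 2 < NODES ? 2 * node + 2 : -1;
}

int main(void)
{
    int l, r, n;

    for (n = 0; n < NODES; n++) {
        tree_partners(n, &l, &r);
        printf("node %d: ring->%d tree->(%d,%d)\n",
               n, ring_partner(n), l, r);
    }
    return 0;
}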
--
Evgeniy Polyakov