* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
[not found] ` <20050327035149.GD4053@g5.random>
@ 2005-03-27 5:48 ` Matt Mackall
2005-03-27 6:04 ` Andrea Arcangeli
2005-03-27 6:33 ` Dmitry Yusupov
0 siblings, 2 replies; 91+ messages in thread
From: Matt Mackall @ 2005-03-27 5:48 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mike Christie, Dmitry Yusupov, open-iscsi, James.Bottomley,
ksummit-2005-discuss, netdev
I'm cc:ing this to netdev, where this discussion really ought to be.
There's a separate networking summit and I suspect most of the
networking heavies aren't reading ksummit-discuss or open-iscsi.
It's getting rather far afield for ksummit-discuss so people should
trim that from follow-ups.
On Sun, Mar 27, 2005 at 05:51:49AM +0200, Andrea Arcangeli wrote:
> On Thu, Mar 24, 2005 at 07:43:41PM -0800, Matt Mackall wrote:
> > There may be network multipath. But I think we can have a single
> > socket mempool per logical device and a single skbuff mempool shared
> > among those sockets.
>
> If we'll have to reserve more than 1 packet per each socket context,
> then the mempool probably can't be shared.
I believe the mempool can be shared among all sockets that represent
the same storage device. Packets out any socket represent progress.
> I wonder if somebody has ever reproduced deadlocks
> by swapping on software-tcp-iscsi.
Yes, done before it was even called iSCSI.
> > And that still leaves us with the lack of buffers to receive ACKs
> > problem, which is perhaps worse.
>
> The mempooling should take care of the acks too.
The receive buffer is allocated at the time we DMA it from the card.
We have no idea of its contents and we won't know what socket mempool
to pull the receive skbuff from until much higher in the network
stack, which could be quite a while later if we're under OOM load. And
we can't have a mempool big enough to handle all the traffic that
might potentially be deferred for softirq processing when we're OOM,
especially at gigabit rates.
I think this is actually the tricky piece of the problem and solving
the socket and send buffer allocation doesn't help until this gets
figured out.
We could perhaps try to address this with another special receive-side
alloc_skb that fails most of the time on OOM but sometimes pulls from
a special reserve.
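Just to make that idea concrete, a minimal sketch of such a helper (the function name, the reserve list and the 1-in-8 policy are all invented for illustration; only alloc_skb() and skb_dequeue() are real):

/* Hypothetical receive-side allocator: fail most of the time under
 * memory pressure, occasionally dip into a small dedicated reserve. */
#include <linux/skbuff.h>

static struct sk_buff_head rx_oom_reserve;	/* refilled from process context */

struct sk_buff *alloc_skb_rx_reserve(unsigned int size, unsigned int gfp_mask)
{
	static unsigned int oom_failures;
	struct sk_buff *skb;

	skb = alloc_skb(size, gfp_mask);	/* the normal atomic attempt */
	if (skb)
		return skb;

	/* Under OOM: shed most traffic right here at the driver ... */
	if ((++oom_failures & 7) != 0)
		return NULL;

	/* ... but now and then hand out a reserve buffer so the ACKs
	 * that let us free memory can still climb the stack. */
	return skb_dequeue(&rx_oom_reserve);
}

A real version would also have to make sure the reserved buffers are large
enough for the device's MTU, and refill the reserve once memory frees up again.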
> Perhaps the mempooling overhead will be too huge to pay for it even when
> it's not necessary, in such case the iscsid will have to pass a new
> bitflag to the socket syscall, when it creates the socket meant to talk
> with the remote disk.
I think we probably attach a mempool to a socket after the fact. And
no, we can't have a mempool attached to every socket.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 5:48 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Matt Mackall
@ 2005-03-27 6:04 ` Andrea Arcangeli
2005-03-27 6:38 ` Matt Mackall
2005-03-27 6:33 ` Dmitry Yusupov
1 sibling, 1 reply; 91+ messages in thread
From: Andrea Arcangeli @ 2005-03-27 6:04 UTC (permalink / raw)
To: Matt Mackall
Cc: Mike Christie, Dmitry Yusupov, open-iscsi, James.Bottomley,
ksummit-2005-discuss, netdev
On Sat, Mar 26, 2005 at 09:48:31PM -0800, Matt Mackall wrote:
> I believe the mempool can be shared among all sockets that represent
> the same storage device. Packets out any socket represent progress.
What's the point of having more than one socket connected to each storage
device anyway?
> Yes, done before it was even called iSCSI.
Ok, theoretical deadlock conditions aren't nice anyway, but knowing this
is a real life problem too makes it more interesting ;).
> The receive buffer is allocated at the time we DMA it from the card.
> We have no idea of its contents and we won't know what socket mempool
> to pull the receive skbuff from until much higher in the network
> stack, which could be quite a while later if we're under OOM load. And
> we can't have a mempool big enough to handle all the traffic that
> might potentially be deferred for softirq processing when we're OOM,
> especially at gigabit rates.
>
> I think this is actually the tricky piece of the problem and solving
> the socket and send buffer allocation doesn't help until this gets
> figured out.
>
> We could perhaps try to address this with another special receive-side
> alloc_skb that fails most of the time on OOM but sometimes pulls from
> a special reserve.
One algorithm to handle this is: after we get the GFP_ATOMIC failure, we
look at all the mempools registered for a certain NIC, and we pick
a random mempool that isn't empty. We use that non-empty mempool to
receive the packet, and we let netif_rx process the packet. Then if,
going up the stack, we find that the packet doesn't belong to that
socket's mempool, we discard the packet and release the RAM back into
the mempool. This should make progress since eventually the right packet
will go into the right mempool.
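Roughly like this, in pseudo-C (every name below is made up to show the flow,
and it picks the first non-empty pool instead of a random one just to keep the
sketch short):

/* Driver side: GFP_ATOMIC failed, borrow from any registered reserve. */
struct sk_buff *rx_borrow_skb(struct net_device *dev, unsigned int size)
{
	struct skb_mempool *pool;			/* hypothetical type */
	struct sk_buff *skb;

	list_for_each_entry(pool, nic_mempools(dev), list) {	/* hypothetical */
		skb = skb_mempool_alloc(pool, size);		/* hypothetical */
		if (skb) {
			skb->rx_reserve = pool;	/* invented field: remember the owner */
			return skb;
		}
	}
	return NULL;
}

/* Protocol side: once the owning socket is known. */
int rx_reserve_check(struct sk_buff *skb, struct sock *sk)
{
	if (!skb->rx_reserve || skb->rx_reserve == sk->sk_reserve)
		return 1;			/* normal skb, or the right pool */

	/* Wrong socket: drop the packet, the RAM goes straight back. */
	skb_mempool_free(skb, skb->rx_reserve);		/* hypothetical */
	return 0;
}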
> > Perhaps the mempooling overhead will be too huge to pay for it even when
> > it's not necessary, in such case the iscsid will have to pass a new
> > bitflag to the socket syscall, when it creates the socket meant to talk
> > with the remote disk.
>
> I think we probably attach a mempool to a socket after the fact. And
I guess you meant before the fact (i.e. before the connection to the
server); anything attached after the fact (whatever the fact is ;) isn't
going to help.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 5:48 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Matt Mackall
2005-03-27 6:04 ` Andrea Arcangeli
@ 2005-03-27 6:33 ` Dmitry Yusupov
2005-03-27 6:46 ` David S. Miller
1 sibling, 1 reply; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-27 6:33 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrea Arcangeli, Mike Christie, open-iscsi@googlegroups.com,
James.Bottomley, ksummit-2005-discuss, netdev
On Sat, 2005-03-26 at 21:48 -0800, Matt Mackall wrote:
> I'm cc:ing this to netdev, where this discussion really ought to be.
> There's a separate networking summit and I suspect most of the
> networking heavies aren't reading ksummit-discuss or open-iscsi.
> It's getting rather far afield for ksummit-discuss so people should
> trim that from follow-ups.
>
> On Sun, Mar 27, 2005 at 05:51:49AM +0200, Andrea Arcangeli wrote:
> > On Thu, Mar 24, 2005 at 07:43:41PM -0800, Matt Mackall wrote:
> > > There may be network multipath. But I think we can have a single
> > > socket mempool per logical device and a single skbuff mempool shared
> > > among those sockets.
> >
> > If we'll have to reserve more than 1 packet per each socket context,
> > then the mempool probably can't be shared.
>
> I believe the mempool can be shared among all sockets that represent
> the same storage device. Packets out any socket represent progress.
>
> > I wonder if somebody has ever reproduced deadlocks
> > by swapping on software-tcp-iscsi.
>
> Yes, done before it was even called iSCSI.
>
> > > And that still leaves us with the lack of buffers to receive ACKs
> > > problem, which is perhaps worse.
> >
> > The mempooling should take care of the acks too.
>
> The receive buffer is allocated at the time we DMA it from the card.
> We have no idea of its contents and we won't know what socket mempool
> to pull the receive skbuff from until much higher in the network
> stack, which could be quite a while later if we're under OOM load. And
> we can't have a mempool big enough to handle all the traffic that
> might potentially be deferred for softirq processing when we're OOM,
> especially at gigabit rates.
>
> I think this is actually the tricky piece of the problem and solving
> the socket and send buffer allocation doesn't help until this gets
> figured out.
>
> We could perhaps try to address this with another special receive-side
> alloc_skb that fails most of the time on OOM but sometimes pulls from
> a special reserve.
Nope, this will not solve the problem on receive, or will only solve it
partially. The right way to solve it would be to provide a special API
that helps re-use the link layer's ring SKBs, i.e. the TCP stack should
call a NIC driver callback after all the SKB data has been successfully
copied to user space. At that point the NIC driver can safely replenish
the HW ring. This way we could avoid most memory allocations on receive.
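Purely as an illustration of that callback idea (none of this exists; the ops
structure and the field on net_device are invented):

/* Hypothetical recycling hook a NIC driver could register. */
struct skb_recycle_ops {
	void (*skb_recycle)(struct net_device *dev, struct sk_buff *skb);
};

/* Called from the TCP receive path once the payload has been copied
 * to user space, in place of the usual kfree_skb(). */
static void rx_skb_consumed(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;

	if (dev && dev->recycle_ops) {			/* invented field */
		/* The driver re-initializes the skb and puts it straight
		 * back on the HW RX ring: no fresh allocation needed. */
		dev->recycle_ops->skb_recycle(dev, skb);
		return;
	}
	kfree_skb(skb);
}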
> > Perhaps the mempooling overhead will be too huge to pay for it even when
> > it's not necessary, in such case the iscsid will have to pass a new
> > bitflag to the socket syscall, when it creates the socket meant to talk
> > with the remote disk.
>
> I think we probably attach a mempool to a socket after the fact. And
> no, we can't have a mempool attached to every socket.
>
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 6:04 ` Andrea Arcangeli
@ 2005-03-27 6:38 ` Matt Mackall
2005-03-27 14:50 ` Andrea Arcangeli
0 siblings, 1 reply; 91+ messages in thread
From: Matt Mackall @ 2005-03-27 6:38 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mike Christie, Dmitry Yusupov, open-iscsi, James.Bottomley,
netdev
On Sun, Mar 27, 2005 at 08:04:03AM +0200, Andrea Arcangeli wrote:
> On Sat, Mar 26, 2005 at 09:48:31PM -0800, Matt Mackall wrote:
> > I believe the mempool can be shared among all sockets that represent
> > the same storage device. Packets out any socket represent progress.
>
> What's the point of having more than one socket connected to each storage
> device anyway?
There may be multiple network addresses (with different network paths)
associated with the same device for purposes of throughput or reliability.
> One algorithm to handle this is: after we get the GFP_ATOMIC failure, we
> look at all the mempools registered for a certain NIC, and we pick
> a random mempool that isn't empty. We use that non-empty mempool to
> receive the packet, and we let netif_rx process the packet. Then if,
> going up the stack, we find that the packet doesn't belong to that
> socket's mempool, we discard the packet and release the RAM back into
> the mempool. This should make progress since eventually the right packet
> will go into the right mempool.
What if the number of packets queued by the time we reach the softirq
side of the stack exceeds the available buffers?
Imagine that we've got heavy DNS and iSCSI on the same box and that the box
gets wedged in OOM such that it can't answer DNS queries. But we can't
distinguish at receive time between DNS and iSCSI. As iSCSI is TCP, it
will send repeat ACKs at relatively long intervals but the DNS clients
will potentially continue to hammer the machine, filling the reserve
buffers and starving out the ACKs. We've got to essentially be able to
say "we are OOM, drop all traffic to sockets not flagged for storage"
and do so quickly enough that we can eventually get the ACKs.
> > > Perhaps the mempooling overhead will be too huge to pay for it even when
> > > it's not necessary, in such case the iscsid will have to pass a new
> > > bitflag to the socket syscall, when it creates the socket meant to talk
> > > with the remote disk.
> >
> > I think we probably attach a mempool to a socket after the fact. And
>
> I guess you meant before the fact (i.e. before the connection to the
> server), anything attached after the fact (whatever the fact is ;) isn't
> going to help.
After the socket is created, but before we commit to pumping storage
data through it (iSCSI has multiple phases). A privileged
setsockopt-like interface ought to suffice. Or something completely
kernel internal.
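Something like this, just as a sketch (the helper, the per-socket reserve
fields, skb_mempool_create() and the SOCK_STORAGE flag are invented; capable()
and sock_set_flag() are the real primitives):

/* Illustrative only: mark an existing, connected socket as a storage
 * socket and attach its emergency pools. */
static int sock_mark_storage(struct sock *sk, int reserve_skbs)
{
	if (!capable(CAP_NET_ADMIN))
		return -EPERM;

	sk->sk_snd_reserve = skb_mempool_create(reserve_skbs);	/* invented */
	if (!sk->sk_snd_reserve)
		return -ENOMEM;
	sk->sk_rcv_reserve = skb_mempool_create(reserve_skbs);	/* invented */
	if (!sk->sk_rcv_reserve) {
		skb_mempool_destroy(sk->sk_snd_reserve);	/* invented */
		return -ENOMEM;
	}
	sock_set_flag(sk, SOCK_STORAGE);			/* invented flag */
	return 0;
}

iscsid would call the setsockopt-style wrapper right after login completes; a
purely in-kernel user could call the helper directly.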
Which reminds me: FUSE and friends presumably have a very similar set
of problems.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 6:33 ` Dmitry Yusupov
@ 2005-03-27 6:46 ` David S. Miller
2005-03-27 7:05 ` Dmitry Yusupov
` (4 more replies)
0 siblings, 5 replies; 91+ messages in thread
From: David S. Miller @ 2005-03-27 6:46 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: mpm, andrea, michaelc, open-iscsi, James.Bottomley,
ksummit-2005-discuss, netdev
On Sat, 26 Mar 2005 22:33:01 -0800
Dmitry Yusupov <dmitry_yus@yahoo.com> wrote:
> i.e. the TCP stack should call a NIC driver callback after all the SKB data
> has been successfully copied to user space. At that point the NIC driver
> can safely replenish the HW ring. This way we could avoid most memory
> allocations on receive.
How does this solve your problem? This is just simple SKB recycling,
and it's a pretty old idea.
TCP packets can be held on receive for arbitrary amounts of time.
This is especially true if data is received out of order or when
packets are dropped. We can't even wake up the user until the
holes in the sequence space are filled.
Even if data is received properly and in order, there are no hard
guarantees about when the user will get back onto the CPU to
get the data copied to it.
During these gaps in time, you will need to keep your HW receive
ring populated with packets.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 6:46 ` David S. Miller
@ 2005-03-27 7:05 ` Dmitry Yusupov
2005-03-27 7:57 ` David S. Miller
2005-03-27 21:14 ` Alex Aizman
` (3 subsequent siblings)
4 siblings, 1 reply; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-27 7:05 UTC (permalink / raw)
To: David S. Miller
Cc: mpm, andrea, michaelc, open-iscsi@googlegroups.com,
James.Bottomley, ksummit-2005-discuss, netdev
On Sat, 2005-03-26 at 22:46 -0800, David S. Miller wrote:
> On Sat, 26 Mar 2005 22:33:01 -0800
> Dmitry Yusupov <dmitry_yus@yahoo.com> wrote:
>
> > i.e. the TCP stack should call a NIC driver callback after all the SKB data
> > has been successfully copied to user space. At that point the NIC driver
> > can safely replenish the HW ring. This way we could avoid most memory
> > allocations on receive.
>
> How does this solve your problem? This is just simple SKB recycling,
> and it's a pretty old idea.
I know, it is a very old idea.
> TCP packets can be held on receive for arbitrary amounts of time.
I'm thinking about mixing the existing way of doing things with guaranteed
SKB recycling. It should at least help storage stacks to make progress on
receive.
> This is especially true if data is received out of order or when
> packets are dropped. We can't even wake up the user until the
> holes in the sequence space are filled.
>
> Even if data is received properly and in order, there are no hard
> guarantees about when the user will get back onto the CPU to
> get the data copied to it.
>
> During these gaps in time, you will need to keep your HW receive
> ring populated with packets.
Ethernet flow control must take care of this case.
If the driver's replenish logic could mix alloc_skb/netif_rx and SKB
recycling, then pause frames should never happen, even with gige+
interfaces.
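A sketch of what such a refill loop could look like in a driver (the ring
helpers and the recycle list are made up; dev_alloc_skb() and skb_dequeue()
are real):

/* Illustrative RX refill: prefer fresh buffers, fall back to recycled ones. */
static void nic_rx_refill(struct nic_priv *np)		/* hypothetical driver */
{
	struct sk_buff *skb;

	while (nic_ring_has_room(np)) {			/* hypothetical */
		skb = dev_alloc_skb(np->rx_buf_sz);
		if (!skb) {
			/* skbs the stack handed back after their data was
			 * copied to user space (the recycling above). */
			skb = skb_dequeue(&np->rx_recycle);
			if (!skb)
				break;	/* truly dry: HW may emit pause frames */
		}
		nic_post_rx_buffer(np, skb);		/* hypothetical */
	}
}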
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 7:05 ` Dmitry Yusupov
@ 2005-03-27 7:57 ` David S. Miller
2005-03-27 8:18 ` Dmitry Yusupov
0 siblings, 1 reply; 91+ messages in thread
From: David S. Miller @ 2005-03-27 7:57 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: mpm, andrea, michaelc, open-iscsi, James.Bottomley,
ksummit-2005-discuss, netdev
On Sat, 26 Mar 2005 23:05:30 -0800
Dmitry Yusupov <dmitry_yus@yahoo.com> wrote:
> > During these gaps in time, you will need to keep your HW receive
> > ring populated with packets.
>
> Ethernet flow control must take care of this case.
>
> If the driver's replenish logic could mix alloc_skb/netif_rx and SKB
> recycling, then pause frames should never happen, even with gige+
> interfaces.
I don't see what the big deal is if pause frames
are generated when the system is low on atomic memory
and RX allocations thus fail.
SKB recycling doesn't get the user on the cpu faster
to receive the data. I don't understand how you expect
the recycling to be guaranteed except perhaps as a special
case for iSCSI taking in the TCP packets in the ->data_ready()
callback. In that case it's exactly that, a special case
hack, and not something generically useful at all.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 7:57 ` David S. Miller
@ 2005-03-27 8:18 ` Dmitry Yusupov
2005-03-27 18:26 ` Mike Christie
0 siblings, 1 reply; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-27 8:18 UTC (permalink / raw)
To: open-iscsi@googlegroups.com
Cc: mpm, andrea, michaelc, James.Bottomley, ksummit-2005-discuss,
netdev
On Sat, 2005-03-26 at 23:57 -0800, David S. Miller wrote:
> On Sat, 26 Mar 2005 23:05:30 -0800
> Dmitry Yusupov <dmitry_yus@yahoo.com> wrote:
>
> > > During these gaps in time, you will need to keep your HW receive
> > > ring populated with packets.
> >
> > Ethernet flow control must take care of this case.
> >
> > If the driver's replenish logic could mix alloc_skb/netif_rx and SKB
> > recycling, then pause frames should never happen, even with gige+
> > interfaces.
>
> I don't see what the big deal is if pause frames
> are generated when the system is low on atomic memory
> and RX allocations thus fail.
Maybe not a big deal. But it is a very interesting case when the OOM is
causing paging in/out and the swap device is on the same network under
iSCSI control (disk-less setups). Having reliable receive in that case is
important for making progress on READ operations.
> SKB recycling doesn't get the user on the cpu faster
> to receive the data. I don't understand how you expect
> the recycling to be guaranteed except perhaps as a special
> case for iSCSI taking in the TCP packets in the ->data_ready()
> callback. In that case it's exactly that, a special case
> hack, and not something generically useful at all.
Right, this is what the Open-iSCSI project is using for READs.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 6:38 ` Matt Mackall
@ 2005-03-27 14:50 ` Andrea Arcangeli
0 siblings, 0 replies; 91+ messages in thread
From: Andrea Arcangeli @ 2005-03-27 14:50 UTC (permalink / raw)
To: Matt Mackall
Cc: Mike Christie, Dmitry Yusupov, open-iscsi, James.Bottomley,
netdev
On Sat, Mar 26, 2005 at 10:38:48PM -0800, Matt Mackall wrote:
> What if the number of packets queued by the time we reach the softirq
> side of the stack exceeds the available buffers?
That means they weren't for the iSCSI socket and they will be discarded
right away (instead of being queued in the sock).
> Imagine that we've got heavy DNS and iSCSI on the same box and that the box
> gets wedged in OOM such that it can't answer DNS queries. But we can't
> distinguish at receive time between DNS and iSCSI. As iSCSI is TCP, it
We don't care about performance here; if we're under a flood
attack it'll take a long time, but as long as you keep discarding them
right away as soon as you notice the reservation wasn't for the current
sock, it should keep making progress and not deadlock anymore.
This is a deadlock vs. non-deadlock issue; how fast the other packets
arrive is a secondary issue, we're in a slow path.
> will send repeat ACKs at relatively long intervals but the DNS clients
> will potentially continue to hammer the machine, filling the reserve
> buffers and starving out the ACKs. We've got to essentially be able to
They won't empty it, since they will be released immediately. From the
ACK standpoint it'll be like packet loss due to network congestion; in fact
this sounds close to network congestion.
> say "we are OOM, drop all traffic to sockets not flagged for storage"
> and do so quickly enough that we can eventually get the ACKs.
To do that you'd have to reserve a NIC for that. But the whole point of the
algorithm I proposed is to work fine with a shared NIC and avoid the deadlock
too (it won't resolve it in a highly performant way, but the point is that
it won't be a deadlock condition anymore). And if the reserved buffer is
huge you likely won't lose many packets at all.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 8:18 ` Dmitry Yusupov
@ 2005-03-27 18:26 ` Mike Christie
2005-03-27 18:31 ` David S. Miller
2005-03-27 18:47 ` Dmitry Yusupov
0 siblings, 2 replies; 91+ messages in thread
From: Mike Christie @ 2005-03-27 18:26 UTC (permalink / raw)
To: open-iscsi; +Cc: mpm, andrea, James.Bottomley, ksummit-2005-discuss, netdev
Dmitry Yusupov wrote:
> On Sat, 2005-03-26 at 23:57 -0800, David S. Miller wrote:
>
>>On Sat, 26 Mar 2005 23:05:30 -0800
>>Dmitry Yusupov <dmitry_yus@yahoo.com> wrote:
>>
>>
>>>>During these gaps in time, you will need to keep your HW receive
>>>>ring populated with packets.
>>>
>>>Ethernet flow control must take care of this case.
>>>
>>>If the driver's replenish logic could mix alloc_skb/netif_rx and SKB
>>>recycling, then pause frames should never happen, even with gige+
>>>interfaces.
>>
>>I don't see what the big deal is if pause frames
>>are generated when the system is low on atomic memory
>>and RX allocations thus fail.
>
>
> Maybe not a big deal. But it is a very interesting case when the OOM is
> causing paging in/out and the swap device is on the same network under
> iSCSI control (disk-less setups). Having reliable receive in that case is
> important for making progress on READ operations.
Reliable receive is critical for WRITEs. Even if the WRITE is executed
successfully on the remote device, if we cannot receive the return status
from the device the operation will fail at the iscsi driver side due to a
SCSI timeout.
>
>
>>SKB recycling doesn't get the user on the cpu faster
>>to receive the data. I don't understand how you expect
>>the recycling to be guaranteed except perhaps as a special
>>case for iSCSI taking in the TCP packets in the ->data_ready()
>>callback. In that case it's exactly that, a special case
>>hack, and not something generically useful at all.
>
>
> Right, this is what the Open-iSCSI project is using for READs.
>
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 18:26 ` Mike Christie
@ 2005-03-27 18:31 ` David S. Miller
2005-03-27 19:58 ` Matt Mackall
2005-03-27 21:49 ` Dmitry Yusupov
2005-03-27 18:47 ` Dmitry Yusupov
1 sibling, 2 replies; 91+ messages in thread
From: David S. Miller @ 2005-03-27 18:31 UTC (permalink / raw)
To: Mike Christie
Cc: open-iscsi, mpm, andrea, James.Bottomley, ksummit-2005-discuss,
netdev
On Sun, 27 Mar 2005 10:26:29 -0800
Mike Christie <michaelc@cs.wisc.edu> wrote:
> Reliable receive is critical for WRITEs. Even if the WRITE is executed
> successfully on the remote device, if we cannot receive the return status
> from the device the operation will fail at the iscsi driver side due to a
> SCSI timeout.
I keep hearing this word "reliable", it means something very
different for TCP over a transport like IP than it does
for the SCSI layer.
It is, in fact, the whole difficulty of implementing iSCSI:
being able to cope with this difference in expectations.
All I can see is that the SCSI layer's timeout is inappropriate
for something like iSCSI, not that TCP or networking needs
to change in some way.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 18:26 ` Mike Christie
2005-03-27 18:31 ` David S. Miller
@ 2005-03-27 18:47 ` Dmitry Yusupov
1 sibling, 0 replies; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-27 18:47 UTC (permalink / raw)
To: open-iscsi@googlegroups.com
Cc: mpm, andrea, James.Bottomley, ksummit-2005-discuss, netdev
On Sun, 2005-03-27 at 10:26 -0800, Mike Christie wrote:
> Dmitry Yusupov wrote:
> > On Sat, 2005-03-26 at 23:57 -0800, David S. Miller wrote:
> >
> >>On Sat, 26 Mar 2005 23:05:30 -0800
> >>Dmitry Yusupov <dmitry_yus@yahoo.com> wrote:
> >>
> >>
> >>>>During these gaps in time, you will need to keep your HW receive
> >>>>ring populated with packets.
> >>>
> >>>Ethernet flow control must take care of this case.
> >>>
> >>>If the driver's replenish logic could mix alloc_skb/netif_rx and SKB
> >>>recycling, then pause frames should never happen, even with gige+
> >>>interfaces.
> >>
> >>I don't see what the big deal is if pause frames
> >>are generated when the system is low on atomic memory
> >>and RX allocations thus fail.
> >
> >
> > Maybe not a big deal. But it is a very interesting case when the OOM is
> > causing paging in/out and the swap device is on the same network under
> > iSCSI control (disk-less setups). Having reliable receive in that case is
> > important for making progress on READ operations.
>
> Reliable receive is critical for WRITEs. Even if the WRITE is executed
> successfully on the remote device, if we cannot receive the return status
> from the device the operation will fail at the iscsi driver side due to a
> SCSI timeout.
Of course; I forgot to mention this in the first place. A WRITE needs a
successful SCSI response too; until then it will not be returned back to
the SCSI Mid-Layer.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 18:31 ` David S. Miller
@ 2005-03-27 19:58 ` Matt Mackall
2005-03-27 21:49 ` Dmitry Yusupov
1 sibling, 0 replies; 91+ messages in thread
From: Matt Mackall @ 2005-03-27 19:58 UTC (permalink / raw)
To: David S. Miller
Cc: Mike Christie, open-iscsi, andrea, James.Bottomley,
ksummit-2005-discuss, netdev
On Sun, Mar 27, 2005 at 10:31:15AM -0800, David S. Miller wrote:
> On Sun, 27 Mar 2005 10:26:29 -0800
> Mike Christie <michaelc@cs.wisc.edu> wrote:
>
> > Reliable receive is critical for WRITEs. Even if the WRITE is executed
> > successfully on the remote device, if we cannot receive the return status
> > from the device the operation will fail at the iscsi driver side due to a
> > SCSI timeout.
>
> I keep hearing this word "reliable", it means something very
> different for TCP over a transport like IP than it does
> for the SCSI layer.
This has nothing to do with the specifics of TCP or IP.
We are out of memory. To free memory, we must be able to send N
packets and receive M acknowledgements. Sending and receiving packets
requires allocations - if we cannot allocate, we are permanently
wedged.
This strongly suggests we need to have private reserves that the
network layer knows how to access to fulfill higher-level write
requests. A closely analogous situation exists with regular SCSI,
where the transport is also potentially lossy.
(In the iSCSI case, it's somewhat worse: we may need to open a new
socket or have one in reserve.)
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 91+ messages in thread
* RE: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 6:46 ` David S. Miller
2005-03-27 7:05 ` Dmitry Yusupov
@ 2005-03-27 21:14 ` Alex Aizman
[not found] ` <20050327211506.85EDA16022F6@mx1.suse.de>
` (2 subsequent siblings)
4 siblings, 0 replies; 91+ messages in thread
From: Alex Aizman @ 2005-03-27 21:14 UTC (permalink / raw)
To: open-iscsi
Cc: mpm, andrea, michaelc, James.Bottomley, netdev,
'David S. Miller', ksummit-2005-discuss
David S. Miller writes:
>
> On Sat, 26 Mar 2005 22:33:01 -0800
> Dmitry Yusupov <dmitry_yus@yahoo.com> wrote:
>
> > i.e. the TCP stack should call a NIC driver callback after all the SKB data
> > has been successfully copied to user space. At that point the NIC driver
> > can safely replenish the HW ring. This way we could avoid most memory
> > allocations on receive.
>
> How does this solve your problem? This is just simple SKB
> recycling, and it's a pretty old idea.
>
> TCP packets can be held on receive for arbitrary amounts of time.
>
> This is especially true if data is received out of order or
> when packets are dropped. We can't even wake up the user
> until the holes in the sequence space are filled.
>
> Even if data is received properly and in order, there are no
> hard guarantees about when the user will get back onto the
> CPU to get the data copied to it.
>
> During these gaps in time, you will need to keep your HW
> receive ring populated with packets.
Here's the way I see it.
1) There are iSCSI connections that should be "protected", resources-wise.
Examples: remote swap device, bank accounts database on RAID accessed via
iSCSI, etc.
2) There are two ways to protect the "protected" connections. One "Big
Brother" like way is a centralized Resource Manager that performs a fully
deterministic resource accounting throughout the system, all the way from
NIC descriptors and on-chip memory up to iSCSI buffers for Data-Out headers.
3) The 2nd way is *awareness* of the "protected" connections propagated
throughout the system, along with incremental implementation of more
sophisticated recovery schemes.
4) The Resource Manager could be used in the following way. At session open
time iSCSI control plane calculates iSCSI and TCP resources that should be
available at all times. The calculation is done based on: the number of SCSI
commands to be processed in parallel (the 'can_queue'), the maximum size of
the SCSI payload in the SG, the negotiated maximum number of outstanding
R2Ts, sizes of Immediate and FirstBurst data.
5) If the Resource Manager says there are not enough resources, iSCSI fails
the session open (a rough sketch of such a check follows after this list).
This is better than getting into trouble well into runtime.
6) For example: to transmit 'can_queue' commands, iSCSI needs N skbufs.
Let's say all N commands are transmitted in a burst, and just one of these N
gets ack-ed by the Target (via StatSN). In the fully deterministic system
this does not necessarily mean that the scsi-ml can now send one command -
because the full condition also involves recycling of the skbuf(s) used for
transmitting this one completed command. And although it is hard to imagine
that the command gets fully done by the remote target without Tx buffers
getting recycled, the theoretical chance exists (e.g., the NIC is slow or
the driver has a bad Tx recycling implementation), and the fully
deterministic scheme should take it into account.
7) Therefore, prior to calling scsi_done() iSCSI asks Resource Manager
whether all the TCP etc. resources used for this command are already
recycled. If not, the scsi_done() gets postponed. In addition, iSCSI
"complains" to Resource Manager that it enters slow path because of this,
which could prompt the latter to take an action. (End of the example).
8) If we agree to declare some connections "resource-protected", it would
immediately mean that there are possibly other connections that are not
(resource-protected). Which in turn gives the Resource Manager a flexibility
to OOM-kill those unprotected connections and cannibalize the corresponding
resources for the protected ones.
9) Without some awareness of the resource-protected connections, and without
some kind of resource counting at runtime (let it be partial and incomplete
for starters) - the only remaining way for customers that require HA (High
Availability) is to over-engineer: use 64GB RAM, TBs of disk space, etc.
Which is probably not the end of the world as long as the prices go down..
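To make 4)-6) a bit more concrete, a very rough sketch of that session-open
calculation (all names and the formula are illustrative only; a real
accounting would be considerably more careful):

/* Illustrative worst-case reservation at iSCSI session open. */
struct iscsi_resv_params {		/* hypothetical */
	int can_queue;			/* commands in flight at once */
	int max_payload;		/* largest SG payload, in bytes */
	int max_r2t;			/* negotiated MaxOutstandingR2T */
	int mss;			/* path MSS, for the segment count */
};

static int iscsi_session_reserve(struct iscsi_resv_params *p)
{
	int segs_per_cmd = (p->max_payload + p->mss - 1) / p->mss;
	int pdus_per_cmd = 1 + p->max_r2t;	/* command PDU + Data-Out bursts */
	int skbs_needed  = p->can_queue * (segs_per_cmd + pdus_per_cmd);

	/* Hypothetical Resource Manager call: take the reservation now,
	 * or refuse the session open instead of failing at runtime. */
	return rm_reserve(RM_SKBUFS, skbs_needed);
}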
Alex
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 18:31 ` David S. Miller
2005-03-27 19:58 ` Matt Mackall
@ 2005-03-27 21:49 ` Dmitry Yusupov
1 sibling, 0 replies; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-27 21:49 UTC (permalink / raw)
To: open-iscsi@googlegroups.com
Cc: Mike Christie, mpm, andrea, James.Bottomley, ksummit-2005-discuss,
netdev
On Sun, 2005-03-27 at 10:31 -0800, David S. Miller wrote:
> On Sun, 27 Mar 2005 10:26:29 -0800
> Mike Christie <michaelc@cs.wisc.edu> wrote:
>
> > Reliable receive is critical for WRITEs. Even if the WRITE is executed
> > successfully on the remote device, if we cannot receive the return status
> > from the device the operation will fail at the iscsi driver side due to a
> > SCSI timeout.
>
> I keep hearing this word "reliable", it means something very
> different for TCP over a transport like IP than it does
> for the SCSI layer.
>
> It is, in fact, the whole difficulty of implementing iSCSI:
> being able to cope with this difference in expectations.
Actually, you are absolutely right!
There are two ways to achieve reasonable "reliability" of the iSCSI
transport:
1) ERL=2 (see RFC 3720) + multiple connections (read: a single iSCSI session
with multiple connections)
2) ERL=0 + device multipath (read: multiple iSCSI sessions with a single
connection per session)
I think we should slow down on the "reliability" feature since 1) and 2)
basically cover the case. All we need is to make sure that deadlock
situations during OOM never happen.
> All I can see is that the SCSI layer's timeout is inappropriate
> for something like iSCSI, not that TCP or networking needs
> to change in some way.
Right. It looks like there is a way to "postpone" the SCSI timeout until
the iSCSI session finally recovers.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
[not found] ` <20050327211506.85EDA16022F6@mx1.suse.de>
@ 2005-03-28 0:15 ` Andrea Arcangeli
0 siblings, 0 replies; 91+ messages in thread
From: Andrea Arcangeli @ 2005-03-28 0:15 UTC (permalink / raw)
To: Alex Aizman
Cc: open-iscsi, mpm, michaelc, James.Bottomley, netdev,
'David S. Miller', ksummit-2005-discuss
On Sun, Mar 27, 2005 at 01:14:42PM -0800, Alex Aizman wrote:
> 5) If Resource manager says there is not enough resources, iSCSI fails
> session open. This is better than to get in trouble well into runtime.
Yes, this is the concept we were calling mempooling in earlier emails:
reserve the ram during session open and abort before starting up if we
fail. The kernel already does this in all I/O places.
> 9) Without some awareness of the resource-protected connections, and without
> some kind of resource counting at runtime (let it be partial and incomplete
> for starters) - the only remaining way for customers that require HA (High
> Availability) is to over-engineer: use 64GB RAM, TBs of disk space, etc.
... and most important, set freepages.min to 1G or so (assuming a 64bit
arch of course). It's _only_ freepages.min that matters; the total RAM
doesn't matter at all, since you can mark it all dirty in a few seconds
with MAP_SHARED and then it will be lost (i.e. unfreeable) until iSCSI
can get the write to succeed.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 6:46 ` David S. Miller
` (2 preceding siblings ...)
[not found] ` <20050327211506.85EDA16022F6@mx1.suse.de>
@ 2005-03-28 3:54 ` Rik van Riel
2005-03-28 4:34 ` David S. Miller
` (2 more replies)
2005-03-28 19:45 ` Roland Dreier
4 siblings, 3 replies; 91+ messages in thread
From: Rik van Riel @ 2005-03-28 3:54 UTC (permalink / raw)
To: David S. Miller
Cc: Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
James.Bottomley, ksummit-2005-discuss, netdev
On Sat, 26 Mar 2005, David S. Miller wrote:
> How does this solve your problem? This is just simple SKB recycling,
> and it's a pretty old idea.
>
> TCP packets can be held on receive for arbitrary amounts of time.
We shouldn't do that when we're really really low on
memory. I envision something like this:
1) have iSCSI, NFS, etc. open their sockets with a socket
option that indicates this is a VM deadlock sensitive
socket (SO_MEMALLOC?) - these sockets get two mempools,
one for sending and one for receiving
2) have a global emergency mempool available to receive network
packets when GFP_ATOMIC fails - this is useful since we don't
know who a packet is for when we get the NIC interrupt, and
it's easy to have just one pool to check
3) when a packet is received through this mempool, check
whether the packet is for an SO_MEMALLOC socket
==> if not, discard the packet, free the memory
4) if the packet is for an SO_MEMALLOC socket, and that
socket has space left in its own receiving mempool,
and the packet is not out of order, then transfer the
data to the socket
==> at this point, the space in the global network
receive mempool can be freed again
5) if we cannot handle the packet, drop it
This way:
1) memory critical sockets are protected from other network traffic
2) memory critical sockets are protected from each other
3) all memory critical sockets should be able to make progress
The only thing left would be the memory needed to do reconnects
to the storage device while in this situation. I suspect we may
be able to keep that out of the network layer, if we allow drivers
like iSCSI to pass in their own memory reserve to get the mempool
populated.
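In rough pseudo-C, the receive half of that looks something like this
(SOCK_MEMALLOC, the pools and all their helpers are names invented for the
sketch; sock_flag() and kfree_skb() are real):

/* Step 2: one global emergency pool used when GFP_ATOMIC fails. */
static struct sk_buff_head net_emergency_pool;

/* Steps 3-5: run once the owning socket is known. Returns non-zero if
 * the packet was consumed (kept or dropped) by the emergency path. */
static int memalloc_rx(struct sk_buff *skb, struct sock *sk)
{
	if (!skb_is_emergency(skb))			/* invented test */
		return 0;				/* normal path */

	if (!sock_flag(sk, SOCK_MEMALLOC)) {		/* the proposed flag */
		kfree_skb(skb);				/* step 3: drop, free */
		return 1;
	}

	if (sk_rcv_reserve_space(sk) && !out_of_order(sk, skb)) {
		sk_rcv_reserve_charge(sk, skb);		/* step 4: hand the data
							 * to the socket's pool */
		emergency_pool_refund();		/* ... and refill the
							 * global pool's slot */
		return 1;
	}

	kfree_skb(skb);					/* step 5: drop it */
	return 1;
}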
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 3:54 ` Rik van Riel
@ 2005-03-28 4:34 ` David S. Miller
2005-03-28 4:50 ` Rik van Riel
2005-03-28 6:58 ` Alex Aizman
2005-03-28 16:12 ` Andi Kleen
2 siblings, 1 reply; 91+ messages in thread
From: David S. Miller @ 2005-03-28 4:34 UTC (permalink / raw)
To: Rik van Riel
Cc: dmitry_yus, mpm, andrea, michaelc, open-iscsi, James.Bottomley,
ksummit-2005-discuss, netdev
On Sun, 27 Mar 2005 22:54:11 -0500 (EST)
Rik van Riel <riel@redhat.com> wrote:
> We shouldn't do that when we're really really low on
> memory. I envision something like this:
...
> (SO_MEMALLOC?)
Interesting top-down scheme.
We could also make a way to adjust the GFP_ATOMIC reserve
thresholds too. That seems a bit more generic.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 4:34 ` David S. Miller
@ 2005-03-28 4:50 ` Rik van Riel
0 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2005-03-28 4:50 UTC (permalink / raw)
To: David S. Miller
Cc: dmitry_yus, mpm, andrea, michaelc, open-iscsi, James.Bottomley,
ksummit-2005-discuss, netdev
On Sun, 27 Mar 2005, David S. Miller wrote:
> On Sun, 27 Mar 2005 22:54:11 -0500 (EST)
> Rik van Riel <riel@redhat.com> wrote:
>
> > (SO_MEMALLOC?)
>
> Interesting top-down scheme.
>
> We could also make a way to adjust the GFP_ATOMIC reserve
> thresholds too. That seems a bit more generic.
That might work, as long as we can guarantee that each
SO_MEMALLOC socket has at least N amount of memory available
just for itself, so that none of these sockets can deadlock.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* RE: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 3:54 ` Rik van Riel
2005-03-28 4:34 ` David S. Miller
@ 2005-03-28 6:58 ` Alex Aizman
2005-03-28 16:12 ` Andi Kleen
2 siblings, 0 replies; 91+ messages in thread
From: Alex Aizman @ 2005-03-28 6:58 UTC (permalink / raw)
To: open-iscsi, 'David S. Miller'
Cc: 'Dmitry Yusupov', mpm, andrea, michaelc, James.Bottomley,
ksummit-2005-discuss, netdev
Rik van Riel wrote:
>
> 1) have iSCSI, NFS, etc. open their sockets with a socket
> option that indicates this is a VM deadlock sensitive
> socket (SO_MEMALLOC?) - these sockets get two mempools,
> one for sending and one for receiving
> 2) have a global emergency mempool available to receive network
> packets when GFP_ATOMIC fails - this is useful since we don't
> know who a packet is for when we get the NIC interrupt, and
> it's easy to have just one pool to check
> 3) when a packet is received through this mempool, check
> whether the packet is for an SO_MEMALLOC socket
> ==> if not, discard the packet, free the memory
> 4) if the packet is for an SO_MEMALLOC socket, and that
> socket has space left in its own receiving mempool,
> and the packet is not out of order, then transfer the
> data to the socket
> ==> at this point, the space in the global network
> receive mempool can be freed again
> 5) if we cannot handle the packet, drop it
>
Let's say, there are only iSCSI and NFS sockets (no UDP, which is some
relief :-), and each opened with SO_MEMALLOC. The sockets are allowed to
oversubscribe (via tcp_rmem, tcp_wmem etc. defaults and socket options),
which means: the total amount of mempools memory can get beyond physically
available. It often does, actually. Which means: the 5) on the list above.
Once we allow for the non-determinism of an occasional packet drop, there's
a chance for the retransmission to not go through.
Otherwise, it's a great incremental step. The stack (for starters, and
ignoring for now resources of the NIC, iSCSI itself, etc.) needs to be aware
that some connections are more important than others. Some are
"resource-protected", others not. This is a useful piece of information.
Alex
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 3:54 ` Rik van Riel
2005-03-28 4:34 ` David S. Miller
2005-03-28 6:58 ` Alex Aizman
@ 2005-03-28 16:12 ` Andi Kleen
2005-03-28 16:22 ` Andrea Arcangeli
` (3 more replies)
2 siblings, 4 replies; 91+ messages in thread
From: Andi Kleen @ 2005-03-28 16:12 UTC (permalink / raw)
To: Rik van Riel
Cc: Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
James.Bottomley, ksummit-2005-discuss, netdev
Rik van Riel <riel@redhat.com> writes:
> one for sending and one for receiving
> 2) have a global emergency mempool available to receive network
> packets when GFP_ATOMIC fails - this is useful since we don't
> know who a packet is for when we get the NIC interrupt, and
> it's easy to have just one pool to check
This does not work because mempools assume you can sleep,
and in most NIC drivers you can't while doing RX refill.
The NIC drivers can be rewritten to do this refilling in
a workqueue. But it is not clear it is useful anyway, because
Linux failing to allocate a buffer is no different from
the network overflowing the hardware queue of the network
device, which Linux cannot do anything about.
Basically a network consists of lots of interconnected
queues, and even if you try to make the Linux specific
side of the queue reliable there are lots of other queues
that can still lose packets.
With TCP that is no problem of course because in case of
a packet loss the packet is just retransmitted.
So in short using mempools on receiving is not needed.
Now memory allocation for writing is a different chapter.
You cannot recover from a lost write.
The kernel currently even goes into endless loops in this
case (e.g. on TCP FIN allocation)
With the exception of routing, the allocation fortunately
usually happens in:
- socket context: great, you can use a per-socket mempool
- thread context: you can sleep
With routing that is not the case, but it does not matter
because it typically does not allocate new packets, but
just sends out existing ones.
In short you need mempool only for TX. With some luck you even only
need them for the skbuff head and a small header buffer, since I would
expect iscsi TX to be typically zero copy for data and just passing
struct page *s around.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 16:12 ` Andi Kleen
@ 2005-03-28 16:22 ` Andrea Arcangeli
2005-03-28 16:24 ` Rik van Riel
` (2 subsequent siblings)
3 siblings, 0 replies; 91+ messages in thread
From: Andrea Arcangeli @ 2005-03-28 16:22 UTC (permalink / raw)
To: Andi Kleen
Cc: Rik van Riel, Dmitry Yusupov, mpm, michaelc, open-iscsi,
James.Bottomley, ksummit-2005-discuss, netdev
On Mon, Mar 28, 2005 at 06:12:39PM +0200, Andi Kleen wrote:
> So in short using mempools on receiving is not needed.
I think you are assuming that there's still some atomic memory available
sometime in the future to allocate the skb for the ACK; this isn't
necessarily true.
I outlined an algorithm that, thanks to proper mempool-like reservation and
random picking among all the mempools registered on a single NIC, will avoid
the deadlock for receive. The fewer mempools there are and the bigger they
are, the faster it will recover.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 16:12 ` Andi Kleen
2005-03-28 16:22 ` Andrea Arcangeli
@ 2005-03-28 16:24 ` Rik van Riel
2005-03-29 15:11 ` Andi Kleen
2005-03-28 16:28 ` James Bottomley
2005-03-28 16:37 ` Dmitry Yusupov
3 siblings, 1 reply; 91+ messages in thread
From: Rik van Riel @ 2005-03-28 16:24 UTC (permalink / raw)
To: Andi Kleen
Cc: Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
James.Bottomley, ksummit-2005-discuss, netdev
On Mon, 28 Mar 2005, Andi Kleen wrote:
> So in short using mempools on receiving is not needed.
It is, because you have to ensure that the memory that's
needed to receive network packets isn't tied up receiving
packets for non-critical sockets, which would leave the
critical sockets deadlocked.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 16:12 ` Andi Kleen
2005-03-28 16:22 ` Andrea Arcangeli
2005-03-28 16:24 ` Rik van Riel
@ 2005-03-28 16:28 ` James Bottomley
2005-03-29 15:20 ` Andi Kleen
2005-03-28 16:37 ` Dmitry Yusupov
3 siblings, 1 reply; 91+ messages in thread
From: James Bottomley @ 2005-03-28 16:28 UTC (permalink / raw)
To: Andi Kleen
Cc: Rik van Riel, Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
ksummit-2005-discuss, netdev
On Mon, 2005-03-28 at 18:12 +0200, Andi Kleen wrote:
> This does not work because mempools assume you can sleep,
> and in most NIC drivers you cant while doing RX refill.
> The NIC drivers can be rewritten to do this refilling in
> a workqueue. But it is not clear it is useful anyways because
> Linux failing to allocate a buffer is no different from
> the network overflowing the hardware queue of the network
> device, which Linux cannot do anything about.
Actually, not in 2.6 ... we had the same issue in SCSI using mempools
for sglist allocation. All of the mempool allocation paths now take gfp_
flags, so you can specify GFP_ATOMIC for interrupt context.
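For reference, that looks roughly like this with the stock mempool API (the
cache and pool here are just an example; the signatures are the 2.6-era ones,
so treat the details as approximate):

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

static kmem_cache_t *sg_cache;
static mempool_t *sg_pool;

static int __init sg_pool_init(void)
{
	sg_cache = kmem_cache_create("sg_reserve", 256, 0, 0, NULL, NULL);
	if (!sg_cache)
		return -ENOMEM;
	/* Keep at least 32 objects set aside for when the slab is dry. */
	sg_pool = mempool_create(32, mempool_alloc_slab,
				 mempool_free_slab, sg_cache);
	return sg_pool ? 0 : -ENOMEM;
}

/* Interrupt/softirq context: no sleeping. Note that with GFP_ATOMIC
 * mempool_alloc() returns NULL rather than waiting once the reserve is
 * exhausted, so callers still need a fallback. */
void *sg_reserve_get(void)
{
	return mempool_alloc(sg_pool, GFP_ATOMIC);
}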
> Basically a network consists of lots of interconnected
> queues, and even if you try to make the Linux specific
> side of the queue reliable there are lots of other queues
> that can still lose packets.
>
> With TCP that is no problem of course because in case of
> a packet loss the packet is just retransmitted.
>
> So in short using mempools on receiving is not needed.
The object isn't to make the queues *reliable*, it's to ensure the system
can make forward progress. So all we're trying to ensure is that the
sockets used to service storage have some probability of being able to
send and receive packets during low memory.
In your scenario, if we're out of memory and the system needs several
ACKs to the swap device for pages to be released to the system, I don't
see how we make forward progress: without a reserved resource to
allocate from, how does the ACK make it up the stack to the storage
driver layer?
James
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 16:12 ` Andi Kleen
` (2 preceding siblings ...)
2005-03-28 16:28 ` James Bottomley
@ 2005-03-28 16:37 ` Dmitry Yusupov
3 siblings, 0 replies; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-28 16:37 UTC (permalink / raw)
To: open-iscsi@googlegroups.com
Cc: Rik van Riel, mpm, andrea, michaelc, James.Bottomley,
ksummit-2005-discuss, netdev
On Mon, 2005-03-28 at 18:12 +0200, Andi Kleen wrote:
> Rik van Riel <riel@redhat.com> writes:
> > one for sending and one for receiving
> > 2) have a global emergency mempool available to receive network
> > packets when GFP_ATOMIC fails - this is useful since we don't
> > know who a packet is for when we get the NIC interrupt, and
> > it's easy to have just one pool to check
>
> This does not work because mempools assume you can sleep,
> and in most NIC drivers you cant while doing RX refill.
> The NIC drivers can be rewritten to do this refilling in
> a workqueue. But it is not clear it is useful anyways because
> Linux failing to allocate a buffer is no different from
> the network overflowing the hardware queue of the network
> device, which Linux cannot do anything about.
>
> Basically a network consists of lots of interconnected
> queues, and even if you try to make the Linux specific
> side of the queue reliable there are lots of other queues
> that can still lose packets.
>
> With TCP that is no problem of course because in case of
> a packet loss the packet is just retransmitted.
>
> So in short using mempools on receiving is not needed.
>
> Now memory allocation for writing is a different chapter.
> You cannot recover from a lost write.
> The kernel currently even goes into endless loops in this
> case (e.g. on TCP FIN allocation)
>
> With the exception of routing the allocation fortunately
> usually happens there in:
> - socket context: great you can use a per socket mempool
> - thread context: you can sleep
>
> With routing that is not the case, but it does not matter
> because it typically does not allocate new packets, but
> just sends out existing ones.
>
> In short you need mempool only for TX. With some luck you even only
> need them for the skbuff head and a small header buffer, since I would
> expect iscsi TX to be typically zero copy for data and just passing
> struct page *s around.
That's always true for Open-iSCSI and for sfnet in FFP (RFC 3720), but not
for other implementations. Moreover, if the NIC does not support
checksumming on Tx, then the sendpage() interface falls back to sendmsg(),
which will allocate a new skb each time.
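The choice being described looks roughly like this in an initiator's transmit
path (simplified sketch of my own; sk_route_caps, kernel_sendmsg() and the
sendpage op are the real interfaces, the page is assumed to be a mapped
lowmem page, and a real implementation would check NETIF_F_SG as well):

#include <linux/net.h>
#include <linux/netdevice.h>
#include <net/sock.h>

/* Zero-copy sendpage() when the NIC can checksum, otherwise fall back
 * to sendmsg(), which copies into a freshly allocated skb. */
static int iscsi_xmit_page(struct socket *sock, struct page *pg,
			   int offset, int len)
{
	struct sock *sk = sock->sk;

	if (sk->sk_route_caps & (NETIF_F_IP_CSUM | NETIF_F_HW_CSUM))
		return sock->ops->sendpage(sock, pg, offset, len,
					   MSG_DONTWAIT);

	{
		struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
		struct kvec iov = {
			.iov_base = page_address(pg) + offset,
			.iov_len  = len,
		};
		return kernel_sendmsg(sock, &msg, &iov, 1, len);
	}
}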
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-27 6:46 ` David S. Miller
` (3 preceding siblings ...)
2005-03-28 3:54 ` Rik van Riel
@ 2005-03-28 19:45 ` Roland Dreier
2005-03-28 20:32 ` Topic: Remote DMA network technologies Gerrit Huizenga
[not found] ` <1112042936.5088.22.camel@beastie>
4 siblings, 2 replies; 91+ messages in thread
From: Roland Dreier @ 2005-03-28 19:45 UTC (permalink / raw)
To: David S. Miller
Cc: Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
James.Bottomley, ksummit-2005-discuss, netdev
Let me slightly hijack this thread to throw out another topic that I
think is worth talking about at the kernel summit: handling remote DMA
(RDMA) network technologies.
As some of you might know, I'm one of the main authors of the
InfiniBand support in the kernel, and I think we have things fairly
well in hand there, although handling direct userspace access to RDMA
capabilities may raise some issues worth talking about.
However, there is also RDMA-over-TCP hardware beginning to be used,
based on the specs from the IETF rddp working group and the RDMA
Consortium. I would hope that we can abstract out the common pieces
for InfiniBand and RDMA NIC (RNIC) support and morph
drivers/infiniband into a more general drivers/rdma.
This is not _that_ offtopic, since RDMA NICs provide another way of
handling OOM for iSCSI. By having the NIC handle the network
transport through something like iSER, you avoid a lot of the issues
in this thread. Having to reconnect to a target while OOM is still a
problem, but it seems no worse in principle than the issues with a
dumb FC card that needs the host driver to handle fabric login.
I know that in the InfiniBand world, people have been able to run
stress tests of storage over SCSI RDMA Protocol (SRP) with very heavy
swapping going on and no deadlocks. SRP is in effect network storage
with the transport handled by the IB hardware.
However there are some sticky points that I would be interested in
discussing. For example, the IETF rddp drafts envisage what they call
a "dual stack" model: TCP connections are set up by the usual network
stack and run for a while in "streaming" mode until the application is
ready to start using RDMA. At that point there is an "MPA"
negotiation and then the socket is handed over to the RNIC. Clearly
moving the state from the kernel's stack to the RNIC is not trivial.
Other developers who have more direct experience with RNIC hardware or
perhaps just strong opinions may have other things in this area that
they'd like to talk about.
Thanks,
Roland
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: Topic: Remote DMA network technologies
2005-03-28 19:45 ` Roland Dreier
@ 2005-03-28 20:32 ` Gerrit Huizenga
2005-03-28 20:36 ` Roland Dreier
[not found] ` <1112042936.5088.22.camel@beastie>
1 sibling, 1 reply; 91+ messages in thread
From: Gerrit Huizenga @ 2005-03-28 20:32 UTC (permalink / raw)
To: Roland Dreier
Cc: David S. Miller, Dmitry Yusupov, mpm, andrea, michaelc,
open-iscsi, James.Bottomley, ksummit-2005-discuss, netdev
[ Can we start updating the Subject line occasionally when we have
a specific topic, like above. And no, I'm not top-posting, I'm
commenting on the Subject:. ;-) --gerrit ]
On Mon, 28 Mar 2005 11:45:19 PST, Roland Dreier wrote:
> Let me slightly hijack this thread to throw out another topic that I
> think is worth talking about at the kernel summit: handling remote DMA
> (RDMA) network technologies.
>
> As some of you might know, I'm one of the main authors of the
> InfiniBand support in the kernel, and I think we have things fairly
> well in hand there, although handling direct userspace access to RDMA
> capabilities may raise some issues worth talking about.
>
> However, there is also RDMA-over-TCP hardware beginning to be used,
> based on the specs from the IETF rddp working group and the RDMA
> Consortium. I would hope that we can abstract out the common pieces
> for InfiniBand and RDMA NIC (RNIC) support and morph
> drivers/infiniband into a more general drivers/rdma.
>
> This is not _that_ offtopic, since RDMA NICs provide another way of
> handling OOM for iSCSI. By having the NIC handle the network
> transport through something like iSER, you avoid a lot of the issues
> in this thread. Having to reconnect to a target while OOM is still a
> problem, but it seems no worse in principle than the issues with a
> dumb FC card that needs the host driver to handle fabric login.
>
> I know that in the InfiniBand world, people have been able to run
> stress tests of storage over SCSI RDMA Protocol (SRP) with very heavy
> swapping going on and no deadlocks. SRP is in effect network storage
> with the transport handled by the IB hardware.
>
> However there are some sticky points that I would be interested in
> discussing. For example, the IETF rddp drafts envisage what they call
> a "dual stack" model: TCP connections are set up by the usual network
> stack and run for a while in "streaming" mode until the application is
> ready to start using RDMA. At that point there is an "MPA"
> negotiation and then the socket is handed over to the RNIC. Clearly
> moving the state from the kernel's stack to the RNIC is not trivial.
>
> Other developers who have more direct experience with RNIC hardware or
> perhaps just strong opinions may have other things in this area that
> they'd like to talk about.
>
> Thanks,
> Roland
>
>
>
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: Topic: Remote DMA network technologies
2005-03-28 20:32 ` Topic: Remote DMA network technologies Gerrit Huizenga
@ 2005-03-28 20:36 ` Roland Dreier
0 siblings, 0 replies; 91+ messages in thread
From: Roland Dreier @ 2005-03-28 20:36 UTC (permalink / raw)
To: Gerrit Huizenga; +Cc: open-iscsi, ksummit-2005-discuss, netdev
Gerrit> [ Can we start updating the Subject line occasionally when
Gerrit> we have a specific topic, like above. And no, I'm not
Gerrit> top-posting, I'm commenting on the Subject:. ;-) --gerrit ]
Sorry -- your point is well taken, but everyone else was having so
much fun discussing iSCSI specifics under the old Subject line that I
just went with the flow ;)
- Roland
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
[not found] ` <1112042936.5088.22.camel@beastie>
@ 2005-03-28 22:32 ` Benjamin LaHaise
2005-03-29 3:19 ` Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Roland Dreier
2005-04-02 18:08 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Dmitry Yusupov
2005-03-29 3:14 ` Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Roland Dreier
1 sibling, 2 replies; 91+ messages in thread
From: Benjamin LaHaise @ 2005-03-28 22:32 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: open-iscsi, David S. Miller, mpm, andrea, michaelc,
James.Bottomley, ksummit-2005-discuss, netdev
On Mon, Mar 28, 2005 at 12:48:56PM -0800, Dmitry Yusupov wrote:
> If you have plans to start a new project such as SoftRDMA, then yes, let's
> discuss it, since the set of problems will be similar to what we've got with
> software iSCSI Initiators.
I'm somewhat interested in seeing a SoftRDMA project get off the ground.
At least the NatSemi 83820 gige MAC is able to provide early-rx interrupts
that allow one to get an rx interrupt before the full payload has arrived,
making it possible to write out a new rx descriptor to place the payload
wherever it is ultimately desired. It would be fun to work on, if not the
most performant RDMA implementation.
> I'm not a believer in any HW stateful protocol offloading technologies,
> and that was one of my motivations to initiate the Open-iSCSI project: to
> prove that performance is not an issue anymore. And we succeeded, by
> showing numbers comparable to an iSCSI HW Initiator's.
Agreed. After working on a full TOE implementation, I think that the
niche market most TOE vendors are pursuing is not one that the Linux
community will ever develop for. Hardware vendors that gradually add
offloading features from the NIC realm to speed up the existing network
stack are a much better fit with Linux.
> Though, for me, RDMA over TCP is an interesting topic from software
> implementation point of view. I was thinking about organizing new
> project. If someone knows that related work is already started - let me
> know since I might be interested to help.
Shall we create a new mailing list? I guess it's time to update
majordomo... =)
-ben
^ permalink raw reply [flat|nested] 91+ messages in thread
* Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics)
[not found] ` <1112042936.5088.22.camel@beastie>
2005-03-28 22:32 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Benjamin LaHaise
@ 2005-03-29 3:14 ` Roland Dreier
1 sibling, 0 replies; 91+ messages in thread
From: Roland Dreier @ 2005-03-29 3:14 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: open-iscsi, David S. Miller, mpm, andrea, michaelc,
James.Bottomley, ksummit-2005-discuss, netdev
Dmitry> Basically, HW offloading of any kind is a different
Dmitry> subject. Yes, iSER/RDMA/RNIC will help to avoid a bunch of
Dmitry> problems but at the same time will add a bunch of new
Dmitry> problems. The OOM/deadlock problem we are discussing is
Dmitry> software, *not* hardware, related.
Yes, that's why I said I was hijacking the topic to bring up something
else I was interested in :)
Dmitry> If you have plans to start new project such as SoftRDMA
Dmitry> than yes. lets discuss it since set of problems will be
Dmitry> similar to what we've got with software iSCSI Initiators.
No, I don't have plans for such a project, although I would be
interested in participating in a small way. Unfortunately I'm
involved in too many other things to do much real work on it.
My main interest comes from the InfiniBand world. Right now we have
the beginnings of good support for IB in drivers/infiniband, but
people are always talking to me about adding support for RDMA/TCP
hardware. I think we should be able to evolve the current InfiniBand
API to a more generic RDMA API, and I would hope that a "SoftRDMA"
project can fit in as just another low-level device driver (sort of
the same way software iSCSI sits under the SCSI stack).
In fact I think SoftRDMA would be very good for this generalization
work, as it would force us to come up with very flexible APIs.
Dmitry> I'm not a believer in any HW stateful protocol
Dmitry> offloading technologies, and that was one of my motivations
Dmitry> to initiate the Open-iSCSI project: to prove that performance
Dmitry> is not an issue anymore. And we succeeded, by showing
Dmitry> numbers comparable to an iSCSI HW Initiator's.
Fair enough. I think I agree that HW offload is not really justified
if all you care about is storage, although a cheap iSCSI HBA that
handles all the transport and just lets the host queue IOs seems like
a reasonable thing to put in a server that has work to do beyond
running a storage stack.
It seems that many people are using RDMA hardware (mostly InfiniBand
now, maybe RDMA/TCP will catch on) for other reasons. In
those cases users often want to share the same fabric and NIC for
storage too. But my main interest right now is in getting RDMA
working well on Linux for the users that are already out there -- I
know many IB clusters with hundreds and even thousands of nodes are
being built all the time, so InfiniBand must be solving some real
problems for users.
- R.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics)
2005-03-28 22:32 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Benjamin LaHaise
@ 2005-03-29 3:19 ` Roland Dreier
2005-03-30 16:00 ` Benjamin LaHaise
2005-04-02 18:08 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Dmitry Yusupov
1 sibling, 1 reply; 91+ messages in thread
From: Roland Dreier @ 2005-03-29 3:19 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: Dmitry Yusupov, open-iscsi, David S. Miller, mpm, andrea,
michaelc, James.Bottomley, ksummit-2005-discuss, netdev
Benjamin> Agreed. After working on a full TOE implementation, I
Benjamin> think that the niche market most TOE vendors are
Benjamin> pursuing is not one that the Linux community will ever
Benjamin> develop for. Hardware vendors that gradually add
Benjamin> offloading features from the NIC realm to speed up the
Benjamin> existing network stack are a much better fit with Linux.
I have to admit I don't know much about the TOE / RDMA/TCP / RNIC (or
whatever you want to call it) world. However I know that the large
majority of InfiniBand use right now is running on Linux, and I hope
the Linux community is willing to work with the IB community.
InfiniBand adoption is strong right now, with lots of large clusters
being built. It seems reasonable that RDMA/TCP should be able to
compete in the same market. Whether InfiniBand or RDMA/TCP or both
will survive or prosper is a good question, and I think it's too early
to tell yet.
- R.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 16:24 ` Rik van Riel
@ 2005-03-29 15:11 ` Andi Kleen
2005-03-29 15:29 ` Rik van Riel
2005-03-29 17:03 ` Matt Mackall
0 siblings, 2 replies; 91+ messages in thread
From: Andi Kleen @ 2005-03-29 15:11 UTC (permalink / raw)
To: Rik van Riel
Cc: Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
James.Bottomley, ksummit-2005-discuss, netdev
On Mon, Mar 28, 2005 at 11:24:55AM -0500, Rik van Riel wrote:
> On Mon, 28 Mar 2005, Andi Kleen wrote:
>
> > So in short using mempools on receiving is not needed.
>
> It is, because you have to ensure that the memory that's
> needed to receive network packets isn't tied up receiving
> packets for non-critical sockets, which would leave the
> critical sockets deadlocked.
Again, the in-socket queue is in no way different from all
the tens or hundreds of limited-size queues that make
up a network. It is quite useless to concentrate on only
one queue in the receiving computer while all the others can still lose
packets.
The only way to solve such problems in the TCP/IP model
is to retransmit at the source. This means the TCP write
path needs to be reliable, but receiving does not need to be.
TCP will continue retransmitting for hours. If your network
system is so tied up that you cannot receive anything for hours
then yes, you're screwed, but I doubt memory reservation will fix
such extreme problems.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 16:28 ` James Bottomley
@ 2005-03-29 15:20 ` Andi Kleen
2005-03-29 15:56 ` James Bottomley
2005-03-29 17:19 ` Dmitry Yusupov
0 siblings, 2 replies; 91+ messages in thread
From: Andi Kleen @ 2005-03-29 15:20 UTC (permalink / raw)
To: James Bottomley
Cc: Rik van Riel, Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
ksummit-2005-discuss, netdev
On Mon, Mar 28, 2005 at 10:28:04AM -0600, James Bottomley wrote:
> On Mon, 2005-03-28 at 18:12 +0200, Andi Kleen wrote:
> > This does not work because mempools assume you can sleep,
> > and in most NIC drivers you can't while doing RX refill.
> > The NIC drivers can be rewritten to do this refilling in
> > a workqueue. But it is not clear it is useful anyway because
> > Linux failing to allocate a buffer is no different from
> > the network overflowing the hardware queue of the network
> > device, which Linux cannot do anything about.
>
> Actually, not in 2.6 ... we had the same issue in SCSI using mempools
> for sglist allocation. All of the mempool allocation paths now take gfp_
> flags, so you can specify GFP_ATOMIC for interrupt context.
Just does not work when you are actually short of memory.
Just think for a second about how a mempool works: in the extreme
case, when it cannot allocate system memory anymore, it has
to wait for someone else to free a memory block into the mempool,
then pass it on to the next allocator, etc. Basically
it is a direct bypass pipeline to pass memory
directly from one high-priority user to another. This only
works with sleeping. Otherwise you could not handle an arbitrary
number of users with a single mempool.
So to get a reliable mempool you have to sleep on allocation.
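To spell that out in terms of the stock mempool API (rx_pool and the
helper names below are made up for illustration; only mempool_alloc()
itself is the real interface):

#include <linux/mempool.h>
#include <linux/gfp.h>

static mempool_t *rx_pool;      /* some driver's private reserve */

/* GFP_ATOMIC: once the reserve is drained and the page allocator is
 * empty, this simply returns NULL - there is no way to wait for
 * somebody else's mempool_free() to refill the pool. */
static void *rx_alloc_atomic(void)
{
        return mempool_alloc(rx_pool, GFP_ATOMIC);
}

/* GFP_KERNEL: may sleep inside mempool_alloc() until an element comes
 * back, which is the "bypass pipeline" described above. */
static void *rx_alloc_sleeping(void)
{
        return mempool_alloc(rx_pool, GFP_KERNEL);
}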
> The object isn't to make the queues *reliable* it's to ensure the system
> can make forward progress. So all we're trying to ensure is that the
> sockets used to service storage have some probability of being able to
> send and receive packets during low memory.
For that it is enough to make the sender reliable. Retransmit
takes care of the rest.
> In your scenario, if we're out of memory and the system needs several
> ACK's to the swap device for pages to be released to the system, I don't
> see how we make forward progress since without a reserved resource to
> allocate from how does the ack make it up the stack to the storage
> driver layer?
Typically because the RX ring of the driver has some packets left.
Also, since TCP is very persistent and there is some memory
activity left, you will have at least occasionally a time slot
where a GFP_ATOMIC allocation can succeed.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 15:11 ` Andi Kleen
@ 2005-03-29 15:29 ` Rik van Riel
2005-03-29 17:03 ` Matt Mackall
1 sibling, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2005-03-29 15:29 UTC (permalink / raw)
To: Andi Kleen
Cc: Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
James.Bottomley, ksummit-2005-discuss, netdev
On Tue, 29 Mar 2005, Andi Kleen wrote:
> On Mon, Mar 28, 2005 at 11:24:55AM -0500, Rik van Riel wrote:
> > On Mon, 28 Mar 2005, Andi Kleen wrote:
> >
> > > So in short using mempools on receiving is not needed.
> >
> > It is, because you have to ensure that the memory that's
> > needed to receive network packets isn't tied up receiving
> > packets for non-critical sockets, which would leave the
> > critical sockets deadlocked.
>
> Again, the in-socket queue is in no way different from all
> the tens or hundreds of limited-size queues that make
> up a network. It is quite useless to concentrate on only
> one queue in the receiving computer while all the others
> can still lose packets.
But ... are the packets already received by the network
stack dropped if memory is really low, so we can process
the packets for the memory-critical sockets?
If packets received for non-critical sockets can exhaust
memory, we will deadlock - and that could be the critical
difference between a router (which dumps all packets after
some time) and a Linux host running iSCSI...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 15:20 ` Andi Kleen
@ 2005-03-29 15:56 ` James Bottomley
2005-03-29 17:19 ` Dmitry Yusupov
1 sibling, 0 replies; 91+ messages in thread
From: James Bottomley @ 2005-03-29 15:56 UTC (permalink / raw)
To: Andi Kleen
Cc: Rik van Riel, Dmitry Yusupov, mpm, andrea, michaelc, open-iscsi,
ksummit-2005-discuss, netdev
On Tue, 2005-03-29 at 17:20 +0200, Andi Kleen wrote:
> > Actually, not in 2.6 ... we had the same issue in SCSI using mempools
> > for sglist allocation. All of the mempool allocation paths now take gfp_
> > flags, so you can specify GFP_ATOMIC for interrupt context.
>
> Just does not work when you are actually short of memory.
>
> Just think for a second about how a mempool works: in the extreme
> case, when it cannot allocate system memory anymore, it has
> to wait for someone else to free a memory block into the mempool,
> then pass it on to the next allocator, etc. Basically
> it is a direct bypass pipeline to pass memory
> directly from one high-priority user to another. This only
> works with sleeping. Otherwise you could not handle an arbitrary
> number of users with a single mempool.
>
> So to get a reliable mempool you have to sleep on allocation.
But that's not what we use them for. You are confusing reliability with
forward progress.
In SCSI we use GFP_ATOMIC mempools in order to make forward progress.
All the paths are coded to expect a failure (in which case we requeue).
For forward progress what we need is the knowledge that there are n
resources out there dedicated to us. When they return they get
reallocated straight to us and we can restart the queue processing
(there's actually a SCSI trigger that does this).
For receive mempools, the situation is much the same; if you have n
reserved buffers, then you have to drop the (n+1)th packet. However, the
resources will free up and go back to your mempool, and eventually you
accept the packet on retransmit.
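A rough sketch of that receive-side policy, assuming a dedicated skb
reserve (the names and numbers are invented; mempool_create(),
mempool_alloc() and the callback signatures are the stock API):

#include <linux/mempool.h>
#include <linux/skbuff.h>
#include <linux/errno.h>

#define RX_RESERVE_SKBS 64      /* "n" - an invented number */
#define RX_SKB_SIZE     2048

static mempool_t *rx_reserve;

static void *rx_skb_alloc(gfp_t gfp, void *data)
{
        return alloc_skb((unsigned long)data, gfp);
}

static void rx_skb_free(void *skb, void *data)
{
        kfree_skb(skb);
}

static int rx_reserve_init(void)
{
        rx_reserve = mempool_create(RX_RESERVE_SKBS, rx_skb_alloc,
                                    rx_skb_free, (void *)RX_SKB_SIZE);
        return rx_reserve ? 0 : -ENOMEM;
}

/* The (n+1)th packet under memory pressure gets a NULL here and is
 * dropped; the peer retransmits, and by then some of the n reserved
 * buffers should have been consumed and returned via mempool_free(). */
static struct sk_buff *rx_reserve_alloc(void)
{
        return mempool_alloc(rx_reserve, GFP_ATOMIC);
}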
The killer scenario (and why we require a mempool) is that someone else
gets the memory before you but then becomes blocked on another
allocation, so now you have no more allocations to allow forward
progress.
James
> > The object isn't to make the queues *reliable* it's to ensure the system
> > can make forward progress. So all we're trying to ensure is that the
> > sockets used to service storage have some probability of being able to
> > send and receive packets during low memory.
>
> For that it is enough to make the sender reliable. Retransmit
> takes care of the rest.
No ... we cannot get down to the situation where GFP_ATOMIC always
fails. Now we have no receive capacity at all and the system deadlocks.
> > In your scenario, if we're out of memory and the system needs several
> > ACK's to the swap device for pages to be released to the system, I don't
> > see how we make forward progress since without a reserved resource to
> > allocate from how does the ack make it up the stack to the storage
> > driver layer?
>
> Typically because the RX ring of the driver has some packets left.
>
> Also since TCP is very persistent and there is some memory
> activity left you will have at least occasionally a time slot
> where a GFP_ATOMIC allocation can succeed.
That's what I think a mempool is required to guarantee. Without it,
there are scenarios where GFP_ATOMIC always fails.
James
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 15:11 ` Andi Kleen
2005-03-29 15:29 ` Rik van Riel
@ 2005-03-29 17:03 ` Matt Mackall
1 sibling, 0 replies; 91+ messages in thread
From: Matt Mackall @ 2005-03-29 17:03 UTC (permalink / raw)
To: Andi Kleen
Cc: Rik van Riel, Dmitry Yusupov, andrea, michaelc, open-iscsi,
James.Bottomley, ksummit-2005-discuss, netdev
On Tue, Mar 29, 2005 at 05:11:59PM +0200, Andi Kleen wrote:
> On Mon, Mar 28, 2005 at 11:24:55AM -0500, Rik van Riel wrote:
> > On Mon, 28 Mar 2005, Andi Kleen wrote:
> >
> > > So in short using mempools on receiving is not needed.
> >
> > It is, because you have to ensure that the memory that's
> > needed to receive network packets isn't tied up receiving
> > packets for non-critical sockets, which would leave the
> > critical sockets deadlocked.
>
> Again, the in-socket queue is in no way different from all
> the tens or hundreds of limited-size queues that make
> up a network. It is quite useless to concentrate on only
> one queue in the receiving computer while all the others can still lose
> packets.
>
> The only way to solve such problems in the TCP/IP model
> is to retransmit at the source. This means the TCP write
> path needs to be reliable, but receiving does not need to be.
You don't seem to understand the deadlock yet. Host is OOM. Host must
flush pages to target to free memory. Host manages to draw skbs from
private reserve to do its writes. Target acknowledges writes, but host
is _still OOM_ and there is no memory including GFP_ATOMIC to allocate
receive buffers. Retransmission doesn't help because the
acknowledgements will always be dropped.
So it seems we need a way to receive such acknowledgements and quickly
discard everything else when we're pushed up against OOM.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 15:20 ` Andi Kleen
2005-03-29 15:56 ` James Bottomley
@ 2005-03-29 17:19 ` Dmitry Yusupov
2005-03-29 21:08 ` jamal
2005-03-30 5:12 ` H. Peter Anvin
1 sibling, 2 replies; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-29 17:19 UTC (permalink / raw)
To: Andi Kleen
Cc: James Bottomley, Rik van Riel, mpm, andrea, michaelc, open-iscsi,
ksummit-2005-discuss, netdev
On Tue, 2005-03-29 at 17:20 +0200, Andi Kleen wrote:
> > In your scenario, if we're out of memory and the system needs several
> > ACK's to the swap device for pages to be released to the system, I don't
> > see how we make forward progress since without a reserved resource to
> > allocate from how does the ack make it up the stack to the storage
> > driver layer?
>
> Typically because the RX ring of the driver has some packets left.
You cannot be sure. Some NICs have a very small number of HW
ring buffers. Under OOM pressure the host will most likely be so slow that
resources just might not be returned to the HW in time. Though it
depends on the link-layer driver implementation.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 17:19 ` Dmitry Yusupov
@ 2005-03-29 21:08 ` jamal
2005-03-29 22:00 ` Rik van Riel
` (3 more replies)
2005-03-30 5:12 ` H. Peter Anvin
1 sibling, 4 replies; 91+ messages in thread
From: jamal @ 2005-03-29 21:08 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: Andi Kleen, James Bottomley, Rik van Riel, mpm, andrea, michaelc,
open-iscsi, ksummit-2005-discuss, netdev
On Tue, 2005-03-29 at 12:19, Dmitry Yusupov wrote:
> On Tue, 2005-03-29 at 17:20 +0200, Andi Kleen wrote:
> > > In your scenario, if we're out of memory and the system needs several
> > > ACK's to the swap device for pages to be released to the system, I don't
> > > see how we make forward progress since without a reserved resource to
> > > allocate from how does the ack make it up the stack to the storage
> > > driver layer?
> >
> > Typically because the RX ring of the driver has some packets left.
>
> You cannot be sure. Some NICs have a very small number of HW
> ring buffers. Under OOM pressure the host will most likely be so slow that
> resources just might not be returned to the HW in time. Though it
> depends on the link-layer driver implementation.
>
I didn't quite follow the discussion - let me see if I can phrase the
problem correctly (trying to speak in general terms):
The sender is holding onto memory (the retransmit queue, I assume) waiting
for ACKs. Said sender is under OOM and therefore drops ACKs coming in,
and as a result can't let go of the precious resources sitting on the
retransmit queue.
And iSCSI can't wait long enough for someone else to release memory so
the ACKs can be delivered.
Did I capture this correctly?
If yes, the solution may be to just drop all non-high-prio packets coming
in during the denial-of-service attack (for lack of a better term). In
other words, some strict prioritization scheduling (or rate control) at
the network level, either at the NIC or ingress qdisc level.
On a slightly related topic: is SCSI (not iscsi) considered a reliable
protocol?
If yes, why would you wanna run a reliable protocol inside another
reliable protocol (TCP)?
cheers,
jamal
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 21:08 ` jamal
@ 2005-03-29 22:00 ` Rik van Riel
2005-03-29 22:17 ` Matt Mackall
2005-03-29 23:00 ` jamal
2005-03-29 22:03 ` Rick Jones
` (2 subsequent siblings)
3 siblings, 2 replies; 91+ messages in thread
From: Rik van Riel @ 2005-03-29 22:00 UTC (permalink / raw)
To: jamal
Cc: Dmitry Yusupov, Andi Kleen, James Bottomley, mpm, andrea,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Tue, 29 Mar 2005, jamal wrote:
> If yes, the solution may be to just drop all non-high-prio packets coming
> in during the denial-of-service attack (for lack of a better term). In
> other words, some strict prioritization scheduling (or rate control) at
> the network level, either at the NIC or ingress qdisc level.
Exactly, that is the proposal. However, we often will need
to get the packets off the network card before we can decide
whether or not they're high priority.
Also, there can be multiple high priority sockets, and we
need to ensure they all make progress. Hence the mempool
idea.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 21:08 ` jamal
2005-03-29 22:00 ` Rik van Riel
@ 2005-03-29 22:03 ` Rick Jones
2005-03-29 23:13 ` jamal
2005-03-30 15:22 ` Andi Kleen
2005-03-30 17:07 ` Grant Grundler
3 siblings, 1 reply; 91+ messages in thread
From: Rick Jones @ 2005-03-29 22:03 UTC (permalink / raw)
To: netdev; +Cc: open-iscsi, ksummit-2005-discuss
jamal wrote:
> I didn't quite follow the discussion - let me see if I can phrase the
> problem correctly (trying to speak in general terms):
>
> The sender is holding onto memory (the retransmit queue, I assume) waiting
> for ACKs. Said sender is under OOM and therefore drops ACKs coming in,
> and as a result can't let go of the precious resources sitting on the
> retransmit queue.
> And iSCSI can't wait long enough for someone else to release memory so
> the ACKs can be delivered.
> Did I capture this correctly?
>
> If yes, the solution may be to just drop all non-high-prio packets coming
> in during the denial-of-service attack (for lack of a better term). In
> other words, some strict prioritization scheduling (or rate control) at
> the network level, either at the NIC or ingress qdisc level.
Eventually the TCP will hit its RTX limit and punt the connection, freeing the
buffers kept for retransmission, right?
>
> On a slightly related topic: is SCSI (not iscsi) considered a reliable
> protocol?
> If yes, why would you wanna run a reliable protocol inside another
> reliable protocol (TCP)?
Isn't it better to consider TCP a protocol that provides reliable notice of
(presumed) failure rather than a "reliable protocol?"
rick jones
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 22:00 ` Rik van Riel
@ 2005-03-29 22:17 ` Matt Mackall
2005-03-29 23:30 ` jamal
2005-03-29 23:00 ` jamal
1 sibling, 1 reply; 91+ messages in thread
From: Matt Mackall @ 2005-03-29 22:17 UTC (permalink / raw)
To: Rik van Riel
Cc: jamal, Dmitry Yusupov, Andi Kleen, James Bottomley, andrea,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Tue, Mar 29, 2005 at 05:00:35PM -0500, Rik van Riel wrote:
> On Tue, 29 Mar 2005, jamal wrote:
>
> > If yes, the solution may be to just drop all non-high-prio packets coming
> > in during the denial-of-service attack (for lack of a better term). In
> > other words, some strict prioritization scheduling (or rate control) at
> > the network level, either at the NIC or ingress qdisc level.
>
> Exactly, that is the proposal. However, we often will need
> to get the packets off the network card before we can decide
> whether or not they're high priority.
>
> Also, there can be multiple high priority sockets, and we
> need to ensure they all make progress. Hence the mempool
> idea.
I'm sure Rik realizes this, but it's important to note here that
"making progress" may require M acknowledgements to N packets
representing a single IO. So we need separate send and acknowledge
pools for each SO_MEMALLOC socket so that we don't find ourselves
wedged with M-1 available mempool slots when we're waiting on ACKs. So
accounting ACK packets to the appropriate receiver once we've figured
out what socket an ACK is intended for is critical.
Note that ACK here is the application layer command result that needs
to be propagated back to the driver (and possibly higher in the case
of things like CD writing over iSCSI) and not simply a bit in the TCP
header.
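In other words, something shaped roughly like this per logical storage
device (none of these names exist in the kernel; it just spells out the
accounting split):

/* Hypothetical per-device reserve: the buffers used to push writes out
 * and the buffers needed to get the completions back in are accounted
 * separately, so waiting senders can never starve the ACK path. */
struct storage_reserve {
        mempool_t *cmd_pool;    /* skbs for outgoing writes      */
        mempool_t *rsp_pool;    /* skbs for command completions  */
};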
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 22:00 ` Rik van Riel
2005-03-29 22:17 ` Matt Mackall
@ 2005-03-29 23:00 ` jamal
2005-03-29 23:25 ` Matt Mackall
2005-03-30 15:24 ` Andi Kleen
1 sibling, 2 replies; 91+ messages in thread
From: jamal @ 2005-03-29 23:00 UTC (permalink / raw)
To: Rik van Riel
Cc: Dmitry Yusupov, Andi Kleen, James Bottomley, mpm, andrea,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Tue, 2005-03-29 at 17:00, Rik van Riel wrote:
> However, we often will need
> to get the packets off the network card before we can decide
> whether or not they're high priority.
True - although one could argue that with NAPI that decision would be a
few opcodes away if you install the ingress qdisc.
So you may end up allocating only to free a few cycles later. Increased
memory traffic, but the discard happens sufficiently early for a s/ware-only
solution and CPU cycles are not burnt as much.
OTOH, even elcheapo pacific-rim NICs are beginning to show up with some
classifiers in the hardware as well as multiple queues or rx
DMA rings. So you could program ACKs or all TCP packets to show up on a
higher-priority ring and only process that until there's nothing left
before processing the low-priority ring (i.e. don't care if the low-priority
data/app is starved).
Probably someone going out of their way to do high-performance iSCSI
would consider such hardware.
We don't exactly support multiple rx (or tx) DMA rings; however, various
people seem to be promising patches that work with their hardware
(netiron, intel?).
> Also, there can be multiple high priority sockets, and we
> need to ensure they all make progress. Hence the mempool
> idea.
Sorry, I missed the early part of this thread: is a mempool some
strict priority scheme for mem allocation?
For sockets: if there were a "control" arbitrator, preferably in user
space, which would install - after a socket open - both network ingress
and/or egress rules for prioritization, then wouldn't that suffice?
The mechanisms are already in place today.
cheers,
jamal
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 22:03 ` Rick Jones
@ 2005-03-29 23:13 ` jamal
2005-03-30 2:28 ` Alex Aizman
` (2 more replies)
0 siblings, 3 replies; 91+ messages in thread
From: jamal @ 2005-03-29 23:13 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev, open-iscsi, ksummit-2005-discuss
On Tue, 2005-03-29 at 17:03, Rick Jones wrote:
>
> Eventually the TCP will hit its RTX limit and punt the connection, freeing the
> buffers kept for retransmission right?
>
If I read correctly, the people arguing for iSCSI say that's not good
enough. But they may be having other issues too...
> >
> > On a slightly related topic: is SCSI (not iscsi) considered a reliable
> > protocol?
> > If yes, why would you wanna run a reliable protocol inside another
> > reliable protocol (TCP)?
>
> Isn't it better to consider TCP a protocol that provides reliable notice of
> (presumed) failure rather than a "reliable protocol?"
>
You could if the parameters are adequately set (I think).
If both are reliable protocols then they would both have the standard
features and parameters:
- transmit (for simplicity assume a window of 1)
loop for X times
{
- compute next retransmit time, Y, using some algorithm
- wait for ACK
- timeout
- retransmit
}
so parameters X and the retransmit time are where the conflict is.
If TCP is eagerly retransmitting, a lot of bandwidth could be
wasted. If SCSI has X as infinite, even more interesting things
could happen.
In any case I have seen horror stories of what happened to people
who tried to encapsulate an already reliable protocol inside TCP in
order to ship it across the big bad internet. I am pretty sure some
knowledgeable people were involved in getting iSCSI going, so it can't be
that simple. It would seem preferable to use SCTP instead.
cheers,
jamal
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 23:00 ` jamal
@ 2005-03-29 23:25 ` Matt Mackall
2005-03-30 0:30 ` H. Peter Anvin
2005-03-30 15:24 ` Andi Kleen
1 sibling, 1 reply; 91+ messages in thread
From: Matt Mackall @ 2005-03-29 23:25 UTC (permalink / raw)
To: jamal
Cc: Rik van Riel, Dmitry Yusupov, Andi Kleen, James Bottomley, andrea,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Tue, Mar 29, 2005 at 06:00:11PM -0500, jamal wrote:
> On Tue, 2005-03-29 at 17:00, Rik van Riel wrote:
>
> > However, we often will need
> > to get the packets off the network card before we can decide
> > whether or not they're high priority.
>
> True - although one could argue that with NAPI that decision would be a
> few opcodes away if you install the ingress qdisc.
> So you may end up allocating only to free a few cycles later. Increased
> memory traffic, but the discard happens sufficiently early for a s/ware-only
> solution and CPU cycles are not burnt as much.
[...]
I think we first need a software solution that makes no special
assumptions about hardware capabilities.
> > Also, there can be multiple high priority sockets, and we
> > need to ensure they all make progress. Hence the mempool
> > idea.
>
> Sorry, I missed the early part of this thread: is a mempool some
> strict priority scheme for mem allocation?
A mempool is a private allocation pool that attempts to maintain a
reserve of N objects. Various users in the kernel already. See
mm/mempool.c.
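For reference, the interface looks roughly like this (the cache and the
pool below are purely illustrative):

#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/errno.h>

static kmem_cache_t *cmd_cache;
static mempool_t *cmd_pool;

static int cmd_pool_init(void)
{
        cmd_cache = kmem_cache_create("example_cmd", 256, 0, 0,
                                      NULL, NULL);
        if (!cmd_cache)
                return -ENOMEM;

        /* Keep a private reserve of 16 objects; mempool_alloc() falls
         * back to it when kmem_cache_alloc() fails. */
        cmd_pool = mempool_create(16, mempool_alloc_slab,
                                  mempool_free_slab, cmd_cache);
        return cmd_pool ? 0 : -ENOMEM;
}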
> For sockets: if there were a "control" arbitrator, preferably in user
> space, which would install - after a socket open - both network ingress
> and/or egress rules for prioritization, then wouldn't that suffice?
Generally, we don't want any special handling except when we're
effectively OOM.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 22:17 ` Matt Mackall
@ 2005-03-29 23:30 ` jamal
0 siblings, 0 replies; 91+ messages in thread
From: jamal @ 2005-03-29 23:30 UTC (permalink / raw)
To: Matt Mackall
Cc: Rik van Riel, Dmitry Yusupov, Andi Kleen, James Bottomley, andrea,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Tue, 2005-03-29 at 17:17, Matt Mackall wrote:
> I'm sure Rik realizes this, but it's important to note here that
> "making progress" may require M acknowledgements to N packets
> representing a single IO. So we need separate send and acknowledge
> pools for each SO_MEMALLOC socket so that we don't find ourselves
> wedged with M-1 available mempool slots when we're waiting on ACKs. So
> accounting ACK packets to the appropriate receiver once we've figured
> out what socket an ACK is intended for is critical.
>
Is this idea discussed or posted somewhere? I just subscribed to the
list.
Sounds like what the NICs I described do on rx - some strict priority
scheme.
Seems to me the TX side needs to be done early, perhaps at the socket
layer.
The RX side needs to be done at the NIC or ingress qdisc.
I think there may be a need for multiple levels of granularity of
priorities for mem allocation pools - 8 or more if you want to have
different levels of importance in apps.
The deal with strict prio is that the most important apps can eat all the
memory if they need it; so you may need some form of deficit-based
scheduling, or kick in the algorithm only when a certain threshold is
crossed system-wide.
> Note that ACK here is the application layer command result that needs
> to be propagated back to the driver (and possibly higher in the case
> of things like CD writing over iSCSI) and not simply a bit in the TCP
> header.
I got that (given TCP is stream based).
cheers,
jamal
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 23:25 ` Matt Mackall
@ 2005-03-30 0:30 ` H. Peter Anvin
0 siblings, 0 replies; 91+ messages in thread
From: H. Peter Anvin @ 2005-03-30 0:30 UTC (permalink / raw)
To: Matt Mackall
Cc: jamal, Rik van Riel, Dmitry Yusupov, Andi Kleen, James Bottomley,
andrea, michaelc, open-iscsi, ksummit-2005-discuss, netdev
Matt Mackall wrote:
>
> I think we first need a software solution that makes no special
> assumptions about hardware capabilities.
>
Absolutely; having hardware assist will typically reduce, not increase,
the memory requirements, so the "dumb hardware" solution is likely to be
the most generic.
-hpa
^ permalink raw reply [flat|nested] 91+ messages in thread
* RE: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 23:13 ` jamal
@ 2005-03-30 2:28 ` Alex Aizman
[not found] ` <E1DGSwp-0004ZE-00@thunker.thunk.org>
2005-03-30 18:46 ` Dmitry Yusupov
2 siblings, 0 replies; 91+ messages in thread
From: Alex Aizman @ 2005-03-30 2:28 UTC (permalink / raw)
To: open-iscsi, 'Rick Jones'; +Cc: 'netdev', ksummit-2005-discuss
Jamal wrote:
>
> >
> > Eventually the TCP will hit its RTX limit and punt the connection,
> > freeing the buffers kept for retransmission right?
> >
>
> If i read correctly the people arguing for iscsi say thats
> not good enough. But they may be having other issues too...
It is not good enough for storage.
If we continue to fall back into the TCP-will-eventually-recover mentality,
iSCSI, or at least a "soft" non-offloaded iSCSI over regular non-TOE TCP,
will not be able to compete with FC, which uses deterministic credit-based
flow control. That non-determinism is a bigger issue, while a corner case
like the swap device happening to be iSCSI-remote is just that, a corner case
that helps to highlight and bring the general problem to the foreground.
Adding a "critical" or "resource-protected" attribute to the connection
context is a step in the right direction. Next steps include:
- triage (closing non-critical connections in OOM);
- socket reopen without deallocating memory (something like: close(int
socket_fd, int will_reopen));
- preallocated mempools (it is much better to discover OOM at connection
open time than well into runtime);
- better resource "counting" throughout the L3 and L4 layers to preventively
handle OOM;
And more incremental steps like that.
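For the "critical"/"resource-protected" attribute, the initiator side
could look something like this, for illustration only (SO_MEMALLOC is
the hypothetical option floated elsewhere in this thread, not an
existing sockopt):

#include <sys/socket.h>

#ifndef SO_MEMALLOC
#define SO_MEMALLOC 51          /* hypothetical option number */
#endif

/* Mark an iSCSI session socket as "resource-protected" so the kernel
 * can attach its emergency reserves to it at connection open time. */
static int mark_connection_critical(int fd)
{
        int one = 1;

        return setsockopt(fd, SOL_SOCKET, SO_MEMALLOC, &one, sizeof(one));
}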
Alex
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 17:19 ` Dmitry Yusupov
2005-03-29 21:08 ` jamal
@ 2005-03-30 5:12 ` H. Peter Anvin
1 sibling, 0 replies; 91+ messages in thread
From: H. Peter Anvin @ 2005-03-30 5:12 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: Andi Kleen, James Bottomley, Rik van Riel, mpm, andrea, michaelc,
open-iscsi, ksummit-2005-discuss, netdev
Dmitry Yusupov wrote:
> On Tue, 2005-03-29 at 17:20 +0200, Andi Kleen wrote:
>
>>>In your scenario, if we're out of memory and the system needs several
>>>ACK's to the swap device for pages to be released to the system, I don't
>>>see how we make forward progress since without a reserved resource to
>>>allocate from how does the ack make it up the stack to the storage
>>>driver layer?
>>
>>Typically because the RX ring of the driver has some packets left.
>
>
> You cannot be sure. Some NICs have a very small number of HW
> ring buffers. Under OOM pressure the host will most likely be so slow that
> resources just might not be returned to the HW in time. Though it
> depends on the link-layer driver implementation.
>
This seems to become part of the whole thing... in other words, in an
OOM situation, we may have to free RX ring entries by just dropping packets
as necessary...
-hpa
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 21:08 ` jamal
2005-03-29 22:00 ` Rik van Riel
2005-03-29 22:03 ` Rick Jones
@ 2005-03-30 15:22 ` Andi Kleen
2005-03-30 15:33 ` Andrea Arcangeli
2005-03-30 17:24 ` Matt Mackall
2005-03-30 17:07 ` Grant Grundler
3 siblings, 2 replies; 91+ messages in thread
From: Andi Kleen @ 2005-03-30 15:22 UTC (permalink / raw)
To: jamal
Cc: Dmitry Yusupov, James Bottomley, Rik van Riel, mpm, andrea,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Tue, Mar 29, 2005 at 04:08:32PM -0500, jamal wrote:
> The sender is holding onto memory (the retransmit queue, I assume) waiting
> for ACKs. Said sender is under OOM and therefore drops ACKs coming in,
> and as a result can't let go of the precious resources sitting on the
> retransmit queue.
> And iSCSI can't wait long enough for someone else to release memory so
> the ACKs can be delivered.
> Did I capture this correctly?
Or worse, your swap device is on iSCSI and you need the ACK to free
memory.
But that is unrealistic because it could only happen if 100% of
your memory is dirty pages or filled up by other non-VM users.
Which I think is pretty unlikely. Normally the dirty limits in the VM
should prevent it anyway - the VM is supposed to block before all
your memory is dirty. The CPU can still dirty pages in user space,
but the cleaner should also clean them and, if necessary, block
the process.
>
> If yes, the solution may be to just drop all non-high-prio packets coming
> in during the denial-of-service attack (for lack of a better term). In
> other words, some strict prioritization scheduling (or rate control) at
> the network level, either at the NIC or ingress qdisc level.
It does not help. You would need this filtering in all possible
queues of the network (including all routers, the RX queue of the NIC, etc.).
Otherwise the queue in front of you can always starve you in theory.
It is even impossible to do this filtering for normal Ethernet devices
because they cannot easily distinguish different flows and put
them into different queues (and even if they can it is usually useless,
because the max number of flows is so small that you would add
an arbitrarily small limit on the number of your iSCSI connections, which
users surely would not like).
And you cannot control the Ethernet in front of it either.
The only exception would be a network that is designed around
bandwidth allocation, like an ATM network. But definitely not TCP/IP.
So you cannot solve this problem perfectly. All you can do is
come up with "good enough" solutions. Have big enough pipes. Keep
big enough free memory around, etc. I suspect mempools are not really
needed for that; probably just some statistical early dropping
of packets is enough to give the retransmits a high enough
chance to actually make it.
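I.e. something along these lines in the receive path (the threshold and
the hook point are invented; nr_free_pages() and net_random() are real):

#include <linux/skbuff.h>
#include <linux/swap.h>         /* nr_free_pages() */
#include <linux/net.h>          /* net_random() */

#define EARLY_DROP_PAGES 256    /* invented threshold */

/* Under memory pressure, statistically drop most packets early so that
 * whatever memory is left has a decent chance of carrying a
 * retransmitted ACK all the way up the stack. */
static int maybe_early_drop(struct sk_buff *skb)
{
        if (nr_free_pages() < EARLY_DROP_PAGES && (net_random() & 3)) {
                kfree_skb(skb);         /* ~75% drop probability */
                return 1;
        }
        return 0;
}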
> On a slightly related topic: is SCSI (not iscsi) considered a reliable
> protocol?
> If yes, why would you wanna run a reliable protocol inside another
> reliable protocol (TCP)?
iSCSI runs on top of TCP AFAIK
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 23:00 ` jamal
2005-03-29 23:25 ` Matt Mackall
@ 2005-03-30 15:24 ` Andi Kleen
1 sibling, 0 replies; 91+ messages in thread
From: Andi Kleen @ 2005-03-30 15:24 UTC (permalink / raw)
To: jamal
Cc: Rik van Riel, Dmitry Yusupov, James Bottomley, mpm, andrea,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
> OTOH, even elcheapo pacific-rim NICs are beginning to show up with some
> classifiers in the hardware as well as multiple queues or rx
> DMA rings. So you could program ACKs or all TCP packets to show up on a
Yes, but you end up with limits like "only supports up to 4 iSCSI connections",
which is totally impracticable. The features do not help.
The only feature that might help is an early interrupt where you
first process the header and then later the payload (like the new
Intel accelerator proposes, or some chips have), but that would
probably add far too much overhead right now and does not work
on most chips anyway.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 15:22 ` Andi Kleen
@ 2005-03-30 15:33 ` Andrea Arcangeli
2005-03-30 15:38 ` Rik van Riel
2005-03-30 15:39 ` Andi Kleen
2005-03-30 17:24 ` Matt Mackall
1 sibling, 2 replies; 91+ messages in thread
From: Andrea Arcangeli @ 2005-03-30 15:33 UTC (permalink / raw)
To: Andi Kleen
Cc: jamal, Dmitry Yusupov, James Bottomley, Rik van Riel, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, Mar 30, 2005 at 05:22:08PM +0200, Andi Kleen wrote:
> Which I think is pretty unlikely. Normally the dirty limits in the VM
> should prevent it anyway - the VM is supposed to block before all
> your memory is dirty. The CPU can still dirty pages in user space,
This is not true for MAP_SHARED, as I mentioned earlier in this thread.
The dirty limits can't trigger unless you want to take a page fault for
every single memory write opcode that touches a clean page (as well as
marking the pte clean and write-protected during writepage). And we're not
going to change anything for swap and anon/shm, so it would still be an
issue for swap over iSCSI.
You may be right that the receive path is less of a practical issue, but
it's still very much an at least theoretical source of deadlock.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 15:33 ` Andrea Arcangeli
@ 2005-03-30 15:38 ` Rik van Riel
2005-03-30 15:39 ` Andi Kleen
1 sibling, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2005-03-30 15:38 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andi Kleen, jamal, Dmitry Yusupov, James Bottomley, mpm, michaelc,
open-iscsi, ksummit-2005-discuss, netdev
On Wed, 30 Mar 2005, Andrea Arcangeli wrote:
> You may be right that the receive path is less of a practical issue, but
> it's still very much an at least theoretical source of deadlock.
Oh, but this deadlock has been seen in practice. When the
system is very low on memory, kswapd will keep allocating
memory to write things out, all the way down to 0 free pages.
Then there will be no memory left for GFP_ATOMIC allocations.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 15:33 ` Andrea Arcangeli
2005-03-30 15:38 ` Rik van Riel
@ 2005-03-30 15:39 ` Andi Kleen
2005-03-30 15:44 ` Andrea Arcangeli
1 sibling, 1 reply; 91+ messages in thread
From: Andi Kleen @ 2005-03-30 15:39 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: jamal, Dmitry Yusupov, James Bottomley, Rik van Riel, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
> You may be right that the receive path is less of a practical issue, but
> it's still very much an at least theoretical source of deadlock.
An unsolvable one IMHO. You can just try to be good enough. For that,
probably simple statistical solutions (like RED on ingress queues and
very aggressive freeing of secondary caches like the dcache, etc.)
will hopefully be sufficient.
Basically the same thing we do about highmem vs. lowmem.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 15:39 ` Andi Kleen
@ 2005-03-30 15:44 ` Andrea Arcangeli
2005-03-30 15:50 ` Rik van Riel
2005-03-30 16:02 ` Andi Kleen
0 siblings, 2 replies; 91+ messages in thread
From: Andrea Arcangeli @ 2005-03-30 15:44 UTC (permalink / raw)
To: Andi Kleen
Cc: jamal, Dmitry Yusupov, James Bottomley, Rik van Riel, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, Mar 30, 2005 at 05:39:48PM +0200, Andi Kleen wrote:
> An unsolvable one IMHO. You can just try to be good enough. For that,
I think it's solvable with an algorithm I outlined several emails ago.
> Basically same thing we do about highmem vs lowmem.
That's not a deadlock; the lowmem vs. highmem issue is about running out
of memory too early and getting a -ENOMEM out of a syscall (or an
oom-killing).
A few buggy kernels deadlocked in such conditions, but that's just because
those buggy kernels would deadlock in most OOM conditions anyway; current
2.6 kernels and the latest 2.4 shouldn't deadlock on a lowmem shortage.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 15:44 ` Andrea Arcangeli
@ 2005-03-30 15:50 ` Rik van Riel
2005-03-30 16:04 ` James Bottomley
2005-03-30 16:02 ` Andi Kleen
1 sibling, 1 reply; 91+ messages in thread
From: Rik van Riel @ 2005-03-30 15:50 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andi Kleen, jamal, Dmitry Yusupov, James Bottomley, mpm, michaelc,
open-iscsi, ksummit-2005-discuss, netdev
On Wed, 30 Mar 2005, Andrea Arcangeli wrote:
> On Wed, Mar 30, 2005 at 05:39:48PM +0200, Andi Kleen wrote:
> > An unsolvable one IMHO. You can just try to be good enough. For that,
>
> I think it's solvable with an algorithm I outlined several emails ago.
Agreed, this is definitely solvable. If people don't agree,
we should probably fight this out at the kernel summit. ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics)
2005-03-29 3:19 ` Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Roland Dreier
@ 2005-03-30 16:00 ` Benjamin LaHaise
2005-03-31 1:08 ` Linux support for RDMA H. Peter Anvin
0 siblings, 1 reply; 91+ messages in thread
From: Benjamin LaHaise @ 2005-03-30 16:00 UTC (permalink / raw)
To: Roland Dreier
Cc: Dmitry Yusupov, open-iscsi, David S. Miller, mpm, andrea,
michaelc, James.Bottomley, ksummit-2005-discuss, netdev
On Mon, Mar 28, 2005 at 07:19:35PM -0800, Roland Dreier wrote:
> Benjamin> Agreed. After working on a full TOE implementation, I
> Benjamin> think that the niche market most TOE vendors are
> Benjamin> pursuing is not one that the Linux community will ever
> Benjamin> develop for. Hardware vendors that gradually add
> Benjamin> offloading features from the NIC realm to speed up the
> Benjamin> existing network stack are a much better fit with Linux.
>
> I have to admit I don't know much about the TOE / RDMA/TCP / RNIC (or
> whatever you want to call it) world. However I know that the large
> majority of InfiniBand use right now is running on Linux, and I hope
> the Linux community is willing to work with the IB community.
My comments were more directed at full TOE implementations, which tend
to suffer from incomplete feature coverage compared to the native
Linux TCP/IP stack. Wedging a complete network stack onto a piece of
hardware does allow for better performance characteristics on workloads
where the networking overhead matters, but it comes at the cost of not
being able to trivially change the resulting stack. Plus there are
very few vendors who are willing to release firmware code to the open
source community.
> InfiniBand adoption is strong right now, with lots of large clusters
> being built. It seems reasonable that RDMA/TCP should be able to
> compete in the same market. Whether InfiniBand or RDMA/TCP or both
> will survive or prosper is a good question, and I think it's too early
> to tell yet.
I'm curious how the 10Gig Ethernet market will pan out. Time and again
the market has shown that Ethernet always has the cost advantage in the
end. If something like Intel's I/O Acceleration Technology makes it
that much easier for commodity Ethernet to achieve performance
characteristics similar to those of IB and Fibre Channel, the cost
advantage alone might switch some new customers over. But the hardware
isn't near what IB offers today, making IB an important niche filler.
-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 15:44 ` Andrea Arcangeli
2005-03-30 15:50 ` Rik van Riel
@ 2005-03-30 16:02 ` Andi Kleen
2005-03-30 16:15 ` Andrea Arcangeli
1 sibling, 1 reply; 91+ messages in thread
From: Andi Kleen @ 2005-03-30 16:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: jamal, Dmitry Yusupov, James Bottomley, Rik van Riel, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, Mar 30, 2005 at 05:44:18PM +0200, Andrea Arcangeli wrote:
> On Wed, Mar 30, 2005 at 05:39:48PM +0200, Andi Kleen wrote:
> > An unsolveable one IMHO. You can just try to be good enough. For that
>
> I think it's solvable with an algorithm I outlined several emails ago.
The problem with your algorithm is that you cannot control
how the NIC puts incoming packets into RX rings (and then
whether the packets you are interested in actually arrive from
the net ;-)
While some NICs have hardware support to get high-priority
packets into different queues, these tend to add nasty limits
on the max number of connections, which IMHO is not acceptable.
"We have an enterprise class OS with iSCSI which can only
support four swap devices"
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 15:50 ` Rik van Riel
@ 2005-03-30 16:04 ` James Bottomley
2005-03-30 17:48 ` H. Peter Anvin
0 siblings, 1 reply; 91+ messages in thread
From: James Bottomley @ 2005-03-30 16:04 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, Andi Kleen, jamal, Dmitry Yusupov, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, 2005-03-30 at 10:50 -0500, Rik van Riel wrote:
> Agreed, this is definitely solvable. If people don't agree,
> we should probably fight this out at the kernel summit. ;)
So just to make this explicit, we kill this thread and lobby for a topic
at the kernel summit discussing the problem of storage over net and the
ways of solving it. Yes, I think that's a very good idea.
James
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 16:02 ` Andi Kleen
@ 2005-03-30 16:15 ` Andrea Arcangeli
2005-03-30 16:55 ` jamal
` (4 more replies)
0 siblings, 5 replies; 91+ messages in thread
From: Andrea Arcangeli @ 2005-03-30 16:15 UTC (permalink / raw)
To: Andi Kleen
Cc: jamal, Dmitry Yusupov, James Bottomley, Rik van Riel, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, Mar 30, 2005 at 06:02:55PM +0200, Andi Kleen wrote:
> On Wed, Mar 30, 2005 at 05:44:18PM +0200, Andrea Arcangeli wrote:
> > On Wed, Mar 30, 2005 at 05:39:48PM +0200, Andi Kleen wrote:
> > > An unsolveable one IMHO. You can just try to be good enough. For that
> >
> > I think it's solvable with an algorithm I outlined several emails ago.
>
> The problem with your algorithm is that you cannot control
> how the NIC puts incoming packets into RX rings (and then
> whether the packets you are interested in actually arrive from
> the net ;-)
All I care about is to assign a mempool ID to the skb (the ID being a unique
identifier for the TCP connection; I don't care how it is implemented).
If, while moving up the stack, the skb data doesn't match the
sock->mempool ID, we'll just free the packet and put it back in the
mempool.
This of course only triggers for skbs marked with a mempool ID; all
skbs allocated with plain GFP_ATOMIC will have a null ID, they won't check
anything and nothing will change for them.
After GFP fails you pick the skb from a random mempool every time, so you
need all mempools belonging to sockets that route somehow through a
certain NIC driver instance to be quickly reachable from the NIC device
driver.
I don't see any problem with this algo. I don't need to control how the NIC
processes the incoming packets: after GFP fails I allocate from a
random mempool, I set the skb mempool ID to the ID of the mempool we
picked from, and I let the stack process it. Then you need a check, as
soon as you've finished processing the TCP header, to release the skb back
to its originating mempool immediately if the sock mempool ID doesn't
match the skb mempool ID, but that's easy.
All that matters is that this skb can't get stuck in the middle of nowhere
in an unfreeable state, but I don't see how it could get stuck between
netif_rx and the sock identification via the TCP and IP headers. It
just can't get stuck: either it's freed prematurely, or it's freed by us
with the new mempool ID check. It could get stuck if we let it go
ahead into some out-of-order queues, but not before our new check for
the mempool ID after TCP header decode.
This is all going to be complex to code, but I think it's technically
doable.
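Concretely, the check after the socket lookup could look something like
this (skb->mempool and sk->sk_mempool don't exist; they stand in for
whatever the "mempool ID" ends up being):

#include <linux/skbuff.h>
#include <net/sock.h>

/* Sketch only: returns non-zero if the caller should stop processing
 * because the packet was charged to the wrong reserve and has been
 * given back. */
static int skb_check_mempool(struct sock *sk, struct sk_buff *skb)
{
        if (!skb->mempool)              /* ordinary GFP_ATOMIC skb */
                return 0;

        if (skb->mempool == sk->sk_mempool)
                return 0;               /* right consumer, keep it */

        /* Wrong consumer: free it now so the reserve refills (the skb
         * destructor would do the actual mempool_free()). */
        kfree_skb(skb);
        return 1;
}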
> While some NICs have hardware support to get high priority
> packets into different queues these tend to add nasty limits
> on the max number of connections. Which IMHO is not acceptable.
>
> "We have an enterprise class OS with iSCSI which can only
> support four swap devices"
;) I agree the hardware solution isn't appealing.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 16:15 ` Andrea Arcangeli
@ 2005-03-30 16:55 ` jamal
2005-03-30 18:42 ` Rik van Riel
2005-03-30 19:28 ` Alex Aizman
` (3 subsequent siblings)
4 siblings, 1 reply; 91+ messages in thread
From: jamal @ 2005-03-30 16:55 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andi Kleen, Dmitry Yusupov, James Bottomley, Rik van Riel, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, 2005-03-30 at 11:15, Andrea Arcangeli wrote:
> > The problem with your algorithm is that you cannot control
> > how the NIC puts incoming packets into RX rings (and then
> > whether the packets you are interested in actually arrive from
> > the net ;-)
>
> All I care about is to assign a mempool ID to the skb (ID being unique
> identifier for the tcp connection I don't care how the implementation
> is).
Mechanisms are in place today.
> If while moving up the stack the skb data doesn't match to the
> sock->mempool id, we'll just free the packet and put it back in the
> mempool.
>
I think you may need to reserve some small number of buffers per NIC
(<= RX DMA ring size) that are used as temporary buffers before the
decision is made to reassign to the higher priority memory or to drop.
The decision, if made in software only, would need to consult a
classifier at ingress (hopefully you are using NAPI and kick this in
only on overload).
The upgrade implies restoring the temporary buffer to the NIC.
The NIC rx side only makes progress if temp buffers are available.
Since they are restored a short distance after they are allocated
(whether you drop or upgrade), progress will always happen.
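Roughly, that overload path could look like the sketch below. The
classifier hook, the upgrade helper and the refill call are made-up
names standing in for whatever a real implementation would use:

#include <linux/mempool.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch of the reserve/upgrade/drop flow under memory pressure.
 * classify_protected_flow(), upgrade_skb_to_pool() and
 * nic_refill_temp_buf() are illustrative placeholders only. */
static void nic_rx_overload(struct net_device *dev, struct sk_buff *skb)
{
        mempool_t *conn_pool = classify_protected_flow(skb);

        if (conn_pool && upgrade_skb_to_pool(skb, conn_pool)) {
                /* Protected (e.g. swap-over-iSCSI) flow: now backed by the
                 * connection's own memory, so let it continue up the stack. */
                netif_receive_skb(skb);
        } else {
                kfree_skb(skb);         /* unprotected traffic is dropped under OOM */
        }
        /* Either way the temporary buffer goes straight back to the RX
         * ring, so the NIC always makes progress. */
        nic_refill_temp_buf(dev);
}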
> > "We have an enterprise class OS with iSCSI which can only
> > support four swap devices"
>
You forgot "carrier-grade", or is that supposed to conflict with
"enterprise class"? Can't you be both? ;->
> ;) I agree the hardware solution isn't appealing.
Well, if a Realtek NIC (read: el cheapo, commodity) has such features -
at a minimum (regardless of the iSCSI problem) - we need to support
those features.
cheers,
jamal
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 21:08 ` jamal
` (2 preceding siblings ...)
2005-03-30 15:22 ` Andi Kleen
@ 2005-03-30 17:07 ` Grant Grundler
3 siblings, 0 replies; 91+ messages in thread
From: Grant Grundler @ 2005-03-30 17:07 UTC (permalink / raw)
To: jamal
Cc: Dmitry Yusupov, Andi Kleen, James Bottomley, Rik van Riel, mpm,
andrea, michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Tue, Mar 29, 2005 at 04:08:32PM -0500, jamal wrote:
> On a slightly related topic: is SCSI (not iscsi) considered a reliable
> protocol?
Yes and No. "SCSI" covers several layers of the ISO networking model.
Parallel SCSI transport is reliable.
> If yes, why would you wanna run a reliable protocol inside another
> reliable protocol (TCP)?
Think of the SCSI command protocol more like NFS: just a way to send
commands/data to a "storage device".
grant
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
[not found] ` <E1DGSwp-0004ZE-00@thunker.thunk.org>
@ 2005-03-30 17:16 ` Grant Grundler
0 siblings, 0 replies; 91+ messages in thread
From: Grant Grundler @ 2005-03-30 17:16 UTC (permalink / raw)
To: Alex Aizman
Cc: open-iscsi, 'Rick Jones', 'netdev',
ksummit-2005-discuss
On Tue, Mar 29, 2005 at 06:28:25PM -0800, Alex Aizman wrote:
> If we continue to fall back into the TCP-will-eventually-recover mentality,
> iSCSI, or at least a "soft" non-offloaded iSCSI over regular non-TOE TCP,
> will not be able to compete with FC, which uses deterministic credit-based
> flow control. That non-determinism is the bigger issue, while a corner case
> like the swap device happening to be iSCSI-remote is just that, a corner case
> that helps to highlight and bring the general problem to the foreground.
There is no way to fix the "non-determinism" inherent in a transport.
DoS attacks depend on this. The transport has to be deterministic
(flow control, QoS) to avoid anything that looks like a DoS (e.g. OOM).
Parallel SCSI suffers the same problem. The priority on the transport
is dictated by SCSI ID. Low priority SCSI IDs can (and do) get starved
with as few as 5 RAID storage enclosures. The problem is the initiator
(host controller) can send data but then like iSCSI, the command times
out when the completion doesn't arrive. For parallel SCSI, the SCSI target
device never wins SCSI bus arbitration to send back the completion.
People can configure the system so this isn't a problem.
But it means not having as much storage per SCSI bus.
hth,
grant
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 15:22 ` Andi Kleen
2005-03-30 15:33 ` Andrea Arcangeli
@ 2005-03-30 17:24 ` Matt Mackall
2005-03-30 17:39 ` Dmitry Yusupov
1 sibling, 1 reply; 91+ messages in thread
From: Matt Mackall @ 2005-03-30 17:24 UTC (permalink / raw)
To: Andi Kleen
Cc: jamal, Dmitry Yusupov, James Bottomley, Rik van Riel, andrea,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, Mar 30, 2005 at 05:22:08PM +0200, Andi Kleen wrote:
> On Tue, Mar 29, 2005 at 04:08:32PM -0500, jamal wrote:
> > Sender is holding onto memory (retransmit queue i assume) waiting
> > for ACKs. Said sender is under OOM and therefore drops ACKs coming in
> > and as a result cant let go of these precious resource sitting on the
> > retransmit queue.
> > And iscsi cant wait long enough for someone else to release memory so
> > the ACKs can be delivered.
> > Did i capture this correctly?
>
> Or worse your swap device is on iscsi and you need the ACK to free
> memory.
>
> But that is unrealistic because it could only happen if 100% of
> your memory is dirty pages or filled up by other non VM users.
> Which I think is pretty unlikely. Normally the dirty limits in the VM
> should prevent it anyways - VM is supposed to block before all
> your memory is dirty. The CPU can still dirty pages in user space,
> but the cleaner should also clean it and if necessary block
> the process.
I seem to recall this being fairly easy to trigger by simply pulling
the network cable while there's heavy mmap + write load. The system
will quickly spiral down into OOM and will remain wedged when you plug
the network back in. With iSCSI, after some extended period all the
I/Os will have SCSI timeouts and lose everything.
It's going to be fairly typical for iSCSI boxes to do all their I/O
over iSCSI, including swap and root. Things like blades and cluster
nodes.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 17:24 ` Matt Mackall
@ 2005-03-30 17:39 ` Dmitry Yusupov
2005-03-30 20:10 ` Mike Christie
0 siblings, 1 reply; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-30 17:39 UTC (permalink / raw)
To: open-iscsi
Cc: Andi Kleen, jamal, James Bottomley, Rik van Riel, andrea,
michaelc, ksummit-2005-discuss, netdev
On Wed, 2005-03-30 at 09:24 -0800, Matt Mackall wrote:
> I seem to recall this being fairly easy to trigger by simply pulling
> the network cable while there's heavy mmap + write load. The system
> will quickly spiral down into OOM and will remain wedged when you plug
> the network back in. With iSCSI, after some extended period all the
> I/Os will have SCSI timeouts and lose everything.
We've discussed that already. SCSI timeout logic just doesn't fit (see
RFC 3720). For iSCSI, SCSI timeout logic *must* be disabled until iSCSI
recovery is complete. The host block/unblock logic in the recent iSCSI
transport patch will help to implement that.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 16:04 ` James Bottomley
@ 2005-03-30 17:48 ` H. Peter Anvin
0 siblings, 0 replies; 91+ messages in thread
From: H. Peter Anvin @ 2005-03-30 17:48 UTC (permalink / raw)
To: James Bottomley
Cc: Rik van Riel, Andrea Arcangeli, Andi Kleen, jamal, Dmitry Yusupov,
mpm, michaelc, open-iscsi, ksummit-2005-discuss, netdev
James Bottomley wrote:
> On Wed, 2005-03-30 at 10:50 -0500, Rik van Riel wrote:
>
>>Agreed, this is definately solvable. If people don't agree,
>>we should probably fight this out at the kernel summit. ;)
>
>
> So just to make this explicit, we kill this thread and lobby for a topic
> at the kernel summit discussing the problem of storage over net and the
> ways of solving it. Yes, I think that's a very good idea.
>
That was the whole start of this thread; not just one but two proposed
topics that amount to exactly this.
Yes, we need this topic.
-hpa
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 16:55 ` jamal
@ 2005-03-30 18:42 ` Rik van Riel
0 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2005-03-30 18:42 UTC (permalink / raw)
To: jamal
Cc: Andrea Arcangeli, Andi Kleen, Dmitry Yusupov, James Bottomley,
mpm, michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, 30 Mar 2005, jamal wrote:
> I think you may need to reserve some small amount of buffers per NIC
> (<= RX DMA ring size) that are used as temporary buffers before the
> decision is made to reassign to the higher priority memory or drop.
At that point you also know which of the higher
priority sockets the packet is for.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-29 23:13 ` jamal
2005-03-30 2:28 ` Alex Aizman
[not found] ` <E1DGSwp-0004ZE-00@thunker.thunk.org>
@ 2005-03-30 18:46 ` Dmitry Yusupov
2 siblings, 0 replies; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-30 18:46 UTC (permalink / raw)
To: open-iscsi; +Cc: Rick Jones, netdev, ksummit-2005-discuss
On Tue, 2005-03-29 at 18:13 -0500, jamal wrote:
> In any case i have seen horror stories of what happened to people
> who tried to encapsulate an already reliable protocol inside TCP in
> order to ship it across the big bad internet. I am pretty sure some
> knowledgeable people were involved in getting iscsi going so it cant be
> that simple. It would seem preferable to use SCTP instead.
A few problems I see with iSCSI over SCTP:
1) It is not as accelerated as TCP yet, but it will be eventually, so it
is a matter of time.
2) There is no working IETF draft for iSCSI over SCTP yet, AFAIK. That
will also take a few more years.
3) It is not widely used just yet.
Dima
^ permalink raw reply [flat|nested] 91+ messages in thread
* RE: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 16:15 ` Andrea Arcangeli
2005-03-30 16:55 ` jamal
@ 2005-03-30 19:28 ` Alex Aizman
2005-03-31 11:41 ` Andi Kleen
` (2 subsequent siblings)
4 siblings, 0 replies; 91+ messages in thread
From: Alex Aizman @ 2005-03-30 19:28 UTC (permalink / raw)
To: open-iscsi, 'Andi Kleen'
Cc: 'jamal', 'Dmitry Yusupov',
'James Bottomley', 'Rik van Riel', mpm, michaelc,
ksummit-2005-discuss, 'netdev'
> Andrea Arcangeli wrote:
>
> All I care about is to assign a mempool ID to the skb (ID
> being unique identifier for the tcp connection I don't care
> how the implementation is).
It makes sense to provide an API for the NIC driver to allocate skbs from
the *right* mempool. This way, if I have plenty of hw rings and/or can
allow myself the luxury of associating connection and ring 1-to-1, there's
a nice and clean memory management model. Even NICs that have only a few
rings could use this - for critical (e.g., storage) connections.
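One possible shape for such an API, purely as an illustration (the ring
context, its pool binding and ring_alloc_skb() are assumptions, not an
existing interface):

#include <linux/mempool.h>
#include <linux/skbuff.h>

/* Illustrative only: a per-RX-ring allocator bound to a connection mempool. */
struct rx_ring_ctx {
        mempool_t *pool;                /* NULL for the shared/default ring */
        unsigned int buf_size;
};

static struct sk_buff *ring_alloc_skb(struct rx_ring_ctx *ring)
{
        struct sk_buff *skb = dev_alloc_skb(ring->buf_size);

        if (skb || !ring->pool)
                return skb;
        /* Critical (e.g. storage) ring: fall back to its private reserve. */
        return mempool_alloc(ring->pool, GFP_ATOMIC);
}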
Alex
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 17:39 ` Dmitry Yusupov
@ 2005-03-30 20:10 ` Mike Christie
0 siblings, 0 replies; 91+ messages in thread
From: Mike Christie @ 2005-03-30 20:10 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: open-iscsi, Andi Kleen, jamal, James Bottomley, Rik van Riel,
andrea, ksummit-2005-discuss, netdev
Dmitry Yusupov wrote:
> On Wed, 2005-03-30 at 09:24 -0800, Matt Mackall wrote:
>
>>I seem to recall this being fairly easy to trigger by simply pulling
>>the network cable while there's heavy mmap + write load. The system
>>will quickly spiral down into OOM and will remain wedged when you plug
>>the network back in. With iSCSI, after some extended period all the
>>I/Os will have SCSI timeouts and lose everything.
>
>
> We've discussed that already. SCSI timeout logic just doesn't fit. (see
> rfc3720). For iSCSI, SCSI timeout logic *must* be disabled until iSCSI
> recovery is complete.
This actually came up when the scsi_times_out thread was going on, or at
least it sort of did. Some driver writers, including sfnet, used to play
a lot of tricks with the timers to accomplish this (this was the goal
for sfnet at least), and I do not think linux-scsi will allow it. Maybe
that will change.
> host block/unblock logic in recent iSCSI transport
> patch will help to implement that.
>
No it won't :( block/unblock does not disable timeouts; it just makes it
so new commands are not queued, and they time out when the driver knows
that the transport is hosed.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: Linux support for RDMA
2005-03-30 16:00 ` Benjamin LaHaise
@ 2005-03-31 1:08 ` H. Peter Anvin
0 siblings, 0 replies; 91+ messages in thread
From: H. Peter Anvin @ 2005-03-31 1:08 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: Roland Dreier, Dmitry Yusupov, open-iscsi, David S. Miller, mpm,
andrea, michaelc, James.Bottomley, ksummit-2005-discuss, netdev
Benjamin LaHaise wrote:
>
> I'm curious how the 10Gig ethernet market will pan out. Time and again
> the market has shown that ethernet always has the cost advantage in the
> end. If something like Intel's I/O Acceleration Technology makes it
> that much easier for commodity ethernet to achieve similar performance
> characteristics over ethernet to that of IB and fibre channel, the cost
> advantage alone might switch some new customers over. But the hardware
> isn't near what IB offers today, making IB an important niche filler.
>
From what I've seen coming down the pipe, I think 10GE is going to
eventually win over IB, just like previous generations did over Token
Ring, FDDI and other niche filler technologies. It doesn't, as you say,
mean that e.g. IB doesn't matter *now*; furthermore, it also matters for
the purpose of fixing the kind of issues that are going to have to be
fixed anyway.
-hpa
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 16:15 ` Andrea Arcangeli
2005-03-30 16:55 ` jamal
2005-03-30 19:28 ` Alex Aizman
@ 2005-03-31 11:41 ` Andi Kleen
2005-03-31 12:12 ` Rik van Riel
` (3 more replies)
2005-03-31 11:45 ` Andi Kleen
2005-03-31 11:50 ` Andi Kleen
4 siblings, 4 replies; 91+ messages in thread
From: Andi Kleen @ 2005-03-31 11:41 UTC (permalink / raw)
To: Alex Aizman
Cc: open-iscsi, 'jamal', 'Dmitry Yusupov',
'James Bottomley', 'Rik van Riel', mpm, michaelc,
ksummit-2005-discuss, 'netdev'
On Wed, Mar 30, 2005 at 11:28:07AM -0800, Alex Aizman wrote:
> > Andrea Arcangeli wrote:
> >
> > All I care about is to assign a mempool ID to the skb (ID
> > being unique identifier for the tcp connection I don't care
> > how the implementation is).
>
> It makes sense to provide an API for the NIC driver to allocate skb from the
> *right* mempool. This way if I have plenty of hw rings and/or can allow
> myself a luxury to associate 1-to-1 connection and ring, there's a nice and
> clean memory management model. Even NICs that have only few rings could use
> this - for critical (e.g., storage) connections.
It won't work - I can guarantee you that if you add a limit like
"we only support 8 iSCSI connections max" then users/customers will raise
hell because it does not fit their networks.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 16:15 ` Andrea Arcangeli
` (2 preceding siblings ...)
2005-03-31 11:41 ` Andi Kleen
@ 2005-03-31 11:45 ` Andi Kleen
2005-03-31 11:50 ` Andi Kleen
4 siblings, 0 replies; 91+ messages in thread
From: Andi Kleen @ 2005-03-31 11:45 UTC (permalink / raw)
To: Alex Aizman
Cc: open-iscsi, 'jamal', 'Dmitry Yusupov',
'James Bottomley', 'Rik van Riel', mpm, michaelc,
ksummit-2005-discuss, 'netdev'
On Wed, Mar 30, 2005 at 11:28:07AM -0800, Alex Aizman wrote:
> > Andrea Arcangeli wrote:
> >
> > All I care about is to assign a mempool ID to the skb (ID
> > being unique identifier for the tcp connection I don't care
> > how the implementation is).
>
> It makes sense to provide an API for the NIC driver to allocate skb from the
That would pretty much need all of the infrastructure for zero copy
RX - add all the hooks to update the device driver on local socket
hashtable updates. I am sure we will need it at some point and it makes
sense, but I suspect it is quite complex work.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-30 16:15 ` Andrea Arcangeli
` (3 preceding siblings ...)
2005-03-31 11:45 ` Andi Kleen
@ 2005-03-31 11:50 ` Andi Kleen
2005-03-31 17:09 ` Andrea Arcangeli
4 siblings, 1 reply; 91+ messages in thread
From: Andi Kleen @ 2005-03-31 11:50 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: jamal, Dmitry Yusupov, James Bottomley, Rik van Riel, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Wed, Mar 30, 2005 at 06:15:22PM +0200, Andrea Arcangeli wrote:
> On Wed, Mar 30, 2005 at 06:02:55PM +0200, Andi Kleen wrote:
> > On Wed, Mar 30, 2005 at 05:44:18PM +0200, Andrea Arcangeli wrote:
> > > On Wed, Mar 30, 2005 at 05:39:48PM +0200, Andi Kleen wrote:
> > > > An unsolveable one IMHO. You can just try to be good enough. For that
> > >
> > > I think it's solvable with an algorithm I outlined several emails ago.
> >
> > The problem with your algorithm is that you cannot control
> > how the NIC puts incoming packets into RX rings (and then
> > whether the packets you are interested in actually arrive from
> > the net ;-)
>
> All I care about is to assign a mempool ID to the skb (ID being unique
> identifier for the tcp connection I don't care how the implementation
> is). If while moving up the stack the skb data doesn't match to the
> sock->mempool id, we'll just free the packet and put it back in the
> mempool.
This could still starve on the RX ring level of the hardware, which
you can't control.
But it might be an improvement, agreed. The problem is that you
need lots of infrastructure to tell the driver about TCP connections -
it is pretty much all the work needed for zero-copy RX.
Even with all that work it is not the 100% solution some people on this
thread seem to be lusting for.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 11:41 ` Andi Kleen
@ 2005-03-31 12:12 ` Rik van Riel
2005-03-31 18:59 ` Andi Kleen
2005-03-31 15:35 ` Grant Grundler
` (2 subsequent siblings)
3 siblings, 1 reply; 91+ messages in thread
From: Rik van Riel @ 2005-03-31 12:12 UTC (permalink / raw)
To: Andi Kleen
Cc: Alex Aizman, open-iscsi, 'jamal',
'Dmitry Yusupov', 'James Bottomley', mpm,
michaelc, ksummit-2005-discuss, 'netdev'
On Thu, 31 Mar 2005, Andi Kleen wrote:
> It wont work - I can guarantee you that if you add a limit like
> "we only support 8 iscsi connections max" then users/customers will raise
> hell because it does not fit their networks.
What would prevent the iSCSI driver from telling the network
stack to increase the size of the mempools when additional
iSCSI connections are configured?
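For what it's worth, mempool_resize() already exists for this kind of
adjustment. A sketch of the idea (the per-session sizing and the helper
are invented, and the 2.6-era mempool_resize() took a gfp_mask argument
that later kernels dropped):

#include <linux/mempool.h>

#define SKBS_PER_SESSION 16     /* arbitrary illustration, not a tuned value */

/* Sketch: grow the shared reserve when another iSCSI session is configured. */
static int iscsi_grow_reserve(mempool_t *pool, int nr_sessions)
{
        return mempool_resize(pool, nr_sessions * SKBS_PER_SESSION, GFP_KERNEL);
}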
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 11:41 ` Andi Kleen
2005-03-31 12:12 ` Rik van Riel
@ 2005-03-31 15:35 ` Grant Grundler
2005-03-31 19:15 ` Alex Aizman
2005-03-31 19:34 ` Andi Kleen
3 siblings, 0 replies; 91+ messages in thread
From: Grant Grundler @ 2005-03-31 15:35 UTC (permalink / raw)
To: Andi Kleen
Cc: Alex Aizman, open-iscsi, 'jamal',
'Dmitry Yusupov', 'James Bottomley',
'Rik van Riel', mpm, michaelc, ksummit-2005-discuss,
'netdev'
On Thu, Mar 31, 2005 at 01:41:22PM +0200, Andi Kleen wrote:
> It wont work - I can guarantee you that if you add a limit like
> "we only support 8 iscsi connections max" then users/customers will raise
> hell because it does not fit their networks.
HP has been doing that for years (decades?) for parallel SCSI
in "High Availability Configuration Guides". It lays out exactly
what is and isn't supported. I'm sure other vendors have similar
restrictions. As long as the product is still reasonably useful
and the vendor provides a solid assurance it will work, such
configuration restrictions are quite acceptable.
I'm NOT arguing "8 iSCSI connections max" is reasonable or enough.
I'm just arguing some sort of limit is acceptable.
grant
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 11:50 ` Andi Kleen
@ 2005-03-31 17:09 ` Andrea Arcangeli
2005-03-31 22:05 ` Dmitry Yusupov
0 siblings, 1 reply; 91+ messages in thread
From: Andrea Arcangeli @ 2005-03-31 17:09 UTC (permalink / raw)
To: Andi Kleen
Cc: jamal, Dmitry Yusupov, James Bottomley, Rik van Riel, mpm,
michaelc, open-iscsi, ksummit-2005-discuss, netdev
On Thu, Mar 31, 2005 at 01:50:12PM +0200, Andi Kleen wrote:
> This could still starve on the RX ring level of the hardware which
> you cant control.
It may be inefficient in the recovery, but the point is that it can
recover.
> But it might be an improvement, agreed. The problem is that you
> need lots of infrastructure to tell the driver about TCP connections -
> it is pretty much near all the work needed for zero copy RX.
The driver only needs to have a ring of mempools attached; OK, each one
is attached to a TCP connection, but the driver won't be required to
parse the TCP/IP headers. After GFP_ATOMIC fails, the driver interrupt
handler will pick a skb from a random mempool.
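A sketch of that driver-side fallback, assuming a hypothetical per-device
array of per-connection mempools and the skb->mempool tag discussed
above (none of this exists in the stock kernel):

#include <linux/mempool.h>
#include <linux/skbuff.h>

/* Illustrative per-device view of the protected connections' mempools. */
struct netdev_pools {
        mempool_t **pools;              /* one per protected TCP connection */
        unsigned int nr;
        unsigned int next;              /* "random" pick is just round robin here */
};

static struct sk_buff *refill_rx_skb(struct netdev_pools *np, unsigned int len)
{
        struct sk_buff *skb = dev_alloc_skb(len);
        mempool_t *pool;

        if (skb || !np->nr)
                return skb;             /* normal path, or nothing to fall back on */
        pool = np->pools[np->next++ % np->nr];
        skb = mempool_alloc(pool, GFP_ATOMIC);
        if (skb)
                skb->mempool = pool;    /* hypothetical tag, checked after TCP demux */
        return skb;
}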
> Even with all that work it is not the 100% solution some people on this thread
> seem to be lusting for.
I thought it was more than enough; all they care about is not
deadlocking anymore. I don't think anybody cares about the performance
of the deadlock scenario.
I agree with Jamal that his suggestion to use a high-priority ring is
very good (I didn't even know some cards supported this feature), so if
somebody wants the deadlock scenario not to run in "degraded mode", they
will have to use more advanced hardware the way Jamal is suggesting
(or get rid of TCP altogether and use TCP/IP offload with the security
risks it introduces, or RDMA, or whatever other point-to-point high-perf
DMA technology like Quadrics etc.).
I suspect the deadlock scenario is infrequent enough that it won't
matter how fast it recovers as long as it eventually does.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 12:12 ` Rik van Riel
@ 2005-03-31 18:59 ` Andi Kleen
2005-03-31 19:04 ` Rik van Riel
0 siblings, 1 reply; 91+ messages in thread
From: Andi Kleen @ 2005-03-31 18:59 UTC (permalink / raw)
To: Rik van Riel
Cc: Alex Aizman, open-iscsi, 'jamal',
'Dmitry Yusupov', 'James Bottomley', mpm,
michaelc, ksummit-2005-discuss, 'netdev'
On Thu, Mar 31, 2005 at 07:12:22AM -0500, Rik van Riel wrote:
> On Thu, 31 Mar 2005, Andi Kleen wrote:
>
> > It wont work - I can guarantee you that if you add a limit like
> > "we only support 8 iscsi connections max" then users/customers will raise
> > hell because it does not fit their networks.
>
> What would prevent the iscsi driver from telling the network
> stack to increase the size of the mempools when additional
> iscsi connections are configured ?
I was talking about the hardware limits for early filtering,
not the size of the mempools.
All hardware I found so far has small limits like this.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 18:59 ` Andi Kleen
@ 2005-03-31 19:04 ` Rik van Riel
0 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2005-03-31 19:04 UTC (permalink / raw)
To: Andi Kleen
Cc: Alex Aizman, open-iscsi, 'jamal',
'Dmitry Yusupov', 'James Bottomley', mpm,
michaelc, ksummit-2005-discuss, 'netdev'
On Thu, 31 Mar 2005, Andi Kleen wrote:
> I was talking about the hardware limits for early filtering,
> not the size of the mempools.
That should not be an issue, since packets in those
buffers don't stick around, but can be thrown away
at will.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* RE: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 11:41 ` Andi Kleen
2005-03-31 12:12 ` Rik van Riel
2005-03-31 15:35 ` Grant Grundler
@ 2005-03-31 19:15 ` Alex Aizman
2005-03-31 19:34 ` Andi Kleen
3 siblings, 0 replies; 91+ messages in thread
From: Alex Aizman @ 2005-03-31 19:15 UTC (permalink / raw)
To: 'Andi Kleen'
Cc: open-iscsi, 'jamal', 'Dmitry Yusupov',
'James Bottomley', 'Rik van Riel', mpm, michaelc,
ksummit-2005-discuss, 'netdev'
> Andi Kleen wrote:
>
> > It makes sense to provide an API for the NIC driver to allocate skb
> > from the
> > *right* mempool. This way if I have plenty of hw rings and/or can
> > allow myself a luxury to associate 1-to-1 connection and
> ring, there's
> > a nice and clean memory management model. Even NICs that
> have only few
> > rings could use this - for critical (e.g., storage) connections.
>
> It wont work - I can guarantee you that if you add a limit
> like "we only support 8 iscsi connections max" then
> users/customers will raise hell because it does not fit their
> networks.
>
Something a bit more intelligent, like: we only support 7 resource-protected
(a.k.a. critical) iSCSI connections, and we use the one remaining ring for the
rest of the iSCSI, TCP, UDP, etc. traffic. The 7 iSCSI connections could be
quite a bit, in terms of LUNs, and just enough for a customer to feel
"protected", in the sense that an unrelated receive burst won't starve storage
traffic to death. Note that "8 rings" here is just an example; as time goes by
the number of hw receive rings and the hw ability to intelligently classify
and steer traffic onto these rings will only increase.
Alex
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 11:41 ` Andi Kleen
` (2 preceding siblings ...)
2005-03-31 19:15 ` Alex Aizman
@ 2005-03-31 19:34 ` Andi Kleen
2005-03-31 19:39 ` Rik van Riel
3 siblings, 1 reply; 91+ messages in thread
From: Andi Kleen @ 2005-03-31 19:34 UTC (permalink / raw)
To: Alex Aizman
Cc: open-iscsi, 'jamal', 'Dmitry Yusupov',
'James Bottomley', 'Rik van Riel', mpm, michaelc,
ksummit-2005-discuss, 'netdev'
> Something a bit more intelligent, like: we only support 7 resource-protected
> (a.k.a. critical) iSCSI connection, and we use one remaining ring for the
> rest iSCSI, TCP, UDP, etc. traffic. The 7 iSCSI connections could be quite a
> bit, in terms of LUNs, and just enough for a customer to feel "protected" in
> a sense that unrelated receive burst starves storage traffic to death. Note
> that "8 rings" here is just an example; as time goes by the number of hw
> receive rings and the hw ability to intelligently classify and steer traffic
> onto these rings will only increase.
[assuming you want to solve the OOM deadlock 100%, which I claim is
not practicable. But let's pretend it would be possible with hardware
classification support:]
This does not work, because any writable file system can in theory be
an OOM deadlock and would need to be resource protected. Someone just
needs to mmap a file on it and dirty enough pages in it to fill up system
memory and make the system deadlock while trying to clean pages
to get free memory.
The only "safe" fs that would work is a read-only fs where there can't be
any dirty pages, but that would be a rather hard restriction to sell.
-Andi
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 19:34 ` Andi Kleen
@ 2005-03-31 19:39 ` Rik van Riel
0 siblings, 0 replies; 91+ messages in thread
From: Rik van Riel @ 2005-03-31 19:39 UTC (permalink / raw)
To: Andi Kleen
Cc: Alex Aizman, open-iscsi, 'jamal',
'Dmitry Yusupov', 'James Bottomley', mpm,
michaelc, ksummit-2005-discuss, 'netdev'
On Thu, 31 Mar 2005, Andi Kleen wrote:
> This does not work, because any writable file system can be in theory
> an OOM deadlock and would need to be resource protected. Just someone needs
> to mmap a file on it and dirty enough pages in it to full up system
> memory and make the system deadlock while trying to clean pages
> to get free memory.
You cannot fill up the mempools with dirty pages, which
makes sure we don't deadlock.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-31 17:09 ` Andrea Arcangeli
@ 2005-03-31 22:05 ` Dmitry Yusupov
0 siblings, 0 replies; 91+ messages in thread
From: Dmitry Yusupov @ 2005-03-31 22:05 UTC (permalink / raw)
To: open-iscsi
Cc: Andi Kleen, jamal, James Bottomley, Rik van Riel, mpm, michaelc,
ksummit-2005-discuss, netdev
On Thu, 2005-03-31 at 19:09 +0200, Andrea Arcangeli wrote:
> > Even with all that work it is not the 100% solution some people on this thread
> > seem to be lusting for.
>
> I thought it was more than enough, all they care about is not to
> deadlock anymore, I don't think anybody cares about the performance of
> the deadlock-scenario.
True. This is all we need to make "soft" iSCSI a viable alternative to FC.
Btw, other OSes can do that today.
> I agree with Jamal that his suggestion to use an high-per ring is
> very good (I didn't even know some card supported this feature), so if
> somebody wants the deadlock scenario not to run in "degraded mode", they
> will have to use some more advanced hardware the way Jamal is suggesting
> (or get rid of TCP all together and use TCP/IP offload with the security
> risks it introduces or RDMA or whatever other point to point high perf
> DMA technology like quadrix etc..).
One good example is the Neterion 10Gbps card. It supports up to 8 priority
rings. Could someone point me to the API which a driver could use to
configure ring priorities with 2.6.x? Thanks.
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: Linux support for RDMA
@ 2005-04-01 1:49 jaganav
2005-04-01 1:57 ` H. Peter Anvin
0 siblings, 1 reply; 91+ messages in thread
From: jaganav @ 2005-04-01 1:49 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Roland Dreier, Dmitry Yusupov, open-iscsi, David S. Miller, mpm,
andrea, michaelc, James.Bottomley, ksummit-2005-discuss, netdev,
Benjamin LaHaise
Quoting "H. Peter Anvin" <hpa@zytor.com>:
> Benjamin LaHaise wrote:
> >
> > I'm curious how the 10Gig ethernet market will pan out. Time and again
> > the market has shown that ethernet always has the cost advantage in the
> > end. If something like Intel's I/O Acceleration Technology makes it
> > that much easier for commodity ethernet to achieve similar performance
> > characteristics over ethernet to that of IB and fibre channel, the cost
> > advantage alone might switch some new customers over. But the hardware
> > isn't near what IB offers today, making IB an important niche filler.
> >
>
> From what I've seen coming down the pipe, I think 10GE is going to
> eventually win over IB, just like previous generations did over Token
> Ring, FDDI and other niche filler technologies. It doesn't, as you say,
> mean that e.g. IB doesn't matter *now*; furthermore, it also matters for
> the purpose of fixing the kind of issues that are going to have to be
> fixed anyway.
>
> -hpa
>
>
>
No doubt, Ethernet will eventually win. Btw, hasn't history proven this with
ATM? More specifically, when the industry predicted that ATM would replace
Ethernet :)
However, I'll have to agree with Ben that IB technology will fill an important
niche segment, more specifically in the low end of the High Performance
Computing (HPC) segment, which is currently in transition, moving away from
proprietary interconnects to industry-standards-based IB technology. Even
though Ethernet may eventually catch up with IB in terms of bandwidth, IB
fabrics can offer better latencies.
Thanks
Venkat
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: Linux support for RDMA
2005-04-01 1:49 Linux support for RDMA jaganav
@ 2005-04-01 1:57 ` H. Peter Anvin
0 siblings, 0 replies; 91+ messages in thread
From: H. Peter Anvin @ 2005-04-01 1:57 UTC (permalink / raw)
To: jaganav
Cc: Roland Dreier, Dmitry Yusupov, open-iscsi, David S. Miller, mpm,
andrea, michaelc, James.Bottomley, ksummit-2005-discuss, netdev,
Benjamin LaHaise
jaganav@us.ibm.com wrote:
>
> No doubt, Ethernet will eventually win .. btw, Hasn't history proven this over
> ATM? More specifically when the industry predicted that ATM will replace
> ethernet :)
>
> However, I'll have to agree with Ben that IB technolgy will fill an important
> niche segment, more specifically so in the low end of High Performance Computing
> (HPC) segment which is in a transition mode currently moving away from
> proprietary interconnects to industry standards based IB technology. Eventhough,
> ethernet may eventually may catch up with IB in terms of the bandwidth but IB
> fabrics can offer better latencies.
>
We've seen this over and over... Token Ring, FDDI, ATM, IB, ... all of
them "better" than the Ethernet of the day, but eventually
commoditization wins out. With 10GE, Ethernet has finally stopped
pretending to be CSMA/CD even; "Ethernet" is now really nothing more
than a collective name for a set of somewhat compatible commodity
networking technologies.
-hpa
^ permalink raw reply [flat|nested] 91+ messages in thread
* RE: Linux support for RDMA
@ 2005-04-01 23:50 Asgeir Eiriksson
2005-04-02 0:02 ` Dmitry Yusupov
0 siblings, 1 reply; 91+ messages in thread
From: Asgeir Eiriksson @ 2005-04-01 23:50 UTC (permalink / raw)
To: jaganav, H. Peter Anvin
Cc: Roland Dreier, Dmitry Yusupov, open-iscsi, David S. Miller, mpm,
andrea, michaelc, James.Bottomley, ksummit-2005-discuss, netdev,
Benjamin LaHaise
Venkat
Your assessment of the IB vs. Ethernet latencies isn't necessarily
correct.
- you already have available low latency 10GE switches (< 1us
port-to-port)
- you already have available low latency (cut-through processing) 10GE
TOE engines
The Veritest verified 10GE TOE end-to-end latency is < 10us today
(end-to-end being from a Linux user-space-process to a Linux
user-space-process through a switch; full report with detail of the
setup is available at
http://www.chelsio.com/technology/Chelsio10GbE_Fujitsu.pdf)
For comparison: the published IB latency numbers are around 5us today,
those use a polling receiver, and they don't include the context
switch(es) that the Ethernet number quoted above does.
'Asgeir
> -----Original Message-----
> From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com] On
> Behalf Of jaganav@us.ibm.com
> Sent: Thursday, March 31, 2005 5:49 PM
> To: H. Peter Anvin
> Cc: Roland Dreier; Dmitry Yusupov; open-iscsi@googlegroups.com; David
S.
> Miller; mpm@selenic.com; andrea@suse.de; michaelc@cs.wisc.edu;
> James.Bottomley@HansenPartnership.com; ksummit-2005-discuss@thunk.org;
> netdev@oss.sgi.com; Benjamin LaHaise
> Subject: Re: Linux support for RDMA
>
> Quoting "H. Peter Anvin" <hpa@zytor.com>:
> > Benjamin LaHaise wrote:
> > >
> > > I'm curious how the 10Gig ethernet market will pan out. Time and
> again
> > > the market has shown that ethernet always has the cost advantage
in
> the
> > > end. If something like Intel's I/O Acceleration Technology makes
it
> > > that much easier for commodity ethernet to achieve similar
performance
> > > characteristics over ethernet to that of IB and fibre channel, the
> cost
> > > advantage alone might switch some new customers over. But the
> hardware
> > > isn't near what IB offers today, making IB an important niche
filler.
> > >
> >
> > From what I've seen coming down the pipe, I think 10GE is going to
> > eventually win over IB, just like previous generations did over
Token
> > Ring, FDDI and other niche filler technologies. It doesn't, as you
say,
> > mean that e.g. IB doesn't matter *now*; furthermore, it also matters
for
> > the purpose of fixing the kind of issues that are going to have to
be
> > fixed anyway.
> >
> > -hpa
> >
> >
> >
>
> No doubt, Ethernet will eventually win .. btw, Hasn't history proven
this
> over
> ATM? More specifically when the industry predicted that ATM will
replace
> ethernet :)
>
> However, I'll have to agree with Ben that IB technolgy will fill an
> important
> niche segment, more specifically so in the low end of High Performance
> Computing
> (HPC) segment which is in a transition mode currently moving away from
> proprietary interconnects to industry standards based IB technology.
> Eventhough,
> ethernet may eventually may catch up with IB in terms of the bandwidth
but
> IB
> fabrics can offer better latencies.
>
> Thanks
> Venkat
^ permalink raw reply [flat|nested] 91+ messages in thread
* RE: Linux support for RDMA
2005-04-01 23:50 Asgeir Eiriksson
@ 2005-04-02 0:02 ` Dmitry Yusupov
0 siblings, 0 replies; 91+ messages in thread
From: Dmitry Yusupov @ 2005-04-02 0:02 UTC (permalink / raw)
To: Asgeir Eiriksson
Cc: jaganav, H. Peter Anvin, Roland Dreier, open-iscsi,
David S. Miller, mpm, andrea, michaelc, James.Bottomley,
ksummit-2005-discuss, netdev, Benjamin LaHaise
On Fri, 2005-04-01 at 15:50 -0800, Asgeir Eiriksson wrote:
> Venkat
>
> Your assessment of the IB vs. Ethernet latencies isn't necessarily
> correct.
> - you already have available low latency 10GE switches (< 1us
> port-to-port)
> - you already have available low latency (cut-through processing) 10GE
> TOE engines
>
> The Veritest verified 10GE TOE end-to-end latency is < 10us today
> (end-to-end being from a Linux user-space-process to a Linux
> user-space-process through a switch; full report with detail of the
> setup is available at
> http://www.chelsio.com/technology/Chelsio10GbE_Fujitsu.pdf)
>
> For comparison: the published IB latency numbers are around 5us today
> and those use a polling receiver, and those don't include a context
> switch(es) as does the Ethernet number quoted above.
Yep, I have to agree here. On a 10Gbps network, latency numbers are
around 5-15us. Even with a non-TOE card, I managed to get 13us latency
with the regular TCP/IP stack.
[root@localhost root]# ./nptcp -a -t -l 256 -u 98304 -i 256 -p 5100 -P -h 17.1.1.227
Latency: 0.000013
Now starting main loop
0: 256 bytes 7 times --> 131.37 Mbps in 0.000015 sec
1: 512 bytes 65 times --> 239.75 Mbps in 0.000016 sec
Dima
> 'Asgeir
>
>
> > -----Original Message-----
> > From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com] On
> > Behalf Of jaganav@us.ibm.com
> > Sent: Thursday, March 31, 2005 5:49 PM
> > To: H. Peter Anvin
> > Cc: Roland Dreier; Dmitry Yusupov; open-iscsi@googlegroups.com; David
> S.
> > Miller; mpm@selenic.com; andrea@suse.de; michaelc@cs.wisc.edu;
> > James.Bottomley@HansenPartnership.com; ksummit-2005-discuss@thunk.org;
> > netdev@oss.sgi.com; Benjamin LaHaise
> > Subject: Re: Linux support for RDMA
> >
> > Quoting "H. Peter Anvin" <hpa@zytor.com>:
> > > Benjamin LaHaise wrote:
> > > >
> > > > I'm curious how the 10Gig ethernet market will pan out. Time and
> > again
> > > > the market has shown that ethernet always has the cost advantage
> in
> > the
> > > > end. If something like Intel's I/O Acceleration Technology makes
> it
> > > > that much easier for commodity ethernet to achieve similar
> performance
> > > > characteristics over ethernet to that of IB and fibre channel, the
> > cost
> > > > advantage alone might switch some new customers over. But the
> > hardware
> > > > isn't near what IB offers today, making IB an important niche
> filler.
> > > >
> > >
> > > From what I've seen coming down the pipe, I think 10GE is going to
> > > eventually win over IB, just like previous generations did over
> Token
> > > Ring, FDDI and other niche filler technologies. It doesn't, as you
> say,
> > > mean that e.g. IB doesn't matter *now*; furthermore, it also matters
> for
> > > the purpose of fixing the kind of issues that are going to have to
> be
> > > fixed anyway.
> > >
> > > -hpa
> > >
> > >
> > >
> >
> > No doubt, Ethernet will eventually win .. btw, Hasn't history proven
> this
> > over
> > ATM? More specifically when the industry predicted that ATM will
> replace
> > ethernet :)
> >
> > However, I'll have to agree with Ben that IB technolgy will fill an
> > important
> > niche segment, more specifically so in the low end of High Performance
> > Computing
> > (HPC) segment which is in a transition mode currently moving away from
> > proprietary interconnects to industry standards based IB technology.
> > Eventhough,
> > ethernet may eventually may catch up with IB in terms of the bandwidth
> but
> > IB
> > fabrics can offer better latencies.
> >
> > Thanks
> > Venkat
>
>
>
>
^ permalink raw reply [flat|nested] 91+ messages in thread
* RE: Linux support for RDMA
@ 2005-04-02 1:59 jaganav
0 siblings, 0 replies; 91+ messages in thread
From: jaganav @ 2005-04-02 1:59 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: Asgeir Eiriksson, H. Peter Anvin, Roland Dreier, open-iscsi,
David S. Miller, mpm, andrea, michaelc, James.Bottomley,
ksummit-2005-discuss, netdev, Benjamin LaHaise
Quoting Dmitry Yusupov <dima@neterion.com>:
> On Fri, 2005-04-01 at 15:50 -0800, Asgeir Eiriksson wrote:
> > Venkat
> >
> > Your assessment of the IB vs. Ethernet latencies isn't necessarily
> > correct.
> > - you already have available low latency 10GE switches (< 1us
> > port-to-port)
> > - you already have available low latency (cut-through processing) 10GE
> > TOE engines
> >
> > The Veritest verified 10GE TOE end-to-end latency is < 10us today
> > (end-to-end being from a Linux user-space-process to a Linux
> > user-space-process through a switch; full report with detail of the
> > setup is available at
> > http://www.chelsio.com/technology/Chelsio10GbE_Fujitsu.pdf)
> >
> > For comparison: the published IB latency numbers are around 5us today
> > and those use a polling receiver, and those don't include a context
> > switch(es) as does the Ethernet number quoted above.
>
> yep. I should agree in here. On 10Gbps network latencies numbers are
> around 5-15us. Even with non-TOE card, I managed to get 13us latency
> with regular TCP/IP stack.
>
> [root@localhost root]# ./nptcp -a -t -l 256 -u 98304 -i 256 -p 5100 -P - h
> 17.1.1.227
> Latency: 0.000013
> Now starting main loop
> 0: 256 bytes 7 times --> 131.37 Mbps in 0.000015 sec
> 1: 512 bytes 65 times --> 239.75 Mbps in 0.000016 sec
>
> Dima
When I mentioned latency, the measurement is from
end-to-end (i.e. from app to app), not just the
switching or port-to-port latencies.
With IB, I have seen the best numbers ranging from
5 to 7us, which is far better than Ethernet today
(15 to 35us) with the network we have. I am not
denying the fact that Ethernet is trying to close the
gap here, but IB has a relative advantage now.
Good to see you have got 5us in one case, but what were
the switch and adapter latencies in that case?
Thanks
Venkat
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-03-28 22:32 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Benjamin LaHaise
2005-03-29 3:19 ` Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Roland Dreier
@ 2005-04-02 18:08 ` Dmitry Yusupov
2005-04-02 19:13 ` Ming Zhang
` (2 more replies)
1 sibling, 3 replies; 91+ messages in thread
From: Dmitry Yusupov @ 2005-04-02 18:08 UTC (permalink / raw)
To: open-iscsi@googlegroups.com
Cc: David S. Miller, mpm, andrea, michaelc, James.Bottomley,
ksummit-2005-discuss, netdev
On Mon, 2005-03-28 at 17:32 -0500, Benjamin LaHaise wrote:
> On Mon, Mar 28, 2005 at 12:48:56PM -0800, Dmitry Yusupov wrote:
> > If you have plans to start new project such as SoftRDMA than yes. lets
> > discuss it since set of problems will be similar to what we've got with
> > software iSCSI Initiators.
>
> I'm somewhat interested in seeing a SoftRDMA project get off the ground.
> At least the NatSemi 83820 gige MAC is able to provide early-rx interrupts
> that allow one to get an rx interrupt before the full payload has arrived
> making it possible to write out a new rx descriptor to place the payload
> wherever it is ultimately desired. It would be fun to work on if not the
> most performant RDMA implementation.
I see a lot of skepticism around the early-rx interrupt scheme. It might
work for gige, but I'm not sure it will fit 10g.
What RDMA gives us is zero-copy on receive and a new networking API which
has the potential to be HW accelerated. SoftRDMA will never avoid copying
on receive. But the benefit of SoftRDMA would be its availability on the
client side. It is free and it could be easily deployed. Soon Intel & Co
will give us 2,4,8... multi-core CPUs for around $200 :). So, who cares if
one of those cores does receive-side copying?
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-04-02 18:08 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Dmitry Yusupov
@ 2005-04-02 19:13 ` Ming Zhang
2005-04-04 6:31 ` Grant Grundler
2005-04-04 18:57 ` Rick Jones
2 siblings, 0 replies; 91+ messages in thread
From: Ming Zhang @ 2005-04-02 19:13 UTC (permalink / raw)
To: open-iscsi
Cc: David S. Miller, mpm, andrea, michaelc, James.Bottomley,
ksummit-2005-discuss, netdev
On Sat, 2005-04-02 at 13:08, Dmitry Yusupov wrote:
> On Mon, 2005-03-28 at 17:32 -0500, Benjamin LaHaise wrote:
> > On Mon, Mar 28, 2005 at 12:48:56PM -0800, Dmitry Yusupov wrote:
> > > If you have plans to start new project such as SoftRDMA than yes. lets
> > > discuss it since set of problems will be similar to what we've got with
> > > software iSCSI Initiators.
> >
> > I'm somewhat interested in seeing a SoftRDMA project get off the ground.
> > At least the NatSemi 83820 gige MAC is able to provide early-rx interrupts
> > that allow one to get an rx interrupt before the full payload has arrived
> > making it possible to write out a new rx descriptor to place the payload
> > wherever it is ultimately desired. It would be fun to work on if not the
> > most performant RDMA implementation.
>
> I see a lot of skepticism around early-rx interrupt schema. It might
> work for gige, but i'm not sure if it will fit into 10g.
>
> What RDMA gives us is zero-copy on receive and new networking api which
> has a potential to be HW accelerated. SoftRDMA will never avoid copying
> on receive. But benefit for SoftRDMA would be its availability on client
> sides. It is free and it could be easily deployed. Soon Intel & Co will
> give us 2,4,8... multi-core CPUs for around 200$ :), So, who cares if
> one of those cores will do receive side copying?
>
A dedicated core for dealing with interrupts is fine, but the memory
bandwidth is still over-used, right?
ming
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-04-02 18:08 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Dmitry Yusupov
2005-04-02 19:13 ` Ming Zhang
@ 2005-04-04 6:31 ` Grant Grundler
2005-04-04 18:57 ` Rick Jones
2 siblings, 0 replies; 91+ messages in thread
From: Grant Grundler @ 2005-04-04 6:31 UTC (permalink / raw)
To: Dmitry Yusupov
Cc: open-iscsi@googlegroups.com, David S. Miller, mpm, andrea,
michaelc, James.Bottomley, ksummit-2005-discuss, netdev
On Sat, Apr 02, 2005 at 10:08:37AM -0800, Dmitry Yusupov wrote:
> So, who cares if one of those cores will do receive side copying?
It burns backplane bandwidth that could be used for other things.
The problem isn't the CPU cycles. It's the number
of times the data has to cross the memory bus.
grant
^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics
2005-04-02 18:08 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Dmitry Yusupov
2005-04-02 19:13 ` Ming Zhang
2005-04-04 6:31 ` Grant Grundler
@ 2005-04-04 18:57 ` Rick Jones
2 siblings, 0 replies; 91+ messages in thread
From: Rick Jones @ 2005-04-04 18:57 UTC (permalink / raw)
Cc: open-iscsi@googlegroups.com, ksummit-2005-discuss, netdev
> What RDMA gives us is zero-copy on receive and new networking api which
> has a potential to be HW accelerated. SoftRDMA will never avoid copying
> on receive. But benefit for SoftRDMA would be its availability on client
> sides. It is free and it could be easily deployed. Soon Intel & Co will
> give us 2,4,8... multi-core CPUs for around 200$ :), So, who cares if
> one of those cores will do receive side copying?
20 years ago, in certain circles at least, people were saying "With 32 bits
of addressing, who cares how much memory we allocate" :)
Speaking a bit more prosaically, if that core is sitting there churning
through data copies, what effect does that have on the rest of the bus(es)
and the memory? What else will the client want to be able to push around
that those data copies may preclude?
rick jones
^ permalink raw reply [flat|nested] 91+ messages in thread
end of thread, other threads:[~2005-04-04 18:57 UTC | newest]
Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <4241D106.8050302@cs.wisc.edu>
[not found] ` <20050324101622S.fujita.tomonori@lab.ntt.co.jp>
[not found] ` <1111628393.1548.307.camel@beastie>
[not found] ` <20050324113312W.fujita.tomonori@lab.ntt.co.jp>
[not found] ` <1111633846.1548.318.camel@beastie>
[not found] ` <20050324215922.GT14202@opteron.random>
[not found] ` <424346FE.20704@cs.wisc.edu>
[not found] ` <20050324233921.GZ14202@opteron.random>
[not found] ` <20050325034341.GV32638@waste.org>
[not found] ` <20050327035149.GD4053@g5.random>
2005-03-27 5:48 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Matt Mackall
2005-03-27 6:04 ` Andrea Arcangeli
2005-03-27 6:38 ` Matt Mackall
2005-03-27 14:50 ` Andrea Arcangeli
2005-03-27 6:33 ` Dmitry Yusupov
2005-03-27 6:46 ` David S. Miller
2005-03-27 7:05 ` Dmitry Yusupov
2005-03-27 7:57 ` David S. Miller
2005-03-27 8:18 ` Dmitry Yusupov
2005-03-27 18:26 ` Mike Christie
2005-03-27 18:31 ` David S. Miller
2005-03-27 19:58 ` Matt Mackall
2005-03-27 21:49 ` Dmitry Yusupov
2005-03-27 18:47 ` Dmitry Yusupov
2005-03-27 21:14 ` Alex Aizman
[not found] ` <20050327211506.85EDA16022F6@mx1.suse.de>
2005-03-28 0:15 ` Andrea Arcangeli
2005-03-28 3:54 ` Rik van Riel
2005-03-28 4:34 ` David S. Miller
2005-03-28 4:50 ` Rik van Riel
2005-03-28 6:58 ` Alex Aizman
2005-03-28 16:12 ` Andi Kleen
2005-03-28 16:22 ` Andrea Arcangeli
2005-03-28 16:24 ` Rik van Riel
2005-03-29 15:11 ` Andi Kleen
2005-03-29 15:29 ` Rik van Riel
2005-03-29 17:03 ` Matt Mackall
2005-03-28 16:28 ` James Bottomley
2005-03-29 15:20 ` Andi Kleen
2005-03-29 15:56 ` James Bottomley
2005-03-29 17:19 ` Dmitry Yusupov
2005-03-29 21:08 ` jamal
2005-03-29 22:00 ` Rik van Riel
2005-03-29 22:17 ` Matt Mackall
2005-03-29 23:30 ` jamal
2005-03-29 23:00 ` jamal
2005-03-29 23:25 ` Matt Mackall
2005-03-30 0:30 ` H. Peter Anvin
2005-03-30 15:24 ` Andi Kleen
2005-03-29 22:03 ` Rick Jones
2005-03-29 23:13 ` jamal
2005-03-30 2:28 ` Alex Aizman
[not found] ` <E1DGSwp-0004ZE-00@thunker.thunk.org>
2005-03-30 17:16 ` Grant Grundler
2005-03-30 18:46 ` Dmitry Yusupov
2005-03-30 15:22 ` Andi Kleen
2005-03-30 15:33 ` Andrea Arcangeli
2005-03-30 15:38 ` Rik van Riel
2005-03-30 15:39 ` Andi Kleen
2005-03-30 15:44 ` Andrea Arcangeli
2005-03-30 15:50 ` Rik van Riel
2005-03-30 16:04 ` James Bottomley
2005-03-30 17:48 ` H. Peter Anvin
2005-03-30 16:02 ` Andi Kleen
2005-03-30 16:15 ` Andrea Arcangeli
2005-03-30 16:55 ` jamal
2005-03-30 18:42 ` Rik van Riel
2005-03-30 19:28 ` Alex Aizman
2005-03-31 11:41 ` Andi Kleen
2005-03-31 12:12 ` Rik van Riel
2005-03-31 18:59 ` Andi Kleen
2005-03-31 19:04 ` Rik van Riel
2005-03-31 15:35 ` Grant Grundler
2005-03-31 19:15 ` Alex Aizman
2005-03-31 19:34 ` Andi Kleen
2005-03-31 19:39 ` Rik van Riel
2005-03-31 11:45 ` Andi Kleen
2005-03-31 11:50 ` Andi Kleen
2005-03-31 17:09 ` Andrea Arcangeli
2005-03-31 22:05 ` Dmitry Yusupov
2005-03-30 17:24 ` Matt Mackall
2005-03-30 17:39 ` Dmitry Yusupov
2005-03-30 20:10 ` Mike Christie
2005-03-30 17:07 ` Grant Grundler
2005-03-30 5:12 ` H. Peter Anvin
2005-03-28 16:37 ` Dmitry Yusupov
2005-03-28 19:45 ` Roland Dreier
2005-03-28 20:32 ` Topic: Remote DMA network technologies Gerrit Huizenga
2005-03-28 20:36 ` Roland Dreier
[not found] ` <1112042936.5088.22.camel@beastie>
2005-03-28 22:32 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Benjamin LaHaise
2005-03-29 3:19 ` Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Roland Dreier
2005-03-30 16:00 ` Benjamin LaHaise
2005-03-31 1:08 ` Linux support for RDMA H. Peter Anvin
2005-04-02 18:08 ` [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics Dmitry Yusupov
2005-04-02 19:13 ` Ming Zhang
2005-04-04 6:31 ` Grant Grundler
2005-04-04 18:57 ` Rick Jones
2005-03-29 3:14 ` Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Roland Dreier
2005-04-01 1:49 Linux support for RDMA jaganav
2005-04-01 1:57 ` H. Peter Anvin
-- strict thread matches above, loose matches on Subject: below --
2005-04-01 23:50 Asgeir Eiriksson
2005-04-02 0:02 ` Dmitry Yusupov
2005-04-02 1:59 jaganav