* Linux networking and disk IO issues
@ 2001-06-04 16:33 Mark Hayden
2001-06-04 20:02 ` Alan Cox
0 siblings, 1 reply; 3+ messages in thread
From: Mark Hayden @ 2001-06-04 16:33 UTC (permalink / raw)
To: linux-kernel
I recently released a clustered storage system for Linux (the software
in binary form and the manual can be downloaded from
www.northforknet.com). With this software, you can create a highly
available storage cluster out of standard PC hardware.
During this work, we encountered a number of problems with the Linux
kernel. I believe these all apply to the current kernels (though I'm
working with the 2.4.2 kernel). If you respond, please CC me
directly, since I follow Linux kernel development through weekly
summaries in Linux Weekly News.
regards, Mark Hayden
mark@northforknet.com
* The Linux networking stack requires all skbuff buffers to be
contiguous. As far as I can tell, this makes it impossible to
write high-bandwidth UDP applications on Linux. For instance, the
kernel will drop a fragmented 8KB message if it cannot allocate 8KB
of contiguous memory to reassemble it into. I have found that it
is relatively easy to enter regimes where this can cause massive
packet loss.
* readv()/writev(). Linux serializes scatter/gather IO operations
into an operation for each iovec entry. This is the relevant code
from a 2.4-series kernel:
	/* VERIFY_WRITE actually means a read, as we write to user space */
	fn = (type == VERIFY_WRITE ? file->f_op->read :
	      (io_fn_t) file->f_op->write);
	ret = 0;
	vector = iov;
	while (count > 0) {
		void * base;
		size_t len;
		ssize_t nr;

		base = vector->iov_base;
		len = vector->iov_len;
		vector++;
		count--;

		nr = fn(file, base, len, &file->f_pos);
		if (nr < 0) {
			if (!ret) ret = nr;
			break;
		}
		ret += nr;
		if (nr != len)
			break;
	}
This causes several problems:
* For writes, it forces read-modify-write when the individual
iovecs are not block-aligned.
* For reads, it prevents all the read requests from being presented
at the same time to the IO system. This is a problem for raw IO
without read-ahead.
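For reference, the user-space side of the call being discussed might look like the sketch below. It only shows how a caller hands several buffers to writev() in one system call; the helper name write_two_parts is made up for illustration and is not from the software mentioned above.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Illustrative helper (hypothetical name): gather two separate buffers
 * into a single writev() call.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_two_parts(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)hdr;   /* writev() does not modify the data */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    return writev(fd, iov, 2);
}
```

From the caller's point of view this is one request; the kernel loop quoted above is what turns it into one read or write per iovec entry on a file.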
* There is no preadv(), pwritev(). (The pread/pwrite() system calls
combine a llseek with a read/write system call.) This means that
if you want to have multiple threads in a process write random
blocks using scatter-gather, you need to open() a device file
multiple times and make the extra llseek() calls.
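One way to avoid the extra open() and llseek() calls is to build a positional gather write out of pwrite(), at the cost of issuing one system call per iovec entry. The sketch below assumes that trade-off is acceptable; pwritev_emul is a hypothetical helper name, not an existing kernel or libc interface.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: positional gather write built from one pwrite()
 * per iovec entry. The shared file offset is never touched, so several
 * threads can use the same descriptor. */
ssize_t pwritev_emul(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    ssize_t total = 0;

    for (int i = 0; i < iovcnt; i++) {
        ssize_t nr = pwrite(fd, iov[i].iov_base, iov[i].iov_len,
                            offset + total);
        if (nr < 0)
            return total ? total : -1;   /* report partial progress if any */
        total += nr;
        if ((size_t)nr != iov[i].iov_len)
            break;                       /* short write: stop, as the kernel loop does */
    }
    return total;
}
```

This does not address the serialization complaint, since the entries are still written one at a time; it only removes the dependence on the descriptor's file position.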
* The requirement that everything about operations to raw character
device files (length, offset in file, *and* address in memory) has
to be 512-byte aligned is a real hassle.
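A minimal sketch of how an application might satisfy the memory-address and length parts of that requirement, assuming posix_memalign() is available. The names SECTOR_SIZE, round_up_sector and alloc_sector_buf are illustrative only; the file offset still has to be kept on a 512-byte boundary by the caller.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <stdlib.h>

#define SECTOR_SIZE 512   /* alignment unit named in the point above */

/* Round len up to the next multiple of SECTOR_SIZE. */
size_t round_up_sector(size_t len)
{
    return (len + SECTOR_SIZE - 1) & ~(size_t)(SECTOR_SIZE - 1);
}

/* Allocate a buffer whose address and size are both sector-aligned.
 * Stores the padded size in *padded_len.
 * Returns NULL on failure; release the buffer with free(). */
void *alloc_sector_buf(size_t len, size_t *padded_len)
{
    void *buf = NULL;
    size_t n = round_up_sector(len);

    if (posix_memalign(&buf, SECTOR_SIZE, n) != 0)
        return NULL;
    *padded_len = n;
    return buf;
}
```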
* There are several assumptions in the kernel that make it very
difficult to write virtual block devices that convert IO operations
into networked RPC requests. For instance, if you run the normal
NBD device where the server is on the same machine as the client,
you will likely deadlock your system. Our software distribution
includes a patch to the 2.4.2 kernel that prevents these deadlock
scenarios with NBD, but it is something of a hack. (I want to thank
Stephen Tweedie for his help in developing this work-around, though
of course the hack is my responsibility.) I don't know what could
be done to fix these problems correctly without major changes to
block IO in the kernel.
* Re: Linux networking and disk IO issues
2001-06-04 16:33 Linux networking and disk IO issues Mark Hayden
@ 2001-06-04 20:02 ` Alan Cox
0 siblings, 0 replies; 3+ messages in thread
From: Alan Cox @ 2001-06-04 20:02 UTC (permalink / raw)
To: Mark Hayden; +Cc: linux-kernel
> * The Linux networking stack requires all skbuff buffers to be
> contiguous. As far as I can tell, this makes it impossible to
> write high-bandwidth UDP applications on Linux. For instance, the
> kernel will drop a fragmented 8KB message if it cannot allocate 8KB
> of contiguous memory to reassemble it into. I have found that it
> is relatively easy to enter regimes where this can cause massive
> packet loss.
If you are fragmenting messages then you want to optimise the protocol a bit
more. IP fragmentation increases processing overheads and reduces performance
badly in the presence of link congestion and error.
Most modern file sharing protocols are TCP based for good reason
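One possible reading of "optimise the protocol", sketched under assumptions: split each message at the application level so that every datagram fits in a single unfragmented IPv4 packet. The 20-byte IPv4 header (no options), the 8-byte UDP header, and the per-datagram application header are assumptions of this sketch, and the function names are made up for illustration.

```c
#include <stddef.h>

#define IPV4_HDR_LEN 20   /* IPv4 header without options (assumed) */
#define UDP_HDR_LEN   8

/* Largest UDP payload that fits in one unfragmented IPv4 packet,
 * or 0 if the MTU cannot even hold the headers. */
size_t max_udp_payload(size_t mtu)
{
    if (mtu <= IPV4_HDR_LEN + UDP_HDR_LEN)
        return 0;
    return mtu - IPV4_HDR_LEN - UDP_HDR_LEN;
}

/* Number of datagrams needed to carry msg_len bytes when each datagram
 * also carries a (hypothetical) application header of app_hdr_len bytes
 * used for reassembly. Returns 0 if nothing fits. */
size_t datagrams_needed(size_t msg_len, size_t mtu, size_t app_hdr_len)
{
    size_t payload = max_udp_payload(mtu);
    size_t chunk;

    if (payload <= app_hdr_len)
        return 0;
    chunk = payload - app_hdr_len;
    return (msg_len + chunk - 1) / chunk;
}
```

With a 1500-byte MTU and an 8-byte application header, an 8KB message would go out as six datagrams of at most 1464 data bytes each, and none of them would need an 8KB contiguous buffer for IP reassembly.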
> * readv()/writev(). Linux serializes scatter/gather IO operations
> into an operation for each iovec entry. This is the relevant code
> from a 2.4-series kernel:
Not on a socket. On a file it makes very little difference. Socket readv/writev
behaviour varies by protocol family.
> * For writes, it forces read-modify-write when the individual
> iovecs are not block-aligned.
From cache, of data live in the L1 cache of the CPU
> * There is no preadv(), pwritev(). (The pread/pwrite() system calls
> combine a llseek with a read/write system call.) This means that
True. The Single UNIX Specification does not include preadv(). Really you want
to take it up with the Open Group. That said, Linux does sometimes add syscalls
that are not in SuS.
> * The requirement that everything about operations to raw character
> device files (length, offset in file, *and* address in memory) has
> to be 512-byte aligned is a real hassle.
Welcome to PC hardware. A large amount of PC hardware genuinely has limitations
of this nature. Most disk controllers can only write whole sectors on a sector
alignment. Many network controllers can only handle burst or 32-bit alignment
policies.
* Re: Linux networking and disk IO issues
[not found] <3B1BB85B.360CE0F6@northforknet.com.suse.lists.linux.kernel>
@ 2001-06-13 10:36 ` Andi Kleen
0 siblings, 0 replies; 3+ messages in thread
From: Andi Kleen @ 2001-06-13 10:36 UTC (permalink / raw)
To: Mark Hayden; +Cc: linux-kernel
[this time with l-k cc]
Mark Hayden <mark@northforknet.com> writes:
> * The Linux networking stack requires all skbuff buffers to be
> contiguous. As far as I can tell, this makes it impossible to
> write high-bandwidth UDP applications on Linux. For instance, the
> kernel will drop a fragmented 8KB message if it cannot allocate 8KB
> of contiguous memory to reassemble it into. I have found that it
> is relatively easy to enter regimes where this can cause massive
> packet loss.
2.4.4+ supports fragmented packets and packet lists.
You're probably seeing the 8K allocation problem for incoming packets, which
need to be allocated by the driver at interrupt time with GFP_ATOMIC.
GFP_ATOMIC memory is limited. The 2.4 VM unfortunately has no way at the
moment to keep more GFP_ATOMIC memory free or to tune for heavy interrupt load
(2.2 allowed this by increasing the freepages sysctl). Hopefully this VM bug
will be fixed in the not too distant future.
A workaround in the driver would be to use the 2.4.4 fragmented buffers
(of course you'll still run into GFP_ATOMIC limits without manual tuning)
or allocate RX memory from a thread with GFP_KERNEL.
-Andi