public inbox for linux-kernel@vger.kernel.org
From: Mark Hayden <mark@northforknet.com>
To: linux-kernel@vger.kernel.org
Subject: Linux networking and disk IO issues
Date: Mon, 04 Jun 2001 09:33:31 -0700
Message-ID: <3B1BB85B.360CE0F6@northforknet.com>

I recently released a clustered storage system for Linux (the software
in binary form and the manual can be downloaded from
www.northforknet.com).  With this software, you can create a highly
available storage cluster out of standard PC hardware.

During this work, we encountered a number of problems with the Linux
kernel.  I believe these all apply to the current kernels (though I'm
working with the 2.4.2 kernel).  If you respond, please CC me
directly, since I follow Linux kernel development through weekly
summaries in Linux Weekly News.

regards, Mark Hayden
mark@northforknet.com

* The Linux networking stack requires all skbuff buffers to be
  contiguous.  As far as I can tell, this makes it impossible to
  write high-bandwidth UDP applications on Linux.  For instance, the
  kernel will drop a fragmented 8KB message if it cannot allocate 8KB
  of contiguous memory to reassemble it into.  I have found that it
  is relatively easy to enter regimes where this can cause massive
  packet loss.
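
  One way to sidestep kernel reassembly is to keep every datagram
  small enough that IP never fragments it.  The sketch below only
  works out the sizes; it assumes IPv4 without options and takes the
  link MTU as a parameter (1500 is the usual Ethernet value).  The
  constant and function names are hypothetical, and the application
  would still need its own sequencing and reassembly on top.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed header sizes: IPv4 without options, plus UDP. */
enum { ASSUMED_IPV4_HDR = 20, ASSUMED_UDP_HDR = 8 };

/* Largest UDP payload that fits in one unfragmented IP packet. */
size_t max_unfragmented_payload(size_t mtu)
{
    assert(mtu > ASSUMED_IPV4_HDR + ASSUMED_UDP_HDR);
    return mtu - ASSUMED_IPV4_HDR - ASSUMED_UDP_HDR;
}

/* Number of datagrams needed to carry msg_len bytes without
 * relying on IP fragmentation. */
size_t datagrams_needed(size_t msg_len, size_t mtu)
{
    size_t payload = max_unfragmented_payload(mtu);
    return (msg_len + payload - 1) / payload;
}
```

  With a 1500-byte MTU, an 8KB message is split by the application
  into several small datagrams, each of which needs only a small
  buffer on the receiving side.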

* readv()/writev().  Linux serializes scatter/gather IO operations
  into one operation per iovec entry.  This is the relevant code
  from a 2.4-series kernel:

	/* VERIFY_WRITE actually means a read, as we write to user space */
	fn = (type == VERIFY_WRITE ? file->f_op->read :
	      (io_fn_t) file->f_op->write);

	ret = 0;
	vector = iov;
	while (count > 0) {
		void * base;
		size_t len;
		ssize_t nr;

		base = vector->iov_base;
		len = vector->iov_len;
		vector++;
		count--;

		nr = fn(file, base, len, &file->f_pos);

		if (nr < 0) {
			if (!ret) ret = nr;
			break;
		}
		ret += nr;
		if (nr != len)
			break;
	}

  This causes several problems:

  * For writes, it forces read-modify-write when the individual
    iovecs are not block-aligned.

  * For reads, it prevents all the read requests from being presented
    at the same time to the IO system.  This is a problem for raw IO
    without read-ahead.
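
  A user-space workaround for the write side is to copy the iovecs
  into one contiguous buffer and issue a single write(), so the
  kernel sees one request instead of one per entry.  This is only a
  sketch: the function name is hypothetical, it costs an extra copy,
  and for raw devices the bounce buffer would also have to satisfy
  the device's alignment rules, which plain malloc() does not
  guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gather all iovec entries into one buffer and write it in one call.
 * Returns what write() returns, or -1 if the buffer cannot be
 * allocated. */
ssize_t writev_coalesced(int fd, const struct iovec *iov, int iovcnt)
{
    size_t total = 0;
    size_t off = 0;
    char *buf;
    ssize_t nr;
    int saved_errno;

    assert(iovcnt >= 0);
    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;
    if (total == 0)
        return 0;

    buf = malloc(total);
    if (buf == NULL)
        return -1;

    for (int i = 0; i < iovcnt; i++) {
        memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }

    nr = write(fd, buf, total);
    saved_errno = errno;        /* keep write()'s errno across free() */
    free(buf);
    errno = saved_errno;
    return nr;
}
```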

* There is no preadv() or pwritev().  (The pread()/pwrite() system
  calls combine an llseek with a read/write system call.)  This means
  that if you want to have multiple threads in a process write random
  blocks using scatter-gather, you need to open() a device file
  once per thread and make the extra llseek() calls.
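
  A positional scatter write can be approximated in user space by
  calling pwrite() once per iovec entry at an advancing offset.  The
  sketch below uses a hypothetical function name and mirrors the
  return convention of the kernel loop quoted above.  It never
  touches the shared file offset, so several threads can use one
  descriptor, but it is not atomic and still issues one operation
  per entry.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write iovcnt buffers starting at `offset` without moving the
 * descriptor's file position.  Returns the number of bytes written,
 * or -1 if the very first pwrite() fails. */
ssize_t pwritev_emulated(int fd, const struct iovec *iov, int iovcnt,
                         off_t offset)
{
    ssize_t total = 0;

    assert(iovcnt >= 0);
    for (int i = 0; i < iovcnt; i++) {
        ssize_t nr = pwrite(fd, iov[i].iov_base, iov[i].iov_len,
                            offset + total);
        if (nr < 0)
            return total ? total : -1;   /* report partial progress */
        total += nr;
        if ((size_t)nr != iov[i].iov_len)
            break;                       /* short write: stop here */
    }
    return total;
}
```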

* The requirement that everything about operations to raw character
  device files (length, offset in file, *and* address in memory) has
  to be 512-byte aligned is a real hassle.
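
  In practice this means every buffer handed to a raw device has to
  come from an aligned allocator and every length has to be rounded
  up to a sector multiple.  A minimal sketch, assuming
  posix_memalign() is available; the macro and function names are
  hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define RAW_ALIGN 512   /* alignment required for raw IO, per the text */

/* Round a length up to the next multiple of RAW_ALIGN. */
size_t round_up_to_sector(size_t n)
{
    return (n + RAW_ALIGN - 1) & ~(size_t)(RAW_ALIGN - 1);
}

/* Allocate a buffer whose address and size are both 512-byte
 * aligned.  The padded size is stored in *padded_len.  Returns NULL
 * on failure; release the buffer with free(). */
void *alloc_raw_buffer(size_t len, size_t *padded_len)
{
    void *p = NULL;
    size_t padded = round_up_to_sector(len);

    if (padded == 0)
        padded = RAW_ALIGN;     /* never ask for a zero-sized buffer */
    if (posix_memalign(&p, RAW_ALIGN, padded) != 0)
        return NULL;
    assert((uintptr_t)p % RAW_ALIGN == 0);
    *padded_len = padded;
    return p;
}
```

  The file offset passed to the read or write call still has to be a
  multiple of 512 as well; the helper above only covers the memory
  address and the length.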

* There are several assumptions in the kernel that make it very
  difficult to write virtual block devices that convert IO operations
  into networked RPC requests.  For instance, if you run the normal
  NBD device where the server is on the same machine as the client,
  you will likely deadlock your system.  Our software distribution
  includes a patch to the 2.4.2 kernel that prevents these deadlock
  scenarios with NBD, but it is something of a hack.  (I want to thank
  Stephen Tweedie for his help in developing this work-around, though
  of course the hack is my responsibility.)  I don't know what could
  be done to fix these problems correctly without major changes to
  block IO in the kernel.

Thread overview: 3+ messages
2001-06-04 16:33 Mark Hayden [this message]
2001-06-04 20:02 ` Linux networking and disk IO issues Alan Cox
     [not found] <3B1BB85B.360CE0F6@northforknet.com.suse.lists.linux.kernel>
2001-06-13 10:36 ` Andi Kleen