From: Vladislav Bolkhovitin <vst@vlnb.net>
To: FUJITA Tomonori <tomof@acm.org>
Cc: robert.w.love@intel.com, yi.zou@intel.com,
	christopher.leech@intel.com, vasu.dev@intel.com,
	linux-scsi@vger.kernel.org, fujita.tomonori@lab.ntt.co.jp
Subject: Re: Open-FCoE on linux-scsi
Date: Sat, 05 Jan 2008 21:33:48 +0300
Message-ID: <477FCD8C.2040404@vlnb.net>
In-Reply-To: <200801031035.m03AZYcJ012171@mbox.iij4u.or.jp>

FUJITA Tomonori wrote:
>>What's the general opinion on this? Duplicate code vs. more kernel code?
>>I can see that you're already starting to clean up the code that you
>>ported. Does that mean the duplicate code isn't an issue to you? When we
>>fix bugs in the initiator they're not going to make it into your tree
>>unless you're diligent about watching the list.
> 
> It's hard to convince the kernel maintainers to merge something into
> mainline that which can be implemented in user space. I failed twice
> (with two iSCSI target implementations).

Tomonori and "the kernel maintainers",

In fact, almost everything in the kernel could be done in user space,
including all the drivers, networking, and I/O management with the
block/SCSI initiator subsystem and the disk cache manager. But does that
mean the current kernel is bad and all of the above should be (re)done
in user space instead? I think not. Linux isn't a microkernel for very
pragmatic reasons: simplicity and performance.

1. Simplicity.

For a SCSI target, especially one with a hardware target card, data
arrive in the kernel and are eventually served by the kernel, which does
the actual I/O or gets/puts the data from/to the cache. Dividing request
processing between user and kernel space creates unnecessary interface
layer(s) and effectively makes request processing a distributed job,
with all the complexity and reliability problems that brings. For
example, what currently happens in STGT if the user space part suddenly
dies? Will the kernel part recover gracefully? How much effort would it
take to implement that?

Another example is the code duplication mentioned above. Is it good?
What will it bring? Or do you care only about the amount of kernel code
and not about the overall amount of code? If so, you should (re)read
what Linus Torvalds thinks about that:
http://lkml.org/lkml/2007/4/24/364 (I don't consider myself an
authority on this question).

I agree that some of the processing, where it can be clearly separated,
can and should be done in user space. A good example of that approach is
connection negotiation and management the way it's done in open-iscsi.
But I don't agree that this idea should be taken to the extreme. It
might look good, but it's impractical: it will only make things more
complicated and harder to maintain.

2. Performance.

Modern SCSI transports, e.g. InfiniBand, have link latency as low as
1(!) microsecond. For comparison, an inter-thread context switch on a
modern system takes about the same time, and a syscall takes about 0.1
microsecond. So just ten empty syscalls or one context switch add as
much latency as the link itself. Even 1Gbps Ethernet has less than 100
microseconds of round-trip latency.
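
The arithmetic above can be sketched as follows. This is only an
illustration that reuses the three figures quoted in this paragraph;
the function names and the "2 syscalls plus 2 context switches" case
are hypothetical, not measurements of any real target.

```python
# Latency figures quoted in the text above, in microseconds.
LINK_LATENCY_US = 1.0     # low-latency link such as InfiniBand
CONTEXT_SWITCH_US = 1.0   # one inter-thread context switch
SYSCALL_US = 0.1          # one empty syscall


def added_latency_us(syscalls: int, context_switches: int) -> float:
    """Software latency added per command by the given crossings."""
    return syscalls * SYSCALL_US + context_switches * CONTEXT_SWITCH_US


def overhead_vs_link(syscalls: int, context_switches: int) -> float:
    """Added latency expressed as a multiple of the link latency."""
    return added_latency_us(syscalls, context_switches) / LINK_LATENCY_US


if __name__ == "__main__":
    # Ten empty syscalls: roughly one link latency.
    print(overhead_vs_link(syscalls=10, context_switches=0))
    # Hypothetical user-space hop: 2 syscalls and 2 context switches.
    print(overhead_vs_link(syscalls=2, context_switches=2))
```

With these figures, even a modest number of crossings per command adds
more latency than the link does.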

As you most likely know, the QLogic target driver for SCST allows
commands to be executed either directly from soft IRQ context or from
the corresponding thread. There is a steady 5% difference in IOPS
between those modes for 512-byte reads on nullio over a 4Gbps link. So
a single additional inter-kernel-thread context switch costs 5% of IOPS.
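
To relate a relative IOPS drop to a per-command cost, here is a minimal
sketch. It assumes commands are processed strictly one after another,
so per-command time is 1/IOPS, and it reads the 5% as the thread mode
delivering 95% of the soft IRQ mode's IOPS. The baseline of 50,000 IOPS
in the usage is purely hypothetical; the message gives no absolute
numbers.

```python
def implied_extra_us(baseline_iops: float, iops_drop: float) -> float:
    """Extra per-command time (microseconds) implied by an IOPS drop.

    Assumes serialized command processing, i.e. time per command is
    1 / IOPS. iops_drop is a fraction, e.g. 0.05 for a 5% drop.
    """
    if baseline_iops <= 0 or not 0.0 <= iops_drop < 1.0:
        raise ValueError("need baseline_iops > 0 and 0 <= iops_drop < 1")
    baseline_us = 1e6 / baseline_iops          # time per command before
    slower_us = baseline_us / (1.0 - iops_drop)  # time per command after
    return slower_us - baseline_us


if __name__ == "__main__":
    # Hypothetical baseline; only the 5% figure comes from the text.
    print(implied_extra_us(baseline_iops=50_000, iops_drop=0.05))
```

Under that hypothetical baseline, the implied extra cost is on the order
of one microsecond per command, which is in line with the context switch
time quoted earlier.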

Another source of extra latency, unavoidable with the user space
approach, is copying data to/from the cache. With the fully kernel space
approach, the cache can be used directly, so no extra copy is needed.

So, by putting code in user space you have to accept the extra latency
it adds. Many, if not most, real-life workloads are more or less latency
bound, not throughput bound, so you shouldn't be surprised that a single
stream "dd if=/dev/sdX of=/dev/null" on the initiator gives values that
are too low. Such a "benchmark" is no less important and practical than
all the multithreaded, latency-insensitive benchmarks that people like
to run.
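
A rough model of why a single stream is latency bound: with only one
request outstanding at a time, throughput is the block size divided by
the per-request round-trip time. The sketch below assumes exactly that
(no readahead, no queuing); the block size and latencies in the usage
are hypothetical.

```python
def single_stream_mb_per_s(block_bytes: int, request_latency_us: float) -> float:
    """Throughput of a one-request-at-a-time stream, in MB/s (10^6 bytes/s).

    Bytes per microsecond is numerically equal to MB per second.
    """
    if block_bytes <= 0 or request_latency_us <= 0:
        raise ValueError("block size and latency must be positive")
    return block_bytes / request_latency_us


if __name__ == "__main__":
    # Hypothetical: 4 KiB requests, 100 us round trip per request.
    print(single_stream_mb_per_s(4096, 100.0))
    # Same stream with 10 us of extra per-request software latency.
    print(single_stream_mb_per_s(4096, 110.0))
```

In this model, every microsecond added per request lowers the
single-stream throughput directly, whatever bandwidth the link offers.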

You may object that the backstorage's latency is much higher than 1
microsecond, but that is true only if data are read/written from/to the
actual backstorage media, not from a cache, even the backstorage
device's own cache. Nothing prevents a target from having 8 or even 64GB
of cache, so most accesses, even random ones, could be served from it.
This is especially important for sync writes.
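
The effect of the cache on this argument can be sketched with a simple
weighted average. All the numbers in the usage (cache and media
latencies, the fixed extra overhead) are hypothetical placeholders
chosen only to show the trend.

```python
def mean_latency_us(hit_ratio: float, cache_us: float, media_us: float) -> float:
    """Average service latency for a given cache hit ratio (0..1)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_us + (1.0 - hit_ratio) * media_us


def overhead_share(hit_ratio: float, cache_us: float, media_us: float,
                   extra_us: float) -> float:
    """Fraction of total per-request latency due to a fixed extra cost."""
    base = mean_latency_us(hit_ratio, cache_us, media_us)
    return extra_us / (base + extra_us)


if __name__ == "__main__":
    # Hypothetical: 10 us cache hit, 5000 us media access, 2 us overhead.
    for hit in (0.0, 0.9, 0.99, 1.0):
        print(hit, overhead_share(hit, 10.0, 5000.0, 2.0))
```

The higher the hit ratio, the larger the share of total latency that a
fixed per-request overhead represents.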

Thus, I believe that the partial user space, partial kernel space
approach to building SCSI targets is a move in the wrong direction,
because it brings practically nothing but costs a lot.

Vlad
