linux-scsi.vger.kernel.org archive mirror
From: Vladislav Bolkhovitin <vst@vlnb.net>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bart Van Assche <bart.vanassche@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>,
	linux-scsi@vger.kernel.org, scst-devel@lists.sourceforge.net,
	linux-kernel@vger.kernel.org
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Wed, 30 Jan 2008 14:17:17 +0300
Message-ID: <47A05CBD.5050803@vlnb.net>
In-Reply-To: <1201639331.3069.58.camel@localhost.localdomain>

James Bottomley wrote:
> The two target architectures perform essentially identical functions, so
> there's only really room for one in the kernel.  Right at the moment,
> it's STGT.  Problems in STGT come from the user<->kernel boundary which
> can be mitigated in a variety of ways.  The fact that the figures are
> pretty much comparable on non IB networks shows this.
> 
> I really need a whole lot more evidence than at worst a 20% performance
> difference on IB to pull one implementation out and replace it with
> another.  Particularly as there's no real evidence that STGT can't be
> tweaked to recover the 20% even on IB.

James,

Although the performance difference between STGT and SCST is apparent,
it isn't the only reason why SCST is better. I've already written about
this many times on various mailing lists, but let me summarize it once
more here.

As you know, almost every part of the kernel could be implemented in
user space, including all the drivers, networking, and I/O management
with the block/SCSI initiator subsystem and the disk cache manager. But
does that mean that the current Linux kernel is bad and all of the above
should be (re)done in user space instead? I believe not. Linux isn't a
microkernel for very pragmatic reasons: simplicity and performance. So
another important reason why SCST is better is simplicity.

For a SCSI target, especially with a hardware target card, data comes
from the kernel and is eventually served by the kernel, which does the
actual I/O or gets/puts data from/to the cache. Dividing request
processing between user and kernel space creates unnecessary interface
layer(s) and effectively turns request processing into a distributed
job, with all the complexity and reliability problems that implies. From
my point of view, such a division, where user space is the master side
and the kernel is the slave, is rather wrong, because:

1. It makes the kernel depend on a user program, which services it and
provides its routines for it, while the regular paradigm is the
opposite: the kernel services user space applications. As a direct
consequence, there is no real protection for the kernel from faults in
the STGT core code without excessive effort, which, unsurprisingly,
hasn't been made so far and, it seems, is never going to be made. So, in
practice, debugging and developing under STGT isn't easier than if the
whole code were in kernel space, but actually harder (see below why).

2. It requires a new, complicated interface between kernel and user
space, which creates additional maintenance and debugging headaches that
don't exist for kernel-only code. Linus Torvalds some time ago described
perfectly why this is bad, see http://lkml.org/lkml/2007/4/24/451,
http://lkml.org/lkml/2006/7/1/41 and http://lkml.org/lkml/2007/4/24/364.

3. It makes it impossible for a SCSI target to use (at least in a simple
and sane way) many effective optimizations: zero-copy cached I/O, more
control over read-ahead, device queue plugging/unplugging, etc. One
example of such a feature that is already implemented is zero-copy
network data transmission, done in the simple 260-line put_page_callback
patch. This optimization is especially important for the user space
gateway (scst_user module), see below for details.

The whole argument that development for the kernel is harder than for
user space is total nonsense nowadays. It's different, yes, and in some
ways more limited, yes, but not harder. For those who need gdb (I
haven't needed it for many years), the kernel has kgdb, plus it has many
debug facilities that are either not available in user space or more
limited there, like lockdep, lockup detection, oprofile, etc. (Not to
mention the wider choice of more efficiently implemented synchronization
primitives, and not only those.)

For people who need to emulate complicated target devices, e.g. in the
case of a VTL (Virtual Tape Library), where there is a need to operate
on large mmap'ed memory areas, SCST provides a gateway to user space
(the scst_user module). But, in contrast with STGT, it's done in the
regular "kernel - master, user application - slave" paradigm, so it's
reliable and no fault in a user space device emulator can break the
kernel or other user space applications. Plus, since the SCSI target
state machine and memory management are in the kernel, it's very
efficient and needs only one kernel-user space switch per SCSI command.

Also, I should note here that in its current state STGT in many respects
doesn't fully conform to the SCSI specifications, especially in the area
of management events, like Unit Attention generation and processing, and
it doesn't look like anybody cares about it. At the same time, SCST pays
great attention to full conformance with the SCSI specifications,
because the price of non-conformance is possible corruption of the
user's data.

Returning to performance, modern SCSI transports, e.g. InfiniBand, have
link latency as low as 1(!) microsecond. For comparison, the
inter-thread context switch time on a modern system is about the same,
and syscall time is about 0.1 microsecond. So just ten empty syscalls or
one context switch add the same latency as the link. Even 1Gbps Ethernet
has less than 100 microseconds of round-trip latency.
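
The comparison above is plain arithmetic, and a minimal sketch of it may
help. It uses only the round figures quoted in this message (1
microsecond link latency, about 1 microsecond per context switch, about
0.1 microsecond per syscall); the constant and function names are
illustrative assumptions and are not taken from SCST or STGT.

```python
# Round figures from the message, in nanoseconds to keep the math exact.
LINK_LATENCY_NS = 1000      # ~1 microsecond InfiniBand link latency
CONTEXT_SWITCH_NS = 1000    # ~1 microsecond per inter-thread context switch
SYSCALL_NS = 100            # ~0.1 microsecond per empty syscall


def added_latency_ns(syscalls: int, context_switches: int) -> int:
    """Latency added to one command by the given number of syscalls
    and context switches (hypothetical helper for illustration)."""
    return syscalls * SYSCALL_NS + context_switches * CONTEXT_SWITCH_NS


def overhead_vs_link(syscalls: int, context_switches: int) -> float:
    """Added latency expressed as a multiple of the link latency."""
    return added_latency_ns(syscalls, context_switches) / LINK_LATENCY_NS


if __name__ == "__main__":
    # Ten empty syscalls, or one context switch, each cost as much as the link.
    print(overhead_vs_link(syscalls=10, context_switches=0))
    print(overhead_vs_link(syscalls=0, context_switches=1))
```

Under these assumed figures, any per-command path that crosses the
user-kernel boundary a handful of times is already paying at least the
link latency again.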

As you probably know, the QLogic Fibre Channel target driver for SCST
allows commands to be executed either directly from soft IRQ context or
from the corresponding thread. There is a steady 5-7% difference in IOPS
between those modes for 512-byte reads on nullio over a 4Gbps link. So a
single additional inter-kernel-thread context switch costs 5-7% of IOPS.

Another source of additional latency, unavoidable with the user space
approach, is copying data to/from the cache. With the fully kernel space
approach, the cache can be used directly, so no extra copy is needed.
We can estimate how much latency the data copying adds. On modern
systems memory copy throughput is less than 2GB/s, so on a 20Gbps
InfiniBand link it almost doubles the data transfer latency.
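
The estimate above can be checked with a small sketch. It assumes one
extra memory copy per transfer that is not overlapped with the wire
transfer, and it treats the 2GB/s figure as the copy throughput; the
function name is hypothetical. For a payload of N bytes, wire time is
N/L and copy time is N/C, so the total grows by a factor of 1 + L/C
regardless of N.

```python
def transfer_time_ratio(link_gbit_per_s: float, copy_gbyte_per_s: float) -> float:
    """Return (wire time + copy time) / wire time.

    Assumes a single extra copy that runs serially with the wire
    transfer, so the payload size cancels out of the ratio.
    """
    link_gbyte_per_s = link_gbit_per_s / 8.0  # bits -> bytes
    return 1.0 + link_gbyte_per_s / copy_gbyte_per_s


if __name__ == "__main__":
    # Nominal 20 Gbit/s signalling rate against a 2 GB/s memory copy.
    print(transfer_time_ratio(20, 2.0))
    # 16 Gbit/s payload rate, assuming 8b/10b line encoding on the link.
    print(transfer_time_ratio(16, 2.0))
```

With these inputs the factor comes out at 2.25 for the nominal 20 Gbit/s
rate and 2.0 for a 16 Gbit/s payload rate, which is in line with the
"almost doubles" estimate; a slower copy only makes the factor larger.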

So, by putting code in user space you have to accept the extra latency
it adds. Many, if not most, real-life workloads are more or less latency
bound, not throughput bound, so it shouldn't be a surprise that a single
stream "dd if=/dev/sdX of=/dev/null" on the initiator gives values that
are too low. Such a "benchmark" is no less important and practical than
all the multithreaded, latency-insensitive benchmarks that people like
running, because it does essentially the same thing most Linux processes
do when they read data from files.

You may object that the latency of the target's backstorage device(s) is
a lot more than 1 microsecond, but that is relevant only if data is
read/written from/to the actual backstorage media, not from the cache,
even the backstorage device's own cache. Nothing prevents a target from
having 8 or even 64GB of cache, so even most random accesses could be
served from it. This is especially important for sync writes.

Thus, why SCST is better:

1. It is simpler, because it's monolithic, so all its components are in
one place and communicate using direct function calls. Hence, it is
smaller, faster, more reliable and more maintainable. Currently it's
bigger than STGT, just because it supports more features, see (2).

2. It supports more features: 1-to-many pass-through support with all
the functionality it needs, including support for non-disk SCSI devices
like tapes; the SGV cache; BLOCKIO, where requests are converted to bios
and sent directly to the block level (this mode is effective for mostly
random workloads with data set size >> memory size on the target); etc.

3. It has better performance and is going to get even better. SCST is
only now entering the phase where it starts exploiting all the
advantages of being in the kernel. In particular, zero-copy cached I/O
is currently being implemented.

4. It provides a safer and more efficient interface for emulating target
devices in user space via the scst_user module.

5. It conforms much better to the SCSI specifications (see above).

Vlad
