ext3 file system

public inbox for linux-kernel@vger.kernel.org

* ext3 file system
@ 2003-12-17 22:13 jshankar
  2003-12-17 22:25 ` Richard B. Johnson
  0 siblings, 1 reply; 10+ messages in thread
From: jshankar @ 2003-12-17 22:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

Hello,

Does the ext3 file system have to wait for the acknowledgement of a block of
data written to the SCSI device before writing the next block of data?

Is there parallel I/O, where the file system goes on writing blocks of data
without waiting for the acknowledgement?

Please let me know your opinion.

Thanks
Jayshankar


* Re: ext3 file system
  2003-12-17 22:13 jshankar
@ 2003-12-17 22:25 ` Richard B. Johnson
  2003-12-17 23:02   ` Mike Fedyk
  0 siblings, 1 reply; 10+ messages in thread
From: Richard B. Johnson @ 2003-12-17 22:25 UTC (permalink / raw)
  To: jshankar; +Cc: linux-fsdevel, linux-kernel

On Wed, 17 Dec 2003, jshankar wrote:

> Hello,
>
> Does the ext3 file system have to wait for the acknowledgement of a block of
> data written to the SCSI device before writing the next block of data?
>

No. Many SCSI drives and adapters allow queued commands and disconnect
operation.

> Is there parallel I/O, where the file system goes on writing blocks of data
> without waiting for the acknowledgement?
>

This is the normal mode of operation.

> Please let me know your opinion.
>
> Thanks
> Jayshankar
>

Normal Unix/Linux file-systems write data to RAM. At some later time,
typically when memory gets tight or a periodic flush runs, the data are
written to the device.

Basically with Unix/Linux, you are using a RAM-Disk that overflows
to the physical media. There are special file-systems (journaling)
that guarantee that something, enough to recover the data, is
written at periodic intervals.
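
As a minimal user-space sketch of this behaviour (in Python; the function
name and the file name are made up for illustration): a write normally
returns once the data is in the kernel's cache, and an explicit fsync is
what asks the kernel to push that file's data out to the device.

```python
import os
import tempfile

def buffered_then_durable(path, payload):
    """Write payload, then explicitly force it out to the device.

    os.write() normally returns once the data sits in the kernel's
    cache; os.fsync() asks the kernel to write the file's data to the
    storage device and waits for that to finish.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = os.write(fd, payload)   # usually returns before any disk I/O
        os.fsync(fd)                      # the waiting happens here, if at all
    finally:
        os.close(fd)
    return written

if __name__ == "__main__":
    # "example.dat" is a hypothetical file name used only for illustration.
    target = os.path.join(tempfile.mkdtemp(), "example.dat")
    print(buffered_then_durable(target, b"x" * 4096))
```

Without the fsync call the program would still see its own data on a
subsequent read, because that read is served from the same cache.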

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
            Note 96.31% of all statistics are fiction.


* Re: ext3 file system
  2003-12-17 22:25 ` Richard B. Johnson
@ 2003-12-17 23:02   ` Mike Fedyk
  0 siblings, 0 replies; 10+ messages in thread
From: Mike Fedyk @ 2003-12-17 23:02 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: jshankar, linux-fsdevel, linux-kernel

On Wed, Dec 17, 2003 at 05:25:49PM -0500, Richard B. Johnson wrote:
> to the physical media. There are special file-systems (journaling)
> that guarantee that something, enough to recover the data, is
> written at periodic intervals.

Most journaling filesystems make guarantees on the filesystem meta-data, but
not on the data.  Some, like ext3 and reiserfs (with SuSE's journaling
patch), can journal the data, or order things so that the data is written
before any pointers (i.e. meta-data) make it to the disk, so it will be harder
to lose data.
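
In ext3 these two behaviours correspond to the data=journal and
data=ordered mount options. The ordering rule itself can be illustrated
with a toy model (Python; this is not ext3 code, and ToyDisk,
write_ordered and pointers_are_valid are names invented for this sketch):
the data blocks reach the "disk" first, and only then is the meta-data
that points at them published.

```python
class ToyDisk:
    """Records the order in which blocks would reach stable storage."""
    def __init__(self):
        self.log = []          # sequence of (kind, name) tuples
        self.data = {}         # block name -> contents
        self.pointers = {}     # file name -> list of block names

def write_ordered(disk, filename, blocks):
    """Ordered-style update: all data blocks first, meta-data last."""
    names = []
    for i, contents in enumerate(blocks):
        name = f"{filename}#{i}"
        disk.data[name] = contents
        disk.log.append(("data", name))
        names.append(name)
    # Only now publish the pointers (the "meta-data").
    disk.pointers[filename] = names
    disk.log.append(("metadata", filename))

def pointers_are_valid(disk):
    """True if every published pointer refers to a block already written."""
    return all(n in disk.data for ns in disk.pointers.values() for n in ns)

if __name__ == "__main__":
    d = ToyDisk()
    write_ordered(d, "report.txt", [b"one", b"two"])
    print(d.log)
```

If the model "crashes" anywhere before the last step, it holds data that
nothing points to, but never a pointer to a block that was not written.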


* RE: ext3 file system
@ 2003-12-17 23:25 jshankar
  2003-12-17 23:59 ` Brad Boyer
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: jshankar @ 2003-12-17 23:25 UTC (permalink / raw)
  To: Richard B. Johnson, Mike Fedyk; +Cc: linux-fsdevel, linux-kernel

Hello,

Please provide some more insight.

Suppose a filesystem issues a write command to the disk with around ten 4K
blocks to be written. From the SCSI device's point of view, I don't see where
the parallel I/O is: it has only one write command. If some other process
sends a write request, that request needs to be queued. But then the next
question is how the write data is handled. Does the SCSI device not give a
response for each block of data written? In other words, is the response
given only after all the blocks of data for a single write request have been
written?

Thanks
Jay

>===== Original Message From Mike Fedyk <mfedyk@matchmail.com> =====
>On Wed, Dec 17, 2003 at 05:25:49PM -0500, Richard B. Johnson wrote:
>> to the physical media. There are special file-systems (journaling)
>> that guarantee that something, enough to recover the data, is
>> written at periodic intervals.
>
>Most journaling filesystems make guarantees on the filesystem meta-data, but
>not on the data.  Some, like ext3 and reiserfs (with SuSE's journaling
>patch), can journal the data, or order things so that the data is written
>before any pointers (i.e. meta-data) make it to the disk, so it will be harder
>to lose data.


* Re: ext3 file system
  2003-12-17 23:25 ext3 file system jshankar
@ 2003-12-17 23:59 ` Brad Boyer
  2003-12-18  1:25 ` Hans Reiser
  2003-12-18 14:17 ` Richard B. Johnson
  2 siblings, 0 replies; 10+ messages in thread
From: Brad Boyer @ 2003-12-17 23:59 UTC (permalink / raw)
  To: jshankar; +Cc: Richard B. Johnson, Mike Fedyk, linux-fsdevel, linux-kernel

I think the big thing that you're missing is that block device requests
are totally asynchronous. In general, a block gets sent down to the
block layer as needing to be transferred one way or the other. It gets
queued up in the driver for that block device, such as sd.o for SCSI.
That driver will be notified that it has requests to process, and
can handle them however it wants. When it is done with any specific
request, it calls back up and sets that request as done. You could
have multiple requests in the queue at the same time, and a driver
can be working on more than one at a time if it supports that.

In the specific case of SCSI, the host adapter and disk drives may
support various queues along the way, with any number of outstanding
requests in various buffers. The controller may be able to merge
requests on the fly in order to improve performance.

Obviously this is a fairly abstract view of the whole process. For
details you would need to read the code. You can trace the process
down (filesystem -> page buffers -> block devices -> block driver).
In the SCSI case, the block driver is sd.o, and you then can follow
down into the generic SCSI mid-layer and the controller driver.
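
The submit-now, complete-later pattern described above can be sketched
in user space (Python; ToyBlockDriver and its methods are hypothetical
names for this illustration and are not kernel interfaces): the caller
queues ten requests without waiting on any of them, and a worker thread
standing in for the driver calls back as each one is finished.

```python
import queue
import threading

class ToyBlockDriver:
    """Hypothetical driver: takes queued requests, completes them later."""
    def __init__(self):
        self.requests = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, block_no, data, on_done):
        """Queue a write and return immediately; on_done fires on completion."""
        self.requests.put((block_no, data, on_done))

    def _run(self):
        while True:
            item = self.requests.get()
            if item is None:             # shutdown marker
                break
            block_no, data, on_done = item
            # A real driver would program the hardware at this point.
            on_done(block_no, len(data))

    def shutdown(self):
        """Let the queued requests drain, then stop the worker."""
        self.requests.put(None)
        self.worker.join()

if __name__ == "__main__":
    completed = []
    drv = ToyBlockDriver()
    for n in range(10):                  # ten 4K blocks, none waited on
        drv.submit(n, b"\0" * 4096, lambda b, size: completed.append(b))
    drv.shutdown()
    print(completed)
```

This toy worker finishes requests strictly in submission order; a real
driver or controller is free to reorder or merge them.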

	Brad Boyer
	flar@allandria.com

On Wed, Dec 17, 2003 at 04:25:11PM -0700, jshankar wrote:
> Please provide some more insight.
> 
> Suppose a filesystem issues a write command to the disk with around ten 4K
> blocks to be written. From the SCSI device's point of view, I don't see where
> the parallel I/O is: it has only one write command. If some other process
> sends a write request, that request needs to be queued. But then the next
> question is how the write data is handled. Does the SCSI device not give a
> response for each block of data written? In other words, is the response
> given only after all the blocks of data for a single write request have been
> written?


* Re: ext3 file system
  2003-12-17 23:25 ext3 file system jshankar
  2003-12-17 23:59 ` Brad Boyer
@ 2003-12-18  1:25 ` Hans Reiser
  2003-12-18 14:17 ` Richard B. Johnson
  2 siblings, 0 replies; 10+ messages in thread
From: Hans Reiser @ 2003-12-18  1:25 UTC (permalink / raw)
  To: jshankar; +Cc: Richard B. Johnson, Mike Fedyk, linux-fsdevel, linux-kernel

jshankar wrote:

>Hello,
>
>Please provide some more insight.
>
>Suppose a filesystem issues a write command to the disk with around ten 4K
>blocks to be written. From the SCSI device's point of view, I don't see where
>the parallel I/O is: it has only one write command. If some other process
>sends a write request, that request needs to be queued. But then the next
>question is how the write data is handled. Does the SCSI device not give a
>response for each block of data written? In other words, is the response
>given only after all the blocks of data for a single write request have been
>written?
>
>Thanks
>Jay
>
>>===== Original Message From Mike Fedyk <mfedyk@matchmail.com> =====
>>On Wed, Dec 17, 2003 at 05:25:49PM -0500, Richard B. Johnson wrote:
>>>to the physical media. There are special file-systems (journaling)
>>>that guarantee that something, enough to recover the data, is
>>>written at periodic intervals.
>>
>>Most journaling filesystems make guarantees on the filesystem meta-data, but
>>not on the data.  Some, like ext3 and reiserfs (with SuSE's journaling
>>patch), can journal the data, or order things so that the data is written
>>before any pointers (i.e. meta-data) make it to the disk, so it will be harder
>>to lose data.
>
Filesystems don't usually wait on the IO to complete before submitting 
more IO in response to the next write() syscall.  They can do this by 
batching a whole bunch of operations into one committed transaction.

In reiser4 we do this more carefully than other filesystems such as 
reiserfs v3, and as a result every fs operation is fully atomic.

-- 
Hans




* RE: ext3 file system
@ 2003-12-18  4:47 jshankar
  2003-12-18  8:39 ` Mike Fedyk
  0 siblings, 1 reply; 10+ messages in thread
From: jshankar @ 2003-12-18  4:47 UTC (permalink / raw)
  To: Hans Reiser; +Cc: linux-fsdevel, linux-kernel

Hello Hans,

>Filesystems don't usually wait on the IO to complete before submitting
>more IO in response to the next write() syscall.  They can do this by
>batching a whole bunch of operations into one committed transaction.
>

Is there a timeout mechanism for batching operations? What if a certain
operation is issued after the batch has been executed? Does it mean that
the new operation has to wait?

Thanks
Jay


>===== Original Message From Hans Reiser <reiser@namesys.com> =====
>jshankar wrote:
>
>>Hello,
>>
>>Please provide some more insight.
>>
>>Suppose a filesystem issues a write command to the disk with around ten 4K
>>blocks to be written. From the SCSI device's point of view, I don't see where
>>the parallel I/O is: it has only one write command. If some other process
>>sends a write request, that request needs to be queued. But then the next
>>question is how the write data is handled. Does the SCSI device not give a
>>response for each block of data written? In other words, is the response
>>given only after all the blocks of data for a single write request have been
>>written?
>>
>>Thanks
>>Jay
>>
>>>===== Original Message From Mike Fedyk <mfedyk@matchmail.com> =====
>>>On Wed, Dec 17, 2003 at 05:25:49PM -0500, Richard B. Johnson wrote:
>>>>to the physical media. There are special file-systems (journaling)
>>>>that guarantee that something, enough to recover the data, is
>>>>written at periodic intervals.
>>>
>>>Most journaling filesystems make guarantees on the filesystem meta-data, but
>>>not on the data.  Some, like ext3 and reiserfs (with SuSE's journaling
>>>patch), can journal the data, or order things so that the data is written
>>>before any pointers (i.e. meta-data) make it to the disk, so it will be harder
>>>to lose data.
>>
>In reiser4 we do this more carefully than other filesystems such as
>reiserfs v3, and as a result every fs operation is fully atomic.
>
>--
>Hans



* Re: ext3 file system
  2003-12-18  4:47 jshankar
@ 2003-12-18  8:39 ` Mike Fedyk
  2003-12-18 10:41   ` Hans Reiser
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Fedyk @ 2003-12-18  8:39 UTC (permalink / raw)
  To: jshankar; +Cc: Hans Reiser, linux-fsdevel, linux-kernel

On Wed, Dec 17, 2003 at 09:47:59PM -0700, jshankar wrote:
> Hello Hans,
> 
> >Filesystems don't usually wait on the IO to complete before submitting
> >more IO in response to the next write() syscall.  They can do this by
> >batching a whole bunch of operations into one committed transaction.
> >
> 
> Is there a timeout mechanism for batching operations? What if a certain
> operation is issued after the batch has been executed? Does it mean that
> the new operation has to wait?

You don't have to wait unless you run out of available non-dirty memory, or
issue a call to sync to the disks.


* Re: ext3 file system
  2003-12-18  8:39 ` Mike Fedyk
@ 2003-12-18 10:41   ` Hans Reiser
  0 siblings, 0 replies; 10+ messages in thread
From: Hans Reiser @ 2003-12-18 10:41 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: jshankar, linux-fsdevel, linux-kernel

Mike Fedyk wrote:

>On Wed, Dec 17, 2003 at 09:47:59PM -0700, jshankar wrote:
>>Hello Hans,
>>
>>>Filesystems don't usually wait on the IO to complete before submitting
>>>more IO in response to the next write() syscall.  They can do this by
>>>batching a whole bunch of operations into one committed transaction.
>>>
>>Is there a timeout mechanism for batching operations?
>>
At some point due to its age or size you decide the batch needs to commit.
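
A rough sketch of that age-or-size rule (Python; ToyTransaction is an
invented name and the two thresholds are arbitrary assumptions, not the
values any particular filesystem uses): operations are appended to the
open batch without waiting, and the batch is closed once it holds too
many operations or has been open too long.

```python
import time

class ToyTransaction:
    """Collects operations; commits when the batch is too old or too big."""
    def __init__(self, max_ops=4, max_age=5.0, clock=time.monotonic):
        self.max_ops = max_ops      # size threshold (assumed value)
        self.max_age = max_age      # age threshold in seconds (assumed value)
        self.clock = clock          # injectable so the age rule can be tested
        self.ops = []
        self.opened_at = None
        self.committed = []         # list of committed batches

    def add(self, op):
        """Record an operation without waiting for any I/O."""
        if not self.ops:
            self.opened_at = self.clock()
        self.ops.append(op)
        if self.should_commit():
            self.commit()

    def should_commit(self):
        if not self.ops:
            return False
        too_big = len(self.ops) >= self.max_ops
        too_old = self.clock() - self.opened_at >= self.max_age
        return too_big or too_old

    def commit(self):
        """Close the current batch; later operations start a new one."""
        if self.ops:
            self.committed.append(self.ops)
            self.ops = []
            self.opened_at = None
```

For brevity the age check here only runs when a new operation arrives; a
real implementation would presumably also use a timer so that an idle
batch still gets committed.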

>>What if a certain operation is issued after the batch has been executed?
>>Does it mean that the new operation has to wait?
>
>You don't have to wait unless you run out of available non-dirty memory, or
>issue a call to sync to the disks.


-- 
Hans




* RE: ext3 file system
  2003-12-17 23:25 ext3 file system jshankar
  2003-12-17 23:59 ` Brad Boyer
  2003-12-18  1:25 ` Hans Reiser
@ 2003-12-18 14:17 ` Richard B. Johnson
  2 siblings, 0 replies; 10+ messages in thread
From: Richard B. Johnson @ 2003-12-18 14:17 UTC (permalink / raw)
  To: jshankar; +Cc: Mike Fedyk, linux-fsdevel, linux-kernel

On Wed, 17 Dec 2003, jshankar wrote:

> Hello,
>
> Please provide some more insight.
>
> Suppose a filesystem issues a write command to the disk with around ten 4K
> blocks to be written. From the SCSI device's point of view, I don't see where
> the parallel I/O is: it has only one write command. If some other process
> sends a write request, that request needs to be queued. But then the next
> question is how the write data is handled. Does the SCSI device not give a
> response for each block of data written? In other words, is the response
> given only after all the blocks of data for a single write request have been
> written?
>
> Thanks
> Jay

I guess you completely misunderstand. Any I/O to the physical devices
is completely asynchronous. There is no relationship between when
an application writes a buffer of data to a file, and when it gets
written to the physical media. This includes the device-file,
i.e., the raw device with no file-system.

What sits behind the VFS (Virtual File System) layer is a cache of
kernel buffers. It acts as a RAM-Disk, with all user data going to and
from that RAM-Disk.

In principle, many temporary files never even get written to
the physical device. They are created, written, read, then
deleted long before there is any reason to write to the physical
media. Writing to physical media is a performance bottle-neck.

Eventually, the supply of kernel buffers used to keep the
file-system data might get short. When it does, the kernel
writes (through the drivers) data to the devices using a
LRU (least recently used) algorithm. This write also is
asynchronous. It gets handed-off to a SCSI, or IDE, or whatever,
driver which should eventually get the data into the drives.
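
The write-back-when-short idea can be modelled in a few lines (Python;
ToyBufferCache is an invented name, a plain dict stands in for the
device, and the real kernel logic is far more involved): writes land in
RAM, and only when the cache is over capacity is the least recently
used block pushed out.

```python
from collections import OrderedDict

class ToyBufferCache:
    """Holds written blocks in RAM; flushes least recently used when full."""
    def __init__(self, capacity, device):
        self.capacity = capacity
        self.device = device            # dict standing in for the disk
        self.buffers = OrderedDict()    # block number -> data, oldest first

    def write(self, block_no, data):
        self.buffers[block_no] = data
        self.buffers.move_to_end(block_no)      # mark as most recently used
        while len(self.buffers) > self.capacity:
            old_no, old_data = self.buffers.popitem(last=False)
            self.device[old_no] = old_data      # write back the LRU block

    def read(self, block_no):
        if block_no in self.buffers:            # served from RAM
            self.buffers.move_to_end(block_no)
            return self.buffers[block_no]
        return self.device[block_no]            # has to come from the disk

    def sync(self):
        """Flush everything that is still only in RAM."""
        while self.buffers:
            no, data = self.buffers.popitem(last=False)
            self.device[no] = data
```

As a simplification, a block read back from the "device" is not put
back into the cache, and write-back here is synchronous rather than
handed off to a driver.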

In the meantime, the devices may time-out, there may be errors
that require the writes to be retried, etc. Eventually the
operating system will be notified that a write succeeded so
that particular amount of RAM containing the data can be
freed.

Even if there are errors, a subsequent read of the data, which
comes from RAM will succeed. It is only after that data gets
to the drive that subsequent reads may require the data to be
re-read from the drive.

All this work executes in parallel with the work of the
application software. Notification of the success or failure
of a particular operation is handled in the drivers using
an interrupt. With common SCSI controllers, data are transferred
using Bus-Master DMA so the CPU continues handling user and
kernel code while the DMA is occurring. The CPU is not locked-
off the bus during DMA so there is additional parallelism
under these conditions.

At the SCSI device-driver level, typically a data block is
built that tells the SCSI controller all it needs to know
about the transfer. The controller is then "told" to execute
the command. The success or failure of the command is determined
by some status read in an interrupt. The controller does whatever
it needs to do, to get the data to the drive without using
the CPU at all. This means that the CPU can be executing code
(doing useful work) in parallel with the data transfer.

You can force the file-systems to write their data to the
physical media by executing sync(). This is not a good thing
to do very often if you expect any reasonable performance.

Apart from an explicit sync(), the only time all the data is guaranteed
to reach the drive(s) is when the file-systems are dismounted (umount).
This gets all the data into the drives and severs the logical connection
between your applications and the file-systems that you just dismounted.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
            Note 96.31% of all statistics are fiction.
