All of lore.kernel.org
 help / color / mirror / Atom feed
* Desynchronizing dm-raid1
       [not found]                   ` <20080319011757.GD12007@agk.fab.redhat.com>
@ 2008-04-02 20:23                     ` Mikulas Patocka
  2008-04-02 22:13                       ` Mikulas Patocka
                                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Mikulas Patocka @ 2008-04-02 20:23 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: dm-devel

Hi

Unfortunatelly, the bug with desychnronizing raid1 that someone pointed 
out on Monday, is real. The bug happens when you modify the page while its 
being written to raid1 device --- old version can be written to one mirror 
leg, the new versions to the other mirror leg. Raid1 code does not notice 
this, marks the region clean after the writes finish, and the volume stays 
desynchronized.

The possibilities, how data can be modified while they are being written.

1. an application does O_DIRECT IO and modifies the memory underway.

--- this is a problem of the application and we don't have to care about 
it.

2. an application maps file for writing. pdflush or kswapd daemon writes 
the page on background while the application is modifying it.

3. an application writes to a page with write() syscall. This syscall 
can race with pdflush or kswapd as well.

4. a filesystem modifies the buffer while its being written by pdflush or 
kswapd daemons.


The pdflush and kswapd daemons run in background and do periodic writes of 
the modified data. pdflush is triggered regularly and writes data in 
specified interval (about 30 seconds), so that in case of crash, the image 
on disk is not too old. kswapd is triggered when the free memory goes low 
--- it writes file pages and filesystem buffers too.

In cases 2,3,4 the data may be modified while they are being written, 
but the kernel writes them later again. The sequence is something like:
clear dirty bit
submit IO
--- if the data are modified while the IO is in progress, the dirty bit is 
turned on again and the data will be written later and possible data 
corruption is corrected. --- so as long as the system does not crash, 
there can't be desynchronized mirror.

But if the system crashes before the data are written second time, the 
blocks may stay desynchronized.

An example of data corruption on ext2:

We have a dirty bitmap buffer
Pdflush clears the dirty flag and starts writing the buffer
The write is submitted to dm-raid1, it makes two requests and submits them 
to two mirror devices

This operation races with another thread allocating a block on ext2 and 
doing:
ext2_new_blocks
calling read_block_bitmap
 	calling sb_getblk
 	calling bh_uptodate_or_lock --- sees that the buffer is uptodate 
(even if it's under write), so it returns.
calling ext2_try_to_allocate_with_rsv
 	calling ext2_try_to_allocate
 		calling ext2_set_bit_atomic --- this modifies the bitmap
 		*** now suppose that 2nd mirror device already finished
 		its write and don't get updated bit, while 1st mirror
 		device writes the updated bit to disk.
calling mark_buffer_dirty --- this schedules new update of the buffer 
(after several seconds)

Both writes finished, dm-raid1 driver turns off the dirty bit for the 
region.

Before pdflush writes the buffer second time, we get a
***CRASH***

After new boot, dm-raid1 doesn't update the region, because the region's 
bit is off. fsck scans the device. It reads the bitmap from the first 
device, sees that the bit is correctly set and doesn't write the bitmap.

Some times later, the administrator removes the 1st disk, the kernel 
starts reading from 2nd mirror. Ext2 allocates another file, it reads the 
bitmap from the 2nd device, sees the bit is off and allocates another 
block there. Now there is data corruption => two files pointing to the 
same block.


Ideas how to fix it:

1. lock the buffers and unmap the pages while they are being written.
--- upstream developers would likely reject it. No other driver than 
dm-raid1 has problems with this and they wouldn't damp performance because 
of one driver.

2. never turn the region dirty bit off until the filesystem is unmounted.
--- simplest fix. If the computer crashes after a long time, it 
resynchronizes the whole device. md-raid resynchronizes the whole device 
after a crash too.

3. turn off the bit if the block wasn't written in one pdflush period
--- requires an interaction with pdflush, rather complex, I wouldn't 
recommend it.

4. make more region states.
--- If the region is in RH_DIRTY state and all writes drain, the state is 
changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or 
not). The disk dirty flag is kept.
--- periodically (once in few minutes, so that it doesn't affect 
performance much), the change all regions in RH_MAYBE_DIRTY state to 
RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the 
sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been 
written during the sync()), it is moved to RH_CLEAN state and the on-disk 
bit for the region is turned off.

If one of the above scenarios 2,3,4 happened (modifying a buffer while 
it's under the disk write), the the sync() would have written the buffer 
again and kicked the region out of RH_CLEAN_CANDIDATE state. If the sync() 
didn't touch the buffer than we are sure that both on-disk copies are 
synchronized.


Do you have any other ideas on this?

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-02 20:23                     ` Desynchronizing dm-raid1 Mikulas Patocka
@ 2008-04-02 22:13                       ` Mikulas Patocka
  2008-04-03  1:40                       ` malahal
  2008-04-03  9:19                       ` Heinz Mauelshagen
  2 siblings, 0 replies; 30+ messages in thread
From: Mikulas Patocka @ 2008-04-02 22:13 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: dm-devel

> No other driver than dm-raid1 
> has problems with this and they wouldn't damp performance because of one 
> driver.

--- so I found that md-raid-[156] recently (2.6.13 or so) added a bitmap 
mode and when this is used (argument --bitmap to mdadm), it is vulnerable 
to this desynchronization bug too. Before, it synced the whole device when 
it crashed without shutdown, so it was OK.

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-02 20:23                     ` Desynchronizing dm-raid1 Mikulas Patocka
  2008-04-02 22:13                       ` Mikulas Patocka
@ 2008-04-03  1:40                       ` malahal
  2008-04-03 14:49                         ` Martin K. Petersen
                                           ` (2 more replies)
  2008-04-03  9:19                       ` Heinz Mauelshagen
  2 siblings, 3 replies; 30+ messages in thread
From: malahal @ 2008-04-03  1:40 UTC (permalink / raw)
  To: dm-devel

Mikulas Patocka [mpatocka@redhat.com] wrote:
> Ideas how to fix it:
>
> 1. lock the buffers and unmap the pages while they are being written.
> --- upstream developers would likely reject it. No other driver than 
> dm-raid1 has problems with this and they wouldn't damp performance because 
> of one driver.

Very few drivers require it, so how about an interface to lock the pages
of an I/O available to drivers. Only needed RAID drivers would lock the
I/O while it is in progress and they only pay the performance penalty.
mmap pages are a bit tricky. They need to go into read-only mode when an
I/O is in progress. I know this would likely be rejected too!!!

> 4. make more region states.
> --- If the region is in RH_DIRTY state and all writes drain, the state is 
> changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or 
> not). The disk dirty flag is kept.
> --- periodically (once in few minutes, so that it doesn't affect 
> performance much), the change all regions in RH_MAYBE_DIRTY state to 
> RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the 
> sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been 
> written during the sync()), it is moved to RH_CLEAN state and the on-disk 
> bit for the region is turned off.

Sounds good except that it uses sync()! Is there a way to sync only
pages related to a certain block device? How hard it is to implement
such an interface?

--Malahal.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-02 20:23                     ` Desynchronizing dm-raid1 Mikulas Patocka
  2008-04-02 22:13                       ` Mikulas Patocka
  2008-04-03  1:40                       ` malahal
@ 2008-04-03  9:19                       ` Heinz Mauelshagen
  2008-04-03 14:21                         ` malahal
  2008-04-07 23:38                         ` Mikulas Patocka
  2 siblings, 2 replies; 30+ messages in thread
From: Heinz Mauelshagen @ 2008-04-03  9:19 UTC (permalink / raw)
  To: device-mapper development; +Cc: Alasdair G Kergon


See below [HM].

On Wed, Apr 02, 2008 at 04:23:41PM -0400, Mikulas Patocka wrote:
> Hi
>
> Unfortunatelly, the bug with desychnronizing raid1 that someone pointed out 
> on Monday, is real. The bug happens when you modify the page while its 
> being written to raid1 device --- old version can be written to one mirror 
> leg, the new versions to the other mirror leg. Raid1 code does not notice 
> this, marks the region clean after the writes finish, and the volume stays 
> desynchronized.
>
> The possibilities, how data can be modified while they are being written.
>
> 1. an application does O_DIRECT IO and modifies the memory underway.
>
> --- this is a problem of the application and we don't have to care about 
> it.
>
> 2. an application maps file for writing. pdflush or kswapd daemon writes 
> the page on background while the application is modifying it.
>
> 3. an application writes to a page with write() syscall. This syscall can 
> race with pdflush or kswapd as well.
>
> 4. a filesystem modifies the buffer while its being written by pdflush or 
> kswapd daemons.
>
>
> The pdflush and kswapd daemons run in background and do periodic writes of 
> the modified data. pdflush is triggered regularly and writes data in 
> specified interval (about 30 seconds), so that in case of crash, the image 
> on disk is not too old. kswapd is triggered when the free memory goes low 
> --- it writes file pages and filesystem buffers too.
>
> In cases 2,3,4 the data may be modified while they are being written, but 
> the kernel writes them later again. The sequence is something like:
> clear dirty bit
> submit IO
> --- if the data are modified while the IO is in progress, the dirty bit is 
> turned on again and the data will be written later and possible data 
> corruption is corrected. --- so as long as the system does not crash, there 
> can't be desynchronized mirror.
>
> But if the system crashes before the data are written second time, the 
> blocks may stay desynchronized.
>
> An example of data corruption on ext2:
>
> We have a dirty bitmap buffer
> Pdflush clears the dirty flag and starts writing the buffer
> The write is submitted to dm-raid1, it makes two requests and submits them 
> to two mirror devices
>
> This operation races with another thread allocating a block on ext2 and 
> doing:

[HM] And taking out a copy unlocked in the RAID driver ain't help
application data integrity, because it could still change the data while
the RAID driver is copying, hence leading to coherency on the
RAID set but holding incorrect application data.
One can argue that this is ok in case of a crash, because the application
failed to flush any page changes and hence has to be capable to recover
from this.
We always will end up with consistent mirrors (either on multiplicated
successful writes to all legs or after resynchronization of the mirror)
at the cost of internal caching of pages.

> ext2_new_blocks
> calling read_block_bitmap
> 	calling sb_getblk
> 	calling bh_uptodate_or_lock --- sees that the buffer is uptodate (even if 
> it's under write), so it returns.
> calling ext2_try_to_allocate_with_rsv
> 	calling ext2_try_to_allocate
> 		calling ext2_set_bit_atomic --- this modifies the bitmap
> 		*** now suppose that 2nd mirror device already finished
> 		its write and don't get updated bit, while 1st mirror
> 		device writes the updated bit to disk.
> calling mark_buffer_dirty --- this schedules new update of the buffer 
> (after several seconds)
>
> Both writes finished, dm-raid1 driver turns off the dirty bit for the 
> region.
>
> Before pdflush writes the buffer second time, we get a
> ***CRASH***
>
> After new boot, dm-raid1 doesn't update the region, because the region's 
> bit is off. fsck scans the device. It reads the bitmap from the first 
> device, sees that the bit is correctly set and doesn't write the bitmap.
>
> Some times later, the administrator removes the 1st disk, the kernel starts 
> reading from 2nd mirror. Ext2 allocates another file, it reads the bitmap 
> from the 2nd device, sees the bit is off and allocates another block there. 
> Now there is data corruption => two files pointing to the same block.
>
>
> Ideas how to fix it:
>
> 1. lock the buffers and unmap the pages while they are being written.
> --- upstream developers would likely reject it. No other driver than 
> dm-raid1 has problems with this and they wouldn't damp performance because 
> of one driver.

[HM] md RAID456 and dm RAID45 don't have the raid1 problem, because
they utilize stripe caches, hence tacking page copies. Application pages
can change nonetheless vs. stripe cache pages.

>
> 2. never turn the region dirty bit off until the filesystem is unmounted.
> --- simplest fix. If the computer crashes after a long time, it 
> resynchronizes the whole device. md-raid resynchronizes the whole device 
> after a crash too.

[HM] We wouldn't resync the whole device, just dirty regions.
Of course the whole device would be the worst case with a huge
write data set.

For obvious reasons this is not what we want performamce-wise...

>
> 3. turn off the bit if the block wasn't written in one pdflush period
> --- requires an interaction with pdflush, rather complex, I wouldn't 
> recommend it.
>
> 4. make more region states.
> --- If the region is in RH_DIRTY state and all writes drain, the state is 
> changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or 
> not). The disk dirty flag is kept.
> --- periodically (once in few minutes, so that it doesn't affect 
> performance much), the change all regions in RH_MAYBE_DIRTY state to 
> RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the 
> sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been 
> written during the sync()), it is moved to RH_CLEAN state and the on-disk 
> bit for the region is turned off.

[HM] This is essentially one technical approach for my comment on 2. above.
RH_MAYBE_DIRTY sounds superfluous at first glance, because when all writes
to a region drained, we can set RH_CLEAN_CANDIDATE, run the sync() and check
if that state persists in order to trigger the dirty log update.

Heinz

>
> If one of the above scenarios 2,3,4 happened (modifying a buffer while it's 
> under the disk write), the the sync() would have written the buffer again 
> and kicked the region out of RH_CLEAN_CANDIDATE state. If the sync() didn't 
> touch the buffer than we are sure that both on-disk copies are 
> synchronized.
>
>
> Do you have any other ideas on this?
>
> Mikulas
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-03  9:19                       ` Heinz Mauelshagen
@ 2008-04-03 14:21                         ` malahal
  2008-04-07 14:25                           ` Heinz Mauelshagen
  2008-04-07 23:38                         ` Mikulas Patocka
  1 sibling, 1 reply; 30+ messages in thread
From: malahal @ 2008-04-03 14:21 UTC (permalink / raw)
  To: dm-devel

Heinz Mauelshagen [mauelshagen@redhat.com] wrote:
> 
> [HM] md RAID456 and dm RAID45 don't have the raid1 problem, because
> they utilize stripe caches, hence tacking page copies. Application pages
> can change nonetheless vs. stripe cache pages.

I wish they didn't make copies of data pages for the sake of
performance! If they did make copies for all of their I/O, they don't
have this problem.

> > 4. make more region states.
> > --- If the region is in RH_DIRTY state and all writes drain, the state is 
> > changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or 
> > not). The disk dirty flag is kept.
> > --- periodically (once in few minutes, so that it doesn't affect 
> > performance much), the change all regions in RH_MAYBE_DIRTY state to 
> > RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the 
> > sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been 
> > written during the sync()), it is moved to RH_CLEAN state and the on-disk 
> > bit for the region is turned off.
> 
> [HM] This is essentially one technical approach for my comment on 2. above.
> RH_MAYBE_DIRTY sounds superfluous at first glance, because when all writes
> to a region drained, we can set RH_CLEAN_CANDIDATE, run the sync() and check
> if that state persists in order to trigger the dirty log update.

I don't think the state RH_MAYBE_DIRTY is superfluous.  If the region
state is RH_CLEAN_CANDIDATE after the sync(), that means no 'write'
happened since we set RH_CLEAN_CANDIDATE. If there was any write, the
region state would be 'RH_DIRTY' or 'RH_MAYBE_DIRTY'.

--Malahal.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-03  1:40                       ` malahal
@ 2008-04-03 14:49                         ` Martin K. Petersen
  2008-04-07 17:05                         ` Martin K. Petersen
  2008-04-07 23:31                         ` Mikulas Patocka
  2 siblings, 0 replies; 30+ messages in thread
From: Martin K. Petersen @ 2008-04-03 14:49 UTC (permalink / raw)
  To: dm-devel

>>>>> "Malahal" == malahal  <malahal@us.ibm.com> writes:

>> 1. lock the buffers and unmap the pages while they are being
>> written.  --- upstream developers would likely reject it. No other
>> driver than dm-raid1 has problems with this and they wouldn't damp
>> performance because of one driver.

Malahal> Very few drivers require it, so how about an interface to
Malahal> lock the pages of an I/O available to drivers. Only needed
Malahal> RAID drivers would lock the I/O while it is in progress and
Malahal> they only pay the performance penalty.  mmap pages are a bit
Malahal> tricky. They need to go into read-only mode when an I/O is in
Malahal> progress. I know this would likely be rejected too!!!

I have exactly the same problem with the data integrity stuff I'm
working on.

Essentially a checksum gets generated when a bio is submitted and both
the I/O controller and the disk drive verify the checksum.

With ext2 in particular I often experience that the page (usually
containing directory metadata) has been modified before the controller
does the DMA.  And the I/O will then be rejected by the controller or
drive because the checksum doesn't match the data.

So this problem is not specific to DM/MD...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-03 14:21                         ` malahal
@ 2008-04-07 14:25                           ` Heinz Mauelshagen
  2008-04-07 15:41                             ` malahal
  0 siblings, 1 reply; 30+ messages in thread
From: Heinz Mauelshagen @ 2008-04-07 14:25 UTC (permalink / raw)
  To: dm-devel; +Cc: hjm

On Thu, Apr 03, 2008 at 07:21:54AM -0700, malahal@us.ibm.com wrote:
> Heinz Mauelshagen [mauelshagen@redhat.com] wrote:
> > 
> > [HM] md RAID456 and dm RAID45 don't have the raid1 problem, because
> > they utilize stripe caches, hence tacking page copies. Application pages
> > can change nonetheless vs. stripe cache pages.
> 
> I wish they didn't make copies of data pages for the sake of
> performance! If they did make copies for all of their I/O, they don't
> have this problem.

Me too but it's mandatory to be able to calculate parity chunks ;)

> 
> > > 4. make more region states.
> > > --- If the region is in RH_DIRTY state and all writes drain, the state is 
> > > changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or 
> > > not). The disk dirty flag is kept.
> > > --- periodically (once in few minutes, so that it doesn't affect 
> > > performance much), the change all regions in RH_MAYBE_DIRTY state to 
> > > RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the 
> > > sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been 
> > > written during the sync()), it is moved to RH_CLEAN state and the on-disk 
> > > bit for the region is turned off.
> > 
> > [HM] This is essentially one technical approach for my comment on 2. above.
> > RH_MAYBE_DIRTY sounds superfluous at first glance, because when all writes
> > to a region drained, we can set RH_CLEAN_CANDIDATE, run the sync() and check
> > if that state persists in order to trigger the dirty log update.
> 
> I don't think the state RH_MAYBE_DIRTY is superfluous.  If the region
> state is RH_CLEAN_CANDIDATE after the sync(), that means no 'write'
> happened since we set RH_CLEAN_CANDIDATE. If there was any write, the
> region state would be 'RH_DIRTY' or 'RH_MAYBE_DIRTY'.

Hrm, sound like a contradiction in your statement.
Either it stays RH_CLEAN_CANDIDATE because of no writes *or*
it's state-changing to RH_DIRTY, no ?

Heinz

> 
> --Malahal.
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-07 14:25                           ` Heinz Mauelshagen
@ 2008-04-07 15:41                             ` malahal
  2008-04-07 23:27                               ` Mikulas Patocka
  0 siblings, 1 reply; 30+ messages in thread
From: malahal @ 2008-04-07 15:41 UTC (permalink / raw)
  To: dm-devel

Heinz Mauelshagen [mauelshagen@redhat.com] wrote:
> > > RH_MAYBE_DIRTY sounds superfluous at first glance, because when all writes
> > > to a region drained, we can set RH_CLEAN_CANDIDATE, run the sync() and check
> > > if that state persists in order to trigger the dirty log update.

If I understand your above description: a region's state is set to
RH_DIRTY when an I/O is scheduled in the region and is set to
RH_CLEAN_CANDIDATE when all I/O is completed. In other words, a region's
state is RH_CLEAN_CANDIDATE when there is no pending I/O to that region.
Did I get it right so far?

Then we invoke sync(). Now, if the region's state is RH_CLEAN_CANDIDATE,
you set the region's state to RH_CLEAN. If the region's state is
anything other than RH_CLEAN_CANDIDATE, you don't do anything. Am I
correct?


> > I don't think the state RH_MAYBE_DIRTY is superfluous.  If the region
> > state is RH_CLEAN_CANDIDATE after the sync(), that means no 'write'
> > happened since we set RH_CLEAN_CANDIDATE. If there was any write, the
> > region state would be 'RH_DIRTY' or 'RH_MAYBE_DIRTY'.
> 
> Hrm, sound like a contradiction in your statement.
> Either it stays RH_CLEAN_CANDIDATE because of no writes *or*
> it's state-changing to RH_DIRTY, no ?

The state would be RH_CLEAN_CANDIDATE if there were ***NO*** writes as
part of sync(). The next statement only describes what would happen if
there were any writes as part of sync().

--Malahal.
PS: Any comments from the original submitter if he thinks the state
is really superfluous?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-03  1:40                       ` malahal
  2008-04-03 14:49                         ` Martin K. Petersen
@ 2008-04-07 17:05                         ` Martin K. Petersen
  2008-04-07 17:22                           ` malahal
  2008-05-05 21:45                           ` Mikulas Patocka
  2008-04-07 23:31                         ` Mikulas Patocka
  2 siblings, 2 replies; 30+ messages in thread
From: Martin K. Petersen @ 2008-04-07 17:05 UTC (permalink / raw)
  To: device-mapper development

>>>>> "Malahal" == malahal  <malahal@us.ibm.com> writes:

[I sent this last week but it never made it to the list]

Malahal> Very few drivers require it, so how about an interface to
Malahal> lock the pages of an I/O available to drivers. Only needed
Malahal> RAID drivers would lock the I/O while it is in progress and
Malahal> they only pay the performance penalty.  mmap pages are a bit
Malahal> tricky. They need to go into read-only mode when an I/O is in
Malahal> progress. I know this would likely be rejected too!!!

I have exactly the same problem with the data integrity stuff I'm
working on.

Essentially a checksum gets generated when a bio is submitted, and
both the I/O controller and the disk drive verify the checksum.

With ext2 in particular I often experience that the page (usually
containing directory metadata) has been modified before the controller
does the DMA.  And the I/O will then be rejected by the controller or
drive because the checksum doesn't match the data.

So this problem is not specific to DM/MD...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-07 17:05                         ` Martin K. Petersen
@ 2008-04-07 17:22                           ` malahal
  2008-04-07 17:44                             ` Martin K. Petersen
  2008-05-05 21:45                           ` Mikulas Patocka
  1 sibling, 1 reply; 30+ messages in thread
From: malahal @ 2008-04-07 17:22 UTC (permalink / raw)
  To: dm-devel

Martin K. Petersen [mkp@mkp.net] wrote:
> >>>>> "Malahal" == malahal  <malahal@us.ibm.com> writes:
> 
> [I sent this last week but it never made it to the list]
> 
> Malahal> Very few drivers require it, so how about an interface to
> Malahal> lock the pages of an I/O available to drivers. Only needed
> Malahal> RAID drivers would lock the I/O while it is in progress and
> Malahal> they only pay the performance penalty.  mmap pages are a bit
> Malahal> tricky. They need to go into read-only mode when an I/O is in
> Malahal> progress. I know this would likely be rejected too!!!
> 
> I have exactly the same problem with the data integrity stuff I'm
> working on.
> 
> Essentially a checksum gets generated when a bio is submitted, and
> both the I/O controller and the disk drive verify the checksum.
> 
> With ext2 in particular I often experience that the page (usually
> containing directory metadata) has been modified before the controller
> does the DMA.  And the I/O will then be rejected by the controller or
> drive because the checksum doesn't match the data.

Your problem is very similar to an iSCSI problem sumitted here:
http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=137902

Fortunately, you can detect the problem and the I/O can be retried if
possible. In the RAID case, it goes undetected until you hit the
eventual corruption!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-07 17:22                           ` malahal
@ 2008-04-07 17:44                             ` Martin K. Petersen
  0 siblings, 0 replies; 30+ messages in thread
From: Martin K. Petersen @ 2008-04-07 17:44 UTC (permalink / raw)
  To: dm-devel

>>>>> "Malahal" == malahal  <malahal@us.ibm.com> writes:

Malahal> Your problem is very similar to an iSCSI problem sumitted
Malahal> here:
Malahal> http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=137902

Can't see it without a netapp account...


Malahal> Fortunately, you can detect the problem and the I/O can be
Malahal> retried if possible.

That's not entirely trivial.  Obviously the original data that matches
the checksum is gone.  And I don't want to blindly regenerate the
checksum from the new data (how do I know this was an "ok" kind of
corruption?).

The only one that has that knowledge is the filesystem.  And if the
filesystem needs to be integrity-aware, I'd much rather have it Do The
Right Thing than teach it to inspect and reissue I/Os that come back
with -EDANGERWILLROBINSON.

With ext2 at least this is not some rare corner case.  It happens
hundreds of times during an unpack of a kernel tarball.  We'd end up
burning many, many cycles doing retries.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-07 15:41                             ` malahal
@ 2008-04-07 23:27                               ` Mikulas Patocka
  2008-04-08  0:04                                 ` malahal
  0 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2008-04-07 23:27 UTC (permalink / raw)
  To: device-mapper development

On Mon, 7 Apr 2008, malahal@us.ibm.com wrote:

> Heinz Mauelshagen [mauelshagen@redhat.com] wrote:
>>>> RH_MAYBE_DIRTY sounds superfluous at first glance, because when all writes
>>>> to a region drained, we can set RH_CLEAN_CANDIDATE, run the sync() and check
>>>> if that state persists in order to trigger the dirty log update.
>
> If I understand your above description: a region's state is set to
> RH_DIRTY when an I/O is scheduled in the region and is set to
> RH_CLEAN_CANDIDATE when all I/O is completed. In other words, a region's
> state is RH_CLEAN_CANDIDATE when there is no pending I/O to that region.
> Did I get it right so far?
>
> Then we invoke sync(). Now, if the region's state is RH_CLEAN_CANDIDATE,

The problem is, that if sync() wrote to that region, it waits until all 
writes drain and sets the region to RH_CLEAN_CANDIDATE too. So we couldn't 
distinguish the cases, if the sync() wrote to the region or not.

During this sync() write, anyone else may modify the data too, so if the 
sync() touched the region, we don't know if it's in-sync or isn't.

> you set the region's state to RH_CLEAN. If the region's state is
> anything other than RH_CLEAN_CANDIDATE, you don't do anything. Am I
> correct?
>
>>> I don't think the state RH_MAYBE_DIRTY is superfluous.  If the region
>>> state is RH_CLEAN_CANDIDATE after the sync(), that means no 'write'
>>> happened since we set RH_CLEAN_CANDIDATE. If there was any write, the
>>> region state would be 'RH_DIRTY' or 'RH_MAYBE_DIRTY'.
>>
>> Hrm, sound like a contradiction in your statement.
>> Either it stays RH_CLEAN_CANDIDATE because of no writes *or*
>> it's state-changing to RH_DIRTY, no ?
>
> The state would be RH_CLEAN_CANDIDATE if there were ***NO*** writes as
> part of sync(). The next statement only describes what would happen if
> there were any writes as part of sync().

The sync() does write just like any other process, so it may create 
RH_CLEAN_CANDIDATE state too. To check that the region was not written 
during sync(), you need an extra state and make sure that no other code 
sets this state, except your syncing thread.

Mikulas

> --Malahal.
> PS: Any comments from the original submitter if he thinks the state
> is really superfluous?
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-03  1:40                       ` malahal
  2008-04-03 14:49                         ` Martin K. Petersen
  2008-04-07 17:05                         ` Martin K. Petersen
@ 2008-04-07 23:31                         ` Mikulas Patocka
  2 siblings, 0 replies; 30+ messages in thread
From: Mikulas Patocka @ 2008-04-07 23:31 UTC (permalink / raw)
  To: device-mapper development

On Wed, 2 Apr 2008, malahal@us.ibm.com wrote:

> Mikulas Patocka [mpatocka@redhat.com] wrote:
>> Ideas how to fix it:
>>
>> 1. lock the buffers and unmap the pages while they are being written.
>> --- upstream developers would likely reject it. No other driver than
>> dm-raid1 has problems with this and they wouldn't damp performance because
>> of one driver.
>
> Very few drivers require it, so how about an interface to lock the pages
> of an I/O available to drivers.

Two rules of locking (one for RAID and the other for non-RAID) would 
create many bugs.

> Only needed RAID drivers would lock the
> I/O while it is in progress and they only pay the performance penalty.
> mmap pages are a bit tricky. They need to go into read-only mode when an
> I/O is in progress. I know this would likely be rejected too!!!
>
>> 4. make more region states.
>> --- If the region is in RH_DIRTY state and all writes drain, the state is
>> changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or
>> not). The disk dirty flag is kept.
>> --- periodically (once in few minutes, so that it doesn't affect
>> performance much), the change all regions in RH_MAYBE_DIRTY state to
>> RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the
>> sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been
>> written during the sync()), it is moved to RH_CLEAN state and the on-disk
>> bit for the region is turned off.
>
> Sounds good except that it uses sync()! Is there a way to sync only
> pages related to a certain block device? How hard it is to implement
> such an interface?

No, because the dirty pages may be everywhere. I.e. on filesystem you 
create loopback device with lvm2 devices with another filesystem, 
containing another loopback ... etc.

You'd need to build a DAG of dependent devices in the kernel and there is 
no such graph currently.

If you do sync() once 15 minutes or so, I think it wouldn't have 
performance impact.

Mikulas

> --Malahal.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-03  9:19                       ` Heinz Mauelshagen
  2008-04-03 14:21                         ` malahal
@ 2008-04-07 23:38                         ` Mikulas Patocka
  1 sibling, 0 replies; 30+ messages in thread
From: Mikulas Patocka @ 2008-04-07 23:38 UTC (permalink / raw)
  To: mauelshagen, device-mapper development; +Cc: Alasdair G Kergon

On Thu, 3 Apr 2008, Heinz Mauelshagen wrote:

>
> See below [HM].
>
> On Wed, Apr 02, 2008 at 04:23:41PM -0400, Mikulas Patocka wrote:
>> Hi
>>
>> Unfortunatelly, the bug with desychnronizing raid1 that someone pointed out
>> on Monday, is real. The bug happens when you modify the page while its
>> being written to raid1 device --- old version can be written to one mirror
>> leg, the new versions to the other mirror leg. Raid1 code does not notice
>> this, marks the region clean after the writes finish, and the volume stays
>> desynchronized.
>>
>> The possibilities, how data can be modified while they are being written.
>>
>> 1. an application does O_DIRECT IO and modifies the memory underway.
>>
>> --- this is a problem of the application and we don't have to care about
>> it.
>>
>> 2. an application maps file for writing. pdflush or kswapd daemon writes
>> the page on background while the application is modifying it.
>>
>> 3. an application writes to a page with write() syscall. This syscall can
>> race with pdflush or kswapd as well.
>>
>> 4. a filesystem modifies the buffer while its being written by pdflush or
>> kswapd daemons.
>>
>>
>> The pdflush and kswapd daemons run in background and do periodic writes of
>> the modified data. pdflush is triggered regularly and writes data in
>> specified interval (about 30 seconds), so that in case of crash, the image
>> on disk is not too old. kswapd is triggered when the free memory goes low
>> --- it writes file pages and filesystem buffers too.
>>
>> In cases 2,3,4 the data may be modified while they are being written, but
>> the kernel writes them later again. The sequence is something like:
>> clear dirty bit
>> submit IO
>> --- if the data are modified while the IO is in progress, the dirty bit is
>> turned on again and the data will be written later and possible data
>> corruption is corrected. --- so as long as the system does not crash, there
>> can't be desynchronized mirror.
>>
>> But if the system crashes before the data are written second time, the
>> blocks may stay desynchronized.
>>
>> An example of data corruption on ext2:
>>
>> We have a dirty bitmap buffer
>> Pdflush clears the dirty flag and starts writing the buffer
>> The write is submitted to dm-raid1, it makes two requests and submits them
>> to two mirror devices
>>
>> This operation races with another thread allocating a block on ext2 and
>> doing:
>
> [HM] And taking out a copy unlocked in the RAID driver ain't help
> application data integrity, because it could still change the data while
> the RAID driver is copying, hence leading to coherency on the
> RAID set but holding incorrect application data.
> One can argue that this is ok in case of a crash, because the application
> failed to flush any page changes

The application has no way of knowing that it's data is being written. you 
call write(); fsync(); --- and if write() races with pdflush, the data may 
be very well written on that write() call. You, as an application 
developer, never know. You can only be sure that the data is certainly 
written when fsync() returns. But it may be written earlier.

> and hence has to be capable to recover from this.

Applications (such as databases) well recover from the situation that the 
data were written or from the situation that the data were not written.

But they don't recover from the situation when read() to the same data 
after a crash returns randomly old data on new data (depending on which 
mirror leg was selected for read).

> We always will end up with consistent mirrors (either on multiplicated
> successful writes to all legs or after resynchronization of the mirror)
> at the cost of internal caching of pages.

The problem is that we turn off the dirty bit after we finish write to 
both legs and never resync this region after a crash.

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-07 23:27                               ` Mikulas Patocka
@ 2008-04-08  0:04                                 ` malahal
  0 siblings, 0 replies; 30+ messages in thread
From: malahal @ 2008-04-08  0:04 UTC (permalink / raw)
  To: dm-devel

Mikulas Patocka [mpatocka@redhat.com] wrote:
> On Mon, 7 Apr 2008, malahal@us.ibm.com wrote:
>
>>>> I don't think the state RH_MAYBE_DIRTY is superfluous.  If the region
>>>> state is RH_CLEAN_CANDIDATE after the sync(), that means no 'write'
>>>> happened since we set RH_CLEAN_CANDIDATE. If there was any write, the
>>>> region state would be 'RH_DIRTY' or 'RH_MAYBE_DIRTY'.
>>>
>>> Hrm, sound like a contradiction in your statement.
>>> Either it stays RH_CLEAN_CANDIDATE because of no writes *or*
>>> it's state-changing to RH_DIRTY, no ?
>>
>> The state would be RH_CLEAN_CANDIDATE if there were ***NO*** writes as
>> part of sync(). The next statement only describes what would happen if
>> there were any writes as part of sync().
>
> The sync() does write just like any other process, so it may create 
> RH_CLEAN_CANDIDATE state too. To check that the region was not written 
> during sync(), you need an extra state and make sure that no other code 
> sets this state, except your syncing thread.

Exactly my point. Thanks for putting it better.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-04-07 17:05                         ` Martin K. Petersen
  2008-04-07 17:22                           ` malahal
@ 2008-05-05 21:45                           ` Mikulas Patocka
  2008-05-06 10:29                               ` Herbert Xu
  1 sibling, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2008-05-05 21:45 UTC (permalink / raw)
  To: linux-crypto; +Cc: device-mapper development



On Mon, 7 Apr 2008, Martin K. Petersen wrote:

>>>>>> "Malahal" == malahal  <malahal@us.ibm.com> writes:
>
> [I sent this last week but it never made it to the list]
>
> Malahal> Very few drivers require it, so how about an interface to
> Malahal> lock the pages of an I/O available to drivers. Only needed
> Malahal> RAID drivers would lock the I/O while it is in progress and
> Malahal> they only pay the performance penalty.  mmap pages are a bit
> Malahal> tricky. They need to go into read-only mode when an I/O is in
> Malahal> progress. I know this would likely be rejected too!!!
>
> I have exactly the same problem with the data integrity stuff I'm
> working on.
>
> Essentially a checksum gets generated when a bio is submitted, and
> both the I/O controller and the disk drive verify the checksum.
>
> With ext2 in particular I often experience that the page (usually
> containing directory metadata) has been modified before the controller
> does the DMA.  And the I/O will then be rejected by the controller or
> drive because the checksum doesn't match the data.
>
> So this problem is not specific to DM/MD...
>
> -- 
> Martin K. Petersen	Oracle Linux Engineering

BTW: is it guaranteed for crypto functions that they read the source data 
only once?

I.e. if you encrypt a block while another CPU modifies it, and then 
decrypt it, is it guaranteed that each byte of the result will be either 
old byte or new byte of the original data?

If not, we have a subtle case of disk corruption here too. (imagine that 
you update for example atime field in inode while the block is being 
written and the crypto interface will trash the inode block).

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [dm-devel] Desynchronizing dm-raid1
  2008-05-05 21:45                           ` Mikulas Patocka
@ 2008-05-06 10:29                               ` Herbert Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Herbert Xu @ 2008-05-06 10:29 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-crypto, dm-devel, malahal

Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> BTW: is it guaranteed for crypto functions that they read the source data 
> only once?

There is no such guarantee.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [dm-devel] Desynchronizing dm-raid1
@ 2008-05-06 10:29                               ` Herbert Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Herbert Xu @ 2008-05-06 10:29 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-crypto, dm-devel, malahal

Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> BTW: is it guaranteed for crypto functions that they read the source data 
> only once?

There is no such guarantee.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-05-06 10:29                               ` Herbert Xu
  (?)
@ 2008-05-06 22:50                               ` Mikulas Patocka
  2008-05-13  3:28                                 ` Mikulas Patocka
  -1 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2008-05-06 22:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: dm-devel, linux-crypto

On Tue, 6 May 2008, Herbert Xu wrote:

> Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>> BTW: is it guaranteed for crypto functions that they read the source data
>> only once?
>
> There is no such guarantee.
>
> Cheers,

So we'll have to copy the sector to temporary space before encrypting it.

I'll look at it.

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-05-06 22:50                               ` Mikulas Patocka
@ 2008-05-13  3:28                                 ` Mikulas Patocka
  2008-05-13  3:38                                   ` [dm-devel] " Herbert Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2008-05-13  3:28 UTC (permalink / raw)
  To: Herbert Xu; +Cc: dm-devel, linux-crypto

On Tue, 6 May 2008, Mikulas Patocka wrote:

> On Tue, 6 May 2008, Herbert Xu wrote:
>
>> Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> 
>>> BTW: is it guaranteed for crypto functions that they read the source data
>>> only once?
>> 
>> There is no such guarantee.
>> 
>> Cheers,
>
> So we'll have to copy the sector to temporary space before encrypting it.
>
> I'll look at it.
>
> Mikulas

Do you think that it would make sense to add a flag to crypto API that 
says that the encryption function reads the input only once? --- so that 
we wouldn't have to copy the data when using an algorithm which doesn't 
need it.

Or do you thint it would be useless (all major disk-encryption algorithms 
read input buffer more times?) or it would create too much code 
complexity?

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [dm-devel] Desynchronizing dm-raid1
  2008-05-13  3:28                                 ` Mikulas Patocka
@ 2008-05-13  3:38                                   ` Herbert Xu
  2008-05-13 20:35                                     ` Mikulas Patocka
  0 siblings, 1 reply; 30+ messages in thread
From: Herbert Xu @ 2008-05-13  3:38 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-crypto, dm-devel, malahal

On Mon, May 12, 2008 at 11:28:44PM -0400, Mikulas Patocka wrote:
> 
> Or do you thint it would be useless (all major disk-encryption algorithms 
> read input buffer more times?) or it would create too much code 
> complexity?

I'm open to this approach.  As long as you do the work and come up
with the patch that is :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-05-13  3:38                                   ` [dm-devel] " Herbert Xu
@ 2008-05-13 20:35                                     ` Mikulas Patocka
  2008-05-14  1:14                                       ` [dm-devel] " Herbert Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2008-05-13 20:35 UTC (permalink / raw)
  To: Herbert Xu; +Cc: dm-devel, linux-crypto



On Tue, 13 May 2008, Herbert Xu wrote:

> On Mon, May 12, 2008 at 11:28:44PM -0400, Mikulas Patocka wrote:
>>
>> Or do you thint it would be useless (all major disk-encryption algorithms
>> read input buffer more times?) or it would create too much code
>> complexity?
>
> I'm open to this approach.  As long as you do the work and come up
> with the patch that is :)
>
> Cheers,

And where would you propose to place this bit?

One possibility would be struct crypto_tfm->crt_flags
Another possibility is struct crypto_alg->cra_flags

Can chaining mode change the value of the flag? (I presume that yes)

So make a bit in struct crypto_alg. And propagate it to 
crypto_alg->cra_flags of chaining modes that read input data only once?

I'm not familiar with Linux crypto API, so the location of the flag should 
be selected by someone who knows it well.

Here I'm sending the patch for dm-crypt. To optimize it further, you 
should add the test after bio_data_dir(ctx->bio_in) == WRITE test --- i.e. 
if the algorithm is safe to read input only once, the smaller and faster 
branch can be executed.

Mikukas

---

Copy data when writing to the device.

This patch fixes damaged blocks on encrypted device when the computer crashes.

Linux IO architecture allows the data to be modified while they are being
written.

While we are processing write requests, the data may change, and it is expected,
that each byte written to the device is either old byte or new byte.

Some crypto functions may read the input buffer more than once and so they'll
produce garbage if the buffer changes under them. When this garbage is
read again and decrypted, it may produce damage in all bytes in the block. This
damage is visible only if the computer crashes; if it didn't crash, the memory
manager writes correct buffer or page contents some times later.

So we first copy data to a temporary buffer with memcpy and then encrypt them
in place.

Mikulas Patocka <mpatocka@redhat.com>

---
  drivers/md/dm-crypt.c |   14 ++++++++++++--
  1 file changed, 12 insertions(+), 2 deletions(-)

Index: linux-2.6.25.3/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.25.3.orig/drivers/md/dm-crypt.c	2008-05-13 21:48:32.000000000 +0200
+++ linux-2.6.25.3/drivers/md/dm-crypt.c	2008-05-13 22:14:12.000000000 +0200
@@ -360,8 +360,18 @@ static int crypt_convert_block(struct cr
  			 crypto_ablkcipher_alignmask(cc->tfm) + 1);

  	sg_init_table(&dmreq->sg_in, 1);
-	sg_set_page(&dmreq->sg_in, bv_in->bv_page, 1 << SECTOR_SHIFT,
-		    bv_in->bv_offset + ctx->offset_in);
+	if (bio_data_dir(ctx->bio_in) == WRITE) {
+		char *page_in = kmap_atomic(bv_in->bv_page, KM_USER0);
+		char *page_out = kmap_atomic(bv_out->bv_page, KM_USER1);
+		memcpy(page_out + bv_out->bv_offset + ctx->offset_out, page_in + bv_in->bv_offset + ctx->offset_in, 1 << SECTOR_SHIFT);
+		kunmap_atomic(page_in, KM_USER0);
+		kunmap_atomic(page_out, KM_USER1);
+
+		sg_set_page(&dmreq->sg_in, bv_out->bv_page, 1 << SECTOR_SHIFT,
+			    bv_out->bv_offset + ctx->offset_out);
+	} else
+		sg_set_page(&dmreq->sg_in, bv_in->bv_page, 1 << SECTOR_SHIFT,
+			    bv_in->bv_offset + ctx->offset_in);

  	sg_init_table(&dmreq->sg_out, 1);
  	sg_set_page(&dmreq->sg_out, bv_out->bv_page, 1 << SECTOR_SHIFT,

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [dm-devel] Desynchronizing dm-raid1
  2008-05-13 20:35                                     ` Mikulas Patocka
@ 2008-05-14  1:14                                       ` Herbert Xu
  2008-05-22  2:18                                         ` Mikulas Patocka
  0 siblings, 1 reply; 30+ messages in thread
From: Herbert Xu @ 2008-05-14  1:14 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-crypto, dm-devel, malahal

On Tue, May 13, 2008 at 04:35:03PM -0400, Mikulas Patocka wrote:
>
> And where would you propose to place this bit?
> 
> One possibility would be struct crypto_tfm->crt_flags
> Another possibility is struct crypto_alg->cra_flags

The latter definitely because this is an algorithm property.

> Can chaining mode change the value of the flag? (I presume that yes)

If you mean templates like CBC then it depends.  You should
set it to zero by default for safety but most of them should
be able to turn it on once audited.

If it turns out that the majority of algorithms support this,
you could even decide to only select those algorithms that do.
Suppose your bit is

	CRYPTO_ALG_FOO

then you could do

	crypto_alloc_blkcipher(name, CRYPTO_ALG_FOO, CRYPTO_ALG_FOO)

to demand only those algorithms that comply.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-05-14  1:14                                       ` [dm-devel] " Herbert Xu
@ 2008-05-22  2:18                                         ` Mikulas Patocka
  2008-05-22  2:42                                           ` [dm-devel] " Herbert Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2008-05-22  2:18 UTC (permalink / raw)
  To: Herbert Xu, Alasdair G Kergon; +Cc: dm-devel, linux-crypto

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1529 bytes --]



On Wed, 14 May 2008, Herbert Xu wrote:

> On Tue, May 13, 2008 at 04:35:03PM -0400, Mikulas Patocka wrote:
>>
>> And where would you propose to place this bit?
>>
>> One possibility would be struct crypto_tfm->crt_flags
>> Another possibility is struct crypto_alg->cra_flags
>
> The latter definitely because this is an algorithm property.
>
>> Can chaining mode change the value of the flag? (I presume that yes)
>
> If you mean templates like CBC then it depends.  You should
> set it to zero by default for safety but most of them should
> be able to turn it on once audited.
>
> If it turns out that the majority of algorithms support this,
> you could even decide to only select those algorithms that do.
> Suppose your bit is
>
> 	CRYPTO_ALG_FOO
>
> then you could do
>
> 	crypto_alloc_blkcipher(name, CRYPTO_ALG_FOO, CRYPTO_ALG_FOO)
>
> to demand only those algorithms that comply.
>
> Cheers,

Hi

Here I send the patches, the first one copied data in dm-crypt 
unconditionally. The second one adds a flag to coplying algorithms. The 
third one skips the copy for complying algorithms. The fourth one is 
removing useless increment in arc4 that I found while reviewing the 
ciphers.

All the ciphers comply, so the bug is only a theroretical issue (but I 
didn't check assembler versions --- they should be checked by the person 
who wrote them, assembler is write-only language).

Please review my changes to crypto code, I am not crypto developer and I 
do not understand it as well as people maintaining it.

Mikulas

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2242 bytes --]

Copy data when writing to the device.

This patch fixes damaged blocks on encrypted device when the computer crashes.

Linux IO architecture allows the data to be modified while they are being
written.

While we are processing write requests, the data may change, and it is expected,
that each byte written to the device is either old byte or new byte.

Some crypto functions may read the input buffer more than once and so they'll
produce garbage if the buffer changes under them. When this garbage is
read again and decrypted, it may produce damage in all bytes in the block. This
damage is visible only if the computer crashes; if it didn't crash, the memory
manager writes correct buffer or page contents some times later.

So we first copy data to a temporary buffer with memcpy and then encrypt them
in place.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-crypt.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Index: linux-2.6.25.3/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.25.3.orig/drivers/md/dm-crypt.c	2008-05-14 18:41:10.000000000 +0200
+++ linux-2.6.25.3/drivers/md/dm-crypt.c	2008-05-14 18:41:11.000000000 +0200
@@ -361,8 +361,18 @@ static int crypt_convert_block(struct cr
 			 crypto_ablkcipher_alignmask(cc->tfm) + 1);
 
 	sg_init_table(&dmreq->sg_in, 1);
-	sg_set_page(&dmreq->sg_in, bv_in->bv_page, 1 << SECTOR_SHIFT,
-		    bv_in->bv_offset + ctx->offset_in);
+	if (bio_data_dir(ctx->bio_in) == WRITE) {
+		char *page_in = kmap_atomic(bv_in->bv_page, KM_USER0);
+		char *page_out = kmap_atomic(bv_out->bv_page, KM_USER1);
+		memcpy(page_out + bv_out->bv_offset + ctx->offset_out, page_in + bv_in->bv_offset + ctx->offset_in, 1 << SECTOR_SHIFT);
+		kunmap_atomic(page_in, KM_USER0);
+		kunmap_atomic(page_out, KM_USER1);
+
+		sg_set_page(&dmreq->sg_in, bv_out->bv_page, 1 << SECTOR_SHIFT,
+			    bv_out->bv_offset + ctx->offset_out);
+	} else
+		sg_set_page(&dmreq->sg_in, bv_in->bv_page, 1 << SECTOR_SHIFT,
+			    bv_in->bv_offset + ctx->offset_in);
 
 	sg_init_table(&dmreq->sg_out, 1);
 	sg_set_page(&dmreq->sg_out, bv_out->bv_page, 1 << SECTOR_SHIFT,

[-- Attachment #3: Type: TEXT/PLAIN, Size: 18242 bytes --]

Add a flag that tells that the algorithm reads the data only once when
encrypting.

It will be used by dm-crypt to skip a copy operation if not needed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 crypto/aes_generic.c     |    2 +-
 crypto/anubis.c          |    2 +-
 crypto/arc4.c            |    2 +-
 crypto/blkcipher.c       |    2 +-
 crypto/blowfish.c        |    2 +-
 crypto/camellia.c        |    2 +-
 crypto/cast5.c           |    2 +-
 crypto/cast6.c           |    2 +-
 crypto/cbc.c             |    2 +-
 crypto/cryptd.c          |    2 +-
 crypto/crypto_null.c     |    2 +-
 crypto/ctr.c             |    2 +-
 crypto/des_generic.c     |    4 ++--
 crypto/ecb.c             |    2 +-
 crypto/fcrypt.c          |    2 +-
 crypto/khazad.c          |    2 +-
 crypto/lrw.c             |    2 +-
 crypto/pcbc.c            |    2 +-
 crypto/salsa20_generic.c |    2 +-
 crypto/seed.c            |    2 +-
 crypto/serpent.c         |    4 ++--
 crypto/tea.c             |    4 ++--
 crypto/twofish.c         |    2 +-
 crypto/xts.c             |    2 +-
 include/linux/crypto.h   |    9 +++++++++
 25 files changed, 36 insertions(+), 27 deletions(-)

Index: linux-2.6.25.3/include/linux/crypto.h
===================================================================
--- linux-2.6.25.3.orig/include/linux/crypto.h	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/include/linux/crypto.h	2008-05-21 18:54:43.000000000 +0200
@@ -59,6 +59,15 @@
 #define CRYPTO_ALG_GENIV		0x00000200
 
 /*
+ * The encryption algorithm reads the data only once. So it can encrypt
+ * data while they change.
+ *
+ * This is useful for disk encryption, so that the sector doesn't have
+ * to be copied to temporary space before encrypting.
+ */
+#define CRYPTO_ALG_ENCRYPT_READ_ONCE	0x00000400
+
+/*
  * Transform masks and values (for crt_flags).
  */
 #define CRYPTO_TFM_REQ_MASK		0x000fff00
Index: linux-2.6.25.3/crypto/cbc.c
===================================================================
--- linux-2.6.25.3.orig/crypto/cbc.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/cbc.c	2008-05-21 19:32:49.000000000 +0200
@@ -234,7 +234,7 @@ static struct crypto_instance *crypto_cb
 	if (IS_ERR(inst))
 		goto out_put_alg;
 
-	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER | (alg->cra_flags & CRYPTO_ALG_ENCRYPT_READ_ONCE);
 	inst->alg.cra_priority = alg->cra_priority;
 	inst->alg.cra_blocksize = alg->cra_blocksize;
 	inst->alg.cra_alignmask = alg->cra_alignmask;
Index: linux-2.6.25.3/crypto/ecb.c
===================================================================
--- linux-2.6.25.3.orig/crypto/ecb.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/ecb.c	2008-05-21 16:37:48.000000000 +0200
@@ -134,7 +134,7 @@ static struct crypto_instance *crypto_ec
 	if (IS_ERR(inst))
 		goto out_put_alg;
 
-	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER | (alg->cra_flags & CRYPTO_ALG_ENCRYPT_READ_ONCE);
 	inst->alg.cra_priority = alg->cra_priority;
 	inst->alg.cra_blocksize = alg->cra_blocksize;
 	inst->alg.cra_alignmask = alg->cra_alignmask;
Index: linux-2.6.25.3/crypto/ctr.c
===================================================================
--- linux-2.6.25.3.orig/crypto/ctr.c	2008-05-21 16:43:23.000000000 +0200
+++ linux-2.6.25.3/crypto/ctr.c	2008-05-21 16:48:35.000000000 +0200
@@ -200,7 +200,7 @@ static struct crypto_instance *crypto_ct
 	if (IS_ERR(inst))
 		goto out;
 
-	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE;
 	inst->alg.cra_priority = alg->cra_priority;
 	inst->alg.cra_blocksize = 1;
 	inst->alg.cra_alignmask = alg->cra_alignmask | (__alignof__(u32) - 1);
Index: linux-2.6.25.3/crypto/lrw.c
===================================================================
--- linux-2.6.25.3.orig/crypto/lrw.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/lrw.c	2008-05-21 17:37:11.000000000 +0200
@@ -247,7 +247,7 @@ static struct crypto_instance *alloc(str
 	if (IS_ERR(inst))
 		goto out_put_alg;
 
-	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE;
 	inst->alg.cra_priority = alg->cra_priority;
 	inst->alg.cra_blocksize = alg->cra_blocksize;
 
Index: linux-2.6.25.3/crypto/xts.c
===================================================================
--- linux-2.6.25.3.orig/crypto/xts.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/xts.c	2008-05-21 17:38:23.000000000 +0200
@@ -230,7 +230,7 @@ static struct crypto_instance *alloc(str
 	if (IS_ERR(inst))
 		goto out_put_alg;
 
-	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE;
 	inst->alg.cra_priority = alg->cra_priority;
 	inst->alg.cra_blocksize = alg->cra_blocksize;
 
Index: linux-2.6.25.3/crypto/aes_generic.c
===================================================================
--- linux-2.6.25.3.orig/crypto/aes_generic.c	2008-05-21 17:54:16.000000000 +0200
+++ linux-2.6.25.3/crypto/aes_generic.c	2008-05-21 17:54:27.000000000 +0200
@@ -434,7 +434,7 @@ static struct crypto_alg aes_alg = {
 	.cra_name		=	"aes",
 	.cra_driver_name	=	"aes-generic",
 	.cra_priority		=	100,
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	AES_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct crypto_aes_ctx),
 	.cra_alignmask		=	3,
Index: linux-2.6.25.3/crypto/cryptd.c
===================================================================
--- linux-2.6.25.3.orig/crypto/cryptd.c	2008-05-21 17:41:48.000000000 +0200
+++ linux-2.6.25.3/crypto/cryptd.c	2008-05-21 17:42:32.000000000 +0200
@@ -238,7 +238,7 @@ static struct crypto_instance *cryptd_al
 	if (IS_ERR(inst))
 		goto out_put_alg;
 
-	inst->alg.cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC;
+	inst->alg.cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC | (alg->cra_flags & CRYPTO_ALG_ENCRYPT_READ_ONCE);
 	inst->alg.cra_type = &crypto_ablkcipher_type;
 
 	inst->alg.cra_ablkcipher.ivsize = alg->cra_blkcipher.ivsize;
Index: linux-2.6.25.3/crypto/pcbc.c
===================================================================
--- linux-2.6.25.3.orig/crypto/pcbc.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/pcbc.c	2008-05-21 17:40:21.000000000 +0200
@@ -240,7 +240,7 @@ static struct crypto_instance *crypto_pc
 	if (IS_ERR(inst))
 		goto out_put_alg;
 
-	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER | (alg->cra_flags & CRYPTO_ALG_ENCRYPT_READ_ONCE);
 	inst->alg.cra_priority = alg->cra_priority;
 	inst->alg.cra_blocksize = alg->cra_blocksize;
 	inst->alg.cra_alignmask = alg->cra_alignmask;
Index: linux-2.6.25.3/crypto/salsa20_generic.c
===================================================================
--- linux-2.6.25.3.orig/crypto/salsa20_generic.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/salsa20_generic.c	2008-05-21 17:47:22.000000000 +0200
@@ -218,7 +218,7 @@ static struct crypto_alg alg = {
 	.cra_name           =   "salsa20",
 	.cra_driver_name    =   "salsa20-generic",
 	.cra_priority       =   100,
-	.cra_flags          =   CRYPTO_ALG_TYPE_BLKCIPHER,
+	.cra_flags          =   CRYPTO_ALG_TYPE_BLKCIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_type           =   &crypto_blkcipher_type,
 	.cra_blocksize      =   1,
 	.cra_ctxsize        =   sizeof(struct salsa20_ctx),
Index: linux-2.6.25.3/crypto/anubis.c
===================================================================
--- linux-2.6.25.3.orig/crypto/anubis.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/anubis.c	2008-05-21 17:56:04.000000000 +0200
@@ -673,7 +673,7 @@ static void anubis_decrypt(struct crypto
 
 static struct crypto_alg anubis_alg = {
 	.cra_name		=	"anubis",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	ANUBIS_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof (struct anubis_ctx),
 	.cra_alignmask		=	3,
Index: linux-2.6.25.3/crypto/arc4.c
===================================================================
--- linux-2.6.25.3.orig/crypto/arc4.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/arc4.c	2008-05-21 18:00:38.000000000 +0200
@@ -72,7 +72,7 @@ static void arc4_crypt(struct crypto_tfm
 
 static struct crypto_alg arc4_alg = {
 	.cra_name		=	"arc4",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	ARC4_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct arc4_ctx),
 	.cra_module		=	THIS_MODULE,
Index: linux-2.6.25.3/crypto/blowfish.c
===================================================================
--- linux-2.6.25.3.orig/crypto/blowfish.c	2008-05-21 18:00:38.000000000 +0200
+++ linux-2.6.25.3/crypto/blowfish.c	2008-05-21 18:00:55.000000000 +0200
@@ -451,7 +451,7 @@ static int bf_setkey(struct crypto_tfm *
 
 static struct crypto_alg alg = {
 	.cra_name		=	"blowfish",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	BF_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct bf_ctx),
 	.cra_alignmask		=	3,
Index: linux-2.6.25.3/crypto/camellia.c
===================================================================
--- linux-2.6.25.3.orig/crypto/camellia.c	2008-05-21 18:02:11.000000000 +0200
+++ linux-2.6.25.3/crypto/camellia.c	2008-05-21 18:02:24.000000000 +0200
@@ -1094,7 +1094,7 @@ static struct crypto_alg camellia_alg = 
 	.cra_name		=	"camellia",
 	.cra_driver_name	=	"camellia-generic",
 	.cra_priority		=	100,
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	CAMELLIA_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct camellia_ctx),
 	.cra_alignmask		=	3,
Index: linux-2.6.25.3/crypto/cast5.c
===================================================================
--- linux-2.6.25.3.orig/crypto/cast5.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/cast5.c	2008-05-21 18:03:39.000000000 +0200
@@ -800,7 +800,7 @@ static int cast5_setkey(struct crypto_tf
 
 static struct crypto_alg alg = {
 	.cra_name 	= "cast5",
-	.cra_flags 	= CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags 	= CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize 	= CAST5_BLOCK_SIZE,
 	.cra_ctxsize 	= sizeof(struct cast5_ctx),
 	.cra_alignmask	= 3,
Index: linux-2.6.25.3/crypto/cast6.c
===================================================================
--- linux-2.6.25.3.orig/crypto/cast6.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/cast6.c	2008-05-21 18:04:32.000000000 +0200
@@ -512,7 +512,7 @@ static void cast6_decrypt(struct crypto_
 
 static struct crypto_alg alg = {
 	.cra_name = "cast6",
-	.cra_flags = CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags = CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize = CAST6_BLOCK_SIZE,
 	.cra_ctxsize = sizeof(struct cast6_ctx),
 	.cra_alignmask = 3,
Index: linux-2.6.25.3/crypto/crypto_null.c
===================================================================
--- linux-2.6.25.3.orig/crypto/crypto_null.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/crypto_null.c	2008-05-21 18:05:41.000000000 +0200
@@ -123,7 +123,7 @@ static struct crypto_alg skcipher_null =
 	.cra_name		=	"ecb(cipher_null)",
 	.cra_driver_name	=	"ecb-cipher_null",
 	.cra_priority		=	100,
-	.cra_flags		=	CRYPTO_ALG_TYPE_BLKCIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_BLKCIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	NULL_BLOCK_SIZE,
 	.cra_type		=	&crypto_blkcipher_type,
 	.cra_ctxsize		=	0,
Index: linux-2.6.25.3/crypto/des_generic.c
===================================================================
--- linux-2.6.25.3.orig/crypto/des_generic.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/des_generic.c	2008-05-21 18:07:33.000000000 +0200
@@ -945,7 +945,7 @@ static void des3_ede_decrypt(struct cryp
 
 static struct crypto_alg des_alg = {
 	.cra_name		=	"des",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	DES_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct des_ctx),
 	.cra_module		=	THIS_MODULE,
@@ -961,7 +961,7 @@ static struct crypto_alg des_alg = {
 
 static struct crypto_alg des3_ede_alg = {
 	.cra_name		=	"des3_ede",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	DES3_EDE_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct des3_ede_ctx),
 	.cra_module		=	THIS_MODULE,
Index: linux-2.6.25.3/crypto/fcrypt.c
===================================================================
--- linux-2.6.25.3.orig/crypto/fcrypt.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/fcrypt.c	2008-05-21 18:08:16.000000000 +0200
@@ -391,7 +391,7 @@ static int fcrypt_setkey(struct crypto_t
 
 static struct crypto_alg fcrypt_alg = {
 	.cra_name		=	"fcrypt",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	8,
 	.cra_ctxsize		=	sizeof(struct fcrypt_ctx),
 	.cra_module		=	THIS_MODULE,
Index: linux-2.6.25.3/crypto/khazad.c
===================================================================
--- linux-2.6.25.3.orig/crypto/khazad.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/khazad.c	2008-05-21 18:09:07.000000000 +0200
@@ -848,7 +848,7 @@ static void khazad_decrypt(struct crypto
 
 static struct crypto_alg khazad_alg = {
 	.cra_name		=	"khazad",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	KHAZAD_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof (struct khazad_ctx),
 	.cra_alignmask		=	7,
Index: linux-2.6.25.3/crypto/seed.c
===================================================================
--- linux-2.6.25.3.orig/crypto/seed.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/seed.c	2008-05-21 18:09:52.000000000 +0200
@@ -444,7 +444,7 @@ static struct crypto_alg seed_alg = {
 	.cra_name		=	"seed",
 	.cra_driver_name	=	"seed-generic",
 	.cra_priority		=	100,
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	SEED_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct seed_ctx),
 	.cra_alignmask		=	3,
Index: linux-2.6.25.3/crypto/serpent.c
===================================================================
--- linux-2.6.25.3.orig/crypto/serpent.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/serpent.c	2008-05-21 18:12:08.000000000 +0200
@@ -475,7 +475,7 @@ static void serpent_decrypt(struct crypt
 
 static struct crypto_alg serpent_alg = {
 	.cra_name		=	"serpent",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	SERPENT_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct serpent_ctx),
 	.cra_alignmask		=	3,
@@ -543,7 +543,7 @@ static void tnepres_decrypt(struct crypt
 
 static struct crypto_alg tnepres_alg = {
 	.cra_name		=	"tnepres",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	SERPENT_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof(struct serpent_ctx),
 	.cra_alignmask		=	3,
Index: linux-2.6.25.3/crypto/tea.c
===================================================================
--- linux-2.6.25.3.orig/crypto/tea.c	2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.25.3/crypto/tea.c	2008-05-21 18:14:39.000000000 +0200
@@ -221,7 +221,7 @@ static void xeta_decrypt(struct crypto_t
 
 static struct crypto_alg tea_alg = {
 	.cra_name		=	"tea",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	TEA_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof (struct tea_ctx),
 	.cra_alignmask		=	3,
@@ -253,7 +253,7 @@ static struct crypto_alg xtea_alg = {
 
 static struct crypto_alg xeta_alg = {
 	.cra_name		=	"xeta",
-	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags		=	CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize		=	XTEA_BLOCK_SIZE,
 	.cra_ctxsize		=	sizeof (struct xtea_ctx),
 	.cra_alignmask		=	3,
Index: linux-2.6.25.3/crypto/twofish.c
===================================================================
--- linux-2.6.25.3.orig/crypto/twofish.c	2008-05-21 18:16:51.000000000 +0200
+++ linux-2.6.25.3/crypto/twofish.c	2008-05-21 18:16:59.000000000 +0200
@@ -183,7 +183,7 @@ static struct crypto_alg alg = {
 	.cra_name           =   "twofish",
 	.cra_driver_name    =   "twofish-generic",
 	.cra_priority       =   100,
-	.cra_flags          =   CRYPTO_ALG_TYPE_CIPHER,
+	.cra_flags          =   CRYPTO_ALG_TYPE_CIPHER | CRYPTO_ALG_ENCRYPT_READ_ONCE,
 	.cra_blocksize      =   TF_BLOCK_SIZE,
 	.cra_ctxsize        =   sizeof(struct twofish_ctx),
 	.cra_alignmask      =	3,
Index: linux-2.6.25.3/crypto/blkcipher.c
===================================================================
--- linux-2.6.25.3.orig/crypto/blkcipher.c	2008-05-21 19:43:31.000000000 +0200
+++ linux-2.6.25.3/crypto/blkcipher.c	2008-05-21 20:09:11.000000000 +0200
@@ -640,7 +640,7 @@ struct crypto_instance *skcipher_geniv_a
 	}
 
 	inst->alg.cra_flags = CRYPTO_ALG_TYPE_GIVCIPHER | CRYPTO_ALG_GENIV;
-	inst->alg.cra_flags |= alg->cra_flags & CRYPTO_ALG_ASYNC;
+	inst->alg.cra_flags |= alg->cra_flags & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ENCRYPT_READ_ONCE);
 	inst->alg.cra_priority = alg->cra_priority;
 	inst->alg.cra_blocksize = alg->cra_blocksize;
 	inst->alg.cra_alignmask = alg->cra_alignmask;

[-- Attachment #4: Type: TEXT/PLAIN, Size: 1116 bytes --]

Use the flag CRYPTO_ALG_ENCRYPT_READ_ONCE. If the flag is set, we don't have
to copy when encrypting and we can revert to old behavior.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-crypt.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.25.3/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.25.3.orig/drivers/md/dm-crypt.c	2008-05-21 20:09:05.000000000 +0200
+++ linux-2.6.25.3/drivers/md/dm-crypt.c	2008-05-21 20:12:03.000000000 +0200
@@ -361,7 +361,7 @@ static int crypt_convert_block(struct cr
 			 crypto_ablkcipher_alignmask(cc->tfm) + 1);
 
 	sg_init_table(&dmreq->sg_in, 1);
-	if (bio_data_dir(ctx->bio_in) == WRITE) {
+	if (bio_data_dir(ctx->bio_in) == WRITE && !(cc->tfm->base.__crt_alg->cra_flags & CRYPTO_ALG_ENCRYPT_READ_ONCE)) {
 		char *page_in = kmap_atomic(bv_in->bv_page, KM_USER0);
 		char *page_out = kmap_atomic(bv_out->bv_page, KM_USER1);
 		memcpy(page_out + bv_out->bv_offset + ctx->offset_out, page_in + bv_in->bv_offset + ctx->offset_in, 1 << SECTOR_SHIFT);

[-- Attachment #5: Type: TEXT/PLAIN, Size: 642 bytes --]

Remove an useless increment in arc4.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 crypto/arc4.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.25.3/crypto/arc4.c
===================================================================
--- linux-2.6.25.3.orig/crypto/arc4.c	2008-05-21 17:57:24.000000000 +0200
+++ linux-2.6.25.3/crypto/arc4.c	2008-05-21 17:58:28.000000000 +0200
@@ -64,7 +64,7 @@ static void arc4_crypt(struct crypto_tfm
 	S[x] = b;
 	S[y] = a;
 	x = (x + 1) & 0xff;
-	*out++ = *in ^ S[(a + b) & 0xff];
+	*out = *in ^ S[(a + b) & 0xff];
 
 	ctx->x = x;
 	ctx->y = y;

[-- Attachment #6: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [dm-devel] Desynchronizing dm-raid1
  2008-05-22  2:18                                         ` Mikulas Patocka
@ 2008-05-22  2:42                                           ` Herbert Xu
  2008-05-22 12:32                                             ` Mikulas Patocka
  0 siblings, 1 reply; 30+ messages in thread
From: Herbert Xu @ 2008-05-22  2:42 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Alasdair G Kergon, linux-crypto, dm-devel, malahal

On Wed, May 21, 2008 at 10:18:43PM -0400, Mikulas Patocka wrote:
>
> All the ciphers comply, so the bug is only a theroretical issue (but I 
> didn't check assembler versions --- they should be checked by the person 
> who wrote them, assembler is write-only language).

Since every current algorithm sets the flag could you invert
its sense? Sorry to have to do this to you :)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-05-22  2:42                                           ` [dm-devel] " Herbert Xu
@ 2008-05-22 12:32                                             ` Mikulas Patocka
  2008-05-22 23:53                                               ` [dm-devel] " Herbert Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2008-05-22 12:32 UTC (permalink / raw)
  To: Herbert Xu; +Cc: dm-devel, linux-crypto, Alasdair G Kergon

>> All the ciphers comply, so the bug is only a theroretical issue (but I
>> didn't check assembler versions --- they should be checked by the person
>> who wrote them, assembler is write-only language).
>
> Since every current algorithm sets the flag could you invert
> its sense? Sorry to have to do this to you :)
>
> Thanks,

There may be external modules.

If you don't set the flag when it should be set, nothing happens (just a 
slight performance drop), if you set the flag when it shouldn't be set, 
you get data corruption. So the safest way is this meaning of flag, so 
that not-yet-reviewed algorithms set the flag to 0 and prevent data 
corruption.

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [dm-devel] Desynchronizing dm-raid1
  2008-05-22 12:32                                             ` Mikulas Patocka
@ 2008-05-22 23:53                                               ` Herbert Xu
  2008-05-23 14:59                                                 ` Mikulas Patocka
  0 siblings, 1 reply; 30+ messages in thread
From: Herbert Xu @ 2008-05-22 23:53 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Alasdair G Kergon, linux-crypto, dm-devel, malahal

On Thu, May 22, 2008 at 08:32:45AM -0400, Mikulas Patocka wrote:
>
> There may be external modules.

Sorry but we don't support external modules.  They should be merged
upstream rather than distributed in the wild.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-05-22 23:53                                               ` [dm-devel] " Herbert Xu
@ 2008-05-23 14:59                                                 ` Mikulas Patocka
  2008-05-24  0:01                                                   ` Herbert Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2008-05-23 14:59 UTC (permalink / raw)
  To: Herbert Xu; +Cc: dm-devel, linux-crypto, Alasdair G Kergon

> On Thu, May 22, 2008 at 08:32:45AM -0400, Mikulas Patocka wrote:
>>
>> There may be external modules.
>
> Sorry but we don't support external modules.  They should be merged
> upstream rather than distributed in the wild.
>
> Cheers,

If you want to negate the meaning of the flag, then you have to write it 
yourself. I, as non-developer of crypto code, can prove that on given path 
the input data are read only once --- but I can't prove that on all paths 
and all possible chaining modes of algorithms the data are read once, 
because I don't know about all of them.

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-05-23 14:59                                                 ` Mikulas Patocka
@ 2008-05-24  0:01                                                   ` Herbert Xu
  2008-05-24 14:01                                                     ` Mikulas Patocka
  0 siblings, 1 reply; 30+ messages in thread
From: Herbert Xu @ 2008-05-24  0:01 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, linux-crypto, Alasdair G Kergon

On Fri, May 23, 2008 at 10:59:33AM -0400, Mikulas Patocka wrote:
>
> If you want to negate the meaning of the flag, then you have to write it 
> yourself. I, as non-developer of crypto code, can prove that on given path 
> the input data are read only once --- but I can't prove that on all paths 
> and all possible chaining modes of algorithms the data are read once, 
> because I don't know about all of them.

Huh? Inverting it would give exactly the same result as your current
patch.  If you're not confident with it inverted, then I can't see
how you could be confident about the patch as it is.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Desynchronizing dm-raid1
  2008-05-24  0:01                                                   ` Herbert Xu
@ 2008-05-24 14:01                                                     ` Mikulas Patocka
  0 siblings, 0 replies; 30+ messages in thread
From: Mikulas Patocka @ 2008-05-24 14:01 UTC (permalink / raw)
  To: Herbert Xu; +Cc: dm-devel, linux-crypto, Alasdair G Kergon

>> If you want to negate the meaning of the flag, then you have to write it
>> yourself. I, as non-developer of crypto code, can prove that on given path
>> the input data are read only once --- but I can't prove that on all paths
>> and all possible chaining modes of algorithms the data are read once,
>> because I don't know about all of them.
>
> Huh? Inverting it would give exactly the same result as your current
> patch.  If you're not confident with it inverted, then I can't see
> how you could be confident about the patch as it is.

If you don't set the bit when it should be set => you get slight 
performance drop (problem 1)

You set the bit when it shouldn't be set => you get data corruption 
(problem 2)


Problem 1 is acceptable, problem 2 is not.

So if I forgot some code path when the bit should be set, I created 
problem 1, but I am sure that I didn't create problem 2.


If you invert the meaning of the bit, you risk creating problem 2. And you 
need more review (by someone who understands what all structures are being 
created in crypto code).

Mikulas

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2008-05-24 14:01 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.64.0803131238520.23136@porkchop.devel.redhat.com>
     [not found] ` <Pine.LNX.4.64.0803171338130.7598@porkchop.devel.redhat.com>
     [not found]   ` <47DEC402.10309@redhat.com>
     [not found]     ` <Pine.LNX.4.64.0803171611430.31803@porkchop.devel.redhat.com>
     [not found]       ` <20080317215631.GG29322@agk.fab.redhat.com>
     [not found]         ` <Pine.LNX.4.64.0803181921380.10939@porkchop.devel.redhat.com>
     [not found]           ` <20080318233955.GA12007@agk.fab.redhat.com>
     [not found]             ` <Pine.LNX.4.64.0803181942140.20490@porkchop.devel.redhat.com>
     [not found]               ` <20080319000241.GB12007@agk.fab.redhat.com>
     [not found]                 ` <Pine.LNX.4.64.0803182015001.21077@porkchop.devel.redhat.com>
     [not found]                   ` <20080319011757.GD12007@agk.fab.redhat.com>
2008-04-02 20:23                     ` Desynchronizing dm-raid1 Mikulas Patocka
2008-04-02 22:13                       ` Mikulas Patocka
2008-04-03  1:40                       ` malahal
2008-04-03 14:49                         ` Martin K. Petersen
2008-04-07 17:05                         ` Martin K. Petersen
2008-04-07 17:22                           ` malahal
2008-04-07 17:44                             ` Martin K. Petersen
2008-05-05 21:45                           ` Mikulas Patocka
2008-05-06 10:29                             ` [dm-devel] " Herbert Xu
2008-05-06 10:29                               ` Herbert Xu
2008-05-06 22:50                               ` Mikulas Patocka
2008-05-13  3:28                                 ` Mikulas Patocka
2008-05-13  3:38                                   ` [dm-devel] " Herbert Xu
2008-05-13 20:35                                     ` Mikulas Patocka
2008-05-14  1:14                                       ` [dm-devel] " Herbert Xu
2008-05-22  2:18                                         ` Mikulas Patocka
2008-05-22  2:42                                           ` [dm-devel] " Herbert Xu
2008-05-22 12:32                                             ` Mikulas Patocka
2008-05-22 23:53                                               ` [dm-devel] " Herbert Xu
2008-05-23 14:59                                                 ` Mikulas Patocka
2008-05-24  0:01                                                   ` Herbert Xu
2008-05-24 14:01                                                     ` Mikulas Patocka
2008-04-07 23:31                         ` Mikulas Patocka
2008-04-03  9:19                       ` Heinz Mauelshagen
2008-04-03 14:21                         ` malahal
2008-04-07 14:25                           ` Heinz Mauelshagen
2008-04-07 15:41                             ` malahal
2008-04-07 23:27                               ` Mikulas Patocka
2008-04-08  0:04                                 ` malahal
2008-04-07 23:38                         ` Mikulas Patocka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.