From: Bill Davidsen <davidsen@tmr.com>
To: Michael Evans <mjevans1983@gmail.com>
Cc: Asdo <asdo@shiftmail.org>, Neil Brown <neilb@suse.de>,
	Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>,
	Steven Haigh <netwiz@crc.id.au>,
	Bryan Mesich <bryan.mesich@ndsu.edu>,
	Jon@ehardcastle.com, linux-raid@vger.kernel.org
Subject: Re: Why does one get mismatches?
Date: Sat, 27 Feb 2010 19:01:20 -0500
Message-ID: <4B89B250.1060707@tmr.com>
In-Reply-To: <4877c76c1002262201h31051c44r9d756e4969a71fbb@mail.gmail.com>

Michael Evans wrote:
> On Fri, Feb 26, 2010 at 2:20 PM, Asdo <asdo@shiftmail.org> wrote:
>   
>> Neil Brown wrote:
>>     
>>> Actually, I'm no longer convinced that the checksumming idea would work.
>>> If a mem-mapped page were written, that the app is updating every
>>> millisecond (i.e. less than the write latency), then every time a write
>>> completed the checksum would be different so we would have to reschedule
>>> the write, which would not be the correct behaviour at all.
>>> So I think that the only way to address this in the md layer is to copy
>>> the data and write the copy.  There is already code to copy the data for
>>> write-behind that could possibly be leveraged to do a copy always.
>>>
>>>       
>> The concerns about slowdowns from copying could be addressed by making the
>> copy a runtime choice triggered by a sysctl interface, e.g. a file in
>> /sys/block/mdX/md/ where one can echo "1" to enable copies for this type of
>> raid. Or better, 1 could be the default (slower but safer, or if not safer,
>> at least it avoids needless questions about mismatches on this ML from new
>> users, and allows detection of REAL mismatches, which can be due to cabling
>> or defective disks) and echoing 0 would increase performance at the cost of
>> seeing lots of false-positive mismatches.
>
> Isn't there some way of making the page copy-on-write using hardware
> and/or an in-kernel structure?  Ideally copying could be avoided
> /unless/ there is change.  That way each operation looks like an
> atomic commit.
>   

Thinking about this, one idea was to add a write-in-progress flag, so 
that the filesystem, or library, or whatever would know not to change 
the page. That would mean that every filesystem would need to be 
enhanced, or that the "safe write" would be optional on a per-filesystem 
basis. An implementation of O_DIRECT could honor it, or not, and there 
would still be a safe way to write.
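
A rough sketch of how such an advisory flag might behave, purely as an
illustration. Every name here (Page, begin_write, end_write, modify,
honor_flag) is hypothetical; nothing like this exists in md, and a real
implementation would live in the kernel, not in Python.

```python
class Page:
    """A buffer plus the advisory flag (all names are hypothetical)."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.write_in_progress = False
        self.deferred = []  # updates held back while the flag is set


def begin_write(page):
    # In this sketch md would set the flag when it starts the write.
    page.write_in_progress = True


def end_write(page):
    # Clear the flag on completion, then apply the updates that were held back.
    page.write_in_progress = False
    for offset, value in page.deferred:
        page.data[offset] = value
    page.deferred.clear()


def modify(page, offset, value, honor_flag=True):
    """A cooperating filesystem defers the update; one that ignores the flag does not."""
    if honor_flag and page.write_in_progress:
        page.deferred.append((offset, value))
    else:
        page.data[offset] = value
```

A caller that passes honor_flag=False changes the page while the write is
in flight, which is exactly the case the flag cannot prevent: it is
advisory only.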

However, it occurs to me that there are several other levels involved, 
so it could be better but not perfect. While md could flag the start 
and finish of a write, you then need the next level, the device driver, 
to do the same thing, so md knows when the data no longer need to be 
frozen. "But wait, there's more," as they say: the device driver needs 
to track when the data are transferred to the actual device, and the 
device needs to report when the data actually hit the platter, or you 
could still have mismatches.

All of that is reminiscent of the discussion of barriers, flush cache 
commands, and other performance-impacting practices. So in the long run 
I think the most effective solution, the one which gives the biggest 
improvement at the lowest cost in performance, is a copy. Now if Neil 
liked my idea of doing a checksum before and after a write, and copying 
only in the cases where the data had changed, the impact could be pretty 
small.

All that depends on two things: Neil thinking the whole thing is worth 
doing, and no one finding a flaw in my proposal to do a checksum rather 
than a copy each time.
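
To make the checksum-before-and-after idea concrete, here is a minimal
user-space sketch. It is a simulation under stated assumptions, not md
code: mirrored_write, checked_write and the `between` hook are made-up
names, the "mirrors" are just list slots, and zlib.crc32 stands in for
whatever checksum would really be used.

```python
import zlib


def mirrored_write(buf, mirrors, between=None):
    """Copy buf to each mirror in turn.

    `between` is a hypothetical hook that simulates the application
    touching the buffer after the first mirror has been written.
    """
    for i in range(len(mirrors)):
        mirrors[i] = bytes(buf)
        if between is not None and i == 0:
            between(buf)


def checked_write(buf, mirrors, between=None):
    """Write buf to all mirrors; redo the write from a private copy only
    if the buffer changed while the write was in flight.

    Returns True when the fallback copy was needed.
    """
    before = zlib.crc32(buf)
    mirrored_write(buf, mirrors, between)
    if zlib.crc32(buf) == before:
        return False                   # buffer was stable: no copy made
    snapshot = bytes(buf)              # buffer changed under us: freeze it
    mirrored_write(snapshot, mirrors)  # rewrite every mirror from the copy
    return True
```

In the common case the cost is two checksums and no copy. One limitation
of this sketch: a buffer that changes and then changes back between the
two checksums would go unnoticed, and the mirrors could still differ.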

And to return to your original question: no. Hardware COW works on 
whole memory pages. A buffer could span pages, and a write to a page 
might not land in the part of the page used for the i/o buffer. So as 
nice as that would be, I don't think the hardware supports it. And even 
if it did, the COW needs to be done in the layer which tries to change 
the buffer, so md would set COW and the filesystem would have to deal 
with it. I am pretty sure that's a layering violation, big time. The 
advisory "write in progress" flag might be acceptable, since it's 
information the f/s can use or not.
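
The page-granularity problem is just arithmetic, and a small sketch may
help. The 4096-byte page size and both function names are assumptions
for illustration, and the buffer length is assumed to be greater than
zero.

```python
PAGE_SIZE = 4096  # assumed page size; real systems vary


def pages_spanned(buf_start, buf_len, page_size=PAGE_SIZE):
    """Return the range of page numbers touched by [buf_start, buf_start + buf_len)."""
    first = buf_start // page_size
    last = (buf_start + buf_len - 1) // page_size
    return range(first, last + 1)


def cow_fault_is_spurious(write_addr, buf_start, buf_len, page_size=PAGE_SIZE):
    """True if a write hits a protected page but lands outside the i/o buffer."""
    protected = pages_spanned(buf_start, buf_len, page_size)
    on_protected_page = (write_addr // page_size) in protected
    inside_buffer = buf_start <= write_addr < buf_start + buf_len
    return on_protected_page and not inside_buffer
```

For example, a 512-byte buffer starting at offset 4000 straddles pages 0
and 1, so both pages would have to be protected, and a write at offset
100 would trigger the COW machinery even though it never touches the
buffer.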

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein

