linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [Lsf] Preliminary Agenda and Activities for LSF
       [not found] <1301373398.2590.20.camel@mulgrave.site>
@ 2011-03-29 11:16 ` Ric Wheeler
  2011-03-29 11:22   ` Matthew Wilcox
                     ` (4 more replies)
  0 siblings, 5 replies; 43+ messages in thread
From: Ric Wheeler @ 2011-03-29 11:16 UTC (permalink / raw)
  To: James Bottomley
  Cc: lsf, linux-fsdevel, linux-scsi@vger.kernel.org,
	device-mapper development

On 03/29/2011 12:36 AM, James Bottomley wrote:
> Hi All,
>
> Since LSF is less than a week away, the programme committee put together
> a just in time preliminary agenda for LSF.  As you can see there is
> still plenty of empty space, which you can make suggestions (to this
> list with appropriate general list cc's) for filling:
>
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html
>
> If you don't make suggestions, the programme committee will feel
> empowered to make arbitrary assignments based on your topic and attendee
> email requests ...
>
> We're still not quite sure what rooms we will have at the Kabuki, but
> we'll add them to the spreadsheet when we know (they should be close to
> each other).
>
> The spreadsheet above also gives contact information for all the
> attendees and the programme committee.
>
> Yours,
>
> James Bottomley
> on behalf of LSF/MM Programme Committee
>

Here are a few topic ideas:

(1)  The first topic that might span IO & FS tracks (or just pull in device 
mapper people to an FS track) could be adding new commands that would allow 
users to grow/shrink/etc file systems in a generic way.  The thought I had was 
that we have a reasonable model that we could reuse for these new commands like 
mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it 
could be nice to identify exactly what common operations users want to do and 
agree on how to implement them. Alasdair pointed out in the upstream thread that 
we had a prototype here in fsadm.

(2) Very high speed, low latency SSD devices and testing. Have we settled on the 
need for these devices to all have block level drivers? For S-ATA or SAS 
devices, are there known performance issues that require enhancements in 
somewhere in the stack?

(3) The union mount versus overlayfs debate - pros and cons. What each do well, 
what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes 
in Al's VFS session?)

Thanks!

Ric


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` [Lsf] Preliminary Agenda and Activities for LSF Ric Wheeler
@ 2011-03-29 11:22   ` Matthew Wilcox
  2011-03-29 12:17     ` Jens Axboe
  2011-03-29 17:20   ` Shyam_Iyer
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2011-03-29 11:22 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org,
	device-mapper development

On Tue, Mar 29, 2011 at 07:16:32AM -0400, Ric Wheeler wrote:
> (2) Very high speed, low latency SSD devices and testing. Have we settled 
> on the need for these devices to all have block level drivers? For S-ATA 
> or SAS devices, are there known performance issues that require 
> enhancements in somewhere in the stack?

I can throw together a quick presentation on this topic.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:22   ` Matthew Wilcox
@ 2011-03-29 12:17     ` Jens Axboe
  2011-03-29 13:09       ` Martin K. Petersen
  0 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2011-03-29 12:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf, linux-fsdevel, device-mapper development, Ric Wheeler,
	linux-scsi@vger.kernel.org

On 2011-03-29 13:22, Matthew Wilcox wrote:
> On Tue, Mar 29, 2011 at 07:16:32AM -0400, Ric Wheeler wrote:
>> (2) Very high speed, low latency SSD devices and testing. Have we settled 
>> on the need for these devices to all have block level drivers? For S-ATA 
>> or SAS devices, are there known performance issues that require 
>> enhancements in somewhere in the stack?
> 
> I can throw together a quick presentation on this topic.

I'll join that too.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 12:17     ` Jens Axboe
@ 2011-03-29 13:09       ` Martin K. Petersen
  2011-03-29 13:12         ` Ric Wheeler
  2011-03-29 13:38         ` James Bottomley
  0 siblings, 2 replies; 43+ messages in thread
From: Martin K. Petersen @ 2011-03-29 13:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox, lsf, linux-fsdevel, device-mapper development,
	Ric Wheeler, linux-scsi@vger.kernel.org

>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:

>> I can throw together a quick presentation on this topic.

Jens> I'll join that too.

Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll
cover what's going on with the SCSI over PCIe efforts...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 13:09       ` Martin K. Petersen
@ 2011-03-29 13:12         ` Ric Wheeler
  2011-03-29 13:38         ` James Bottomley
  1 sibling, 0 replies; 43+ messages in thread
From: Ric Wheeler @ 2011-03-29 13:12 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, linux-scsi@vger.kernel.org, lsf,
	device-mapper development, linux-fsdevel, Ric Wheeler

On 03/29/2011 09:09 AM, Martin K. Petersen wrote:
>
> Jens>  I'll join that too.
>
> Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll
> cover what's going on with the SCSI over PCIe efforts...

That sounds interesting to me...

Ric


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 13:09       ` Martin K. Petersen
  2011-03-29 13:12         ` Ric Wheeler
@ 2011-03-29 13:38         ` James Bottomley
  1 sibling, 0 replies; 43+ messages in thread
From: James Bottomley @ 2011-03-29 13:38 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, Matthew Wilcox, lsf, linux-fsdevel,
	device-mapper development, Ric Wheeler,
	linux-scsi@vger.kernel.org

On Tue, 2011-03-29 at 09:09 -0400, Martin K. Petersen wrote:
> >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
> 
> >> I can throw together a quick presentation on this topic.
> 
> Jens> I'll join that too.
> 
> Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll
> cover what's going on with the SCSI over PCIe efforts...

OK, I put you down for a joint sessions with FS and IO after the tea
break on Tuesday.

James




^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` [Lsf] Preliminary Agenda and Activities for LSF Ric Wheeler
  2011-03-29 11:22   ` Matthew Wilcox
@ 2011-03-29 17:20   ` Shyam_Iyer
  2011-03-29 17:33     ` Vivek Goyal
  2011-03-29 19:47   ` Nicholas A. Bellinger
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 43+ messages in thread
From: Shyam_Iyer @ 2011-03-29 17:20 UTC (permalink / raw)
  To: rwheeler, James.Bottomley; +Cc: lsf, linux-fsdevel, linux-scsi, dm-devel



> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, March 29, 2011 7:17 AM
> To: James Bottomley
> Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> scsi@vger.kernel.org; device-mapper development
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> > Hi All,
> >
> > Since LSF is less than a week away, the programme committee put
> together
> > a just in time preliminary agenda for LSF.  As you can see there is
> > still plenty of empty space, which you can make suggestions (to this
> > list with appropriate general list cc's) for filling:
> >
> >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> >
> > If you don't make suggestions, the programme committee will feel
> > empowered to make arbitrary assignments based on your topic and
> attendee
> > email requests ...
> >
> > We're still not quite sure what rooms we will have at the Kabuki, but
> > we'll add them to the spreadsheet when we know (they should be close
> to
> > each other).
> >
> > The spreadsheet above also gives contact information for all the
> > attendees and the programme committee.
> >
> > Yours,
> >
> > James Bottomley
> > on behalf of LSF/MM Programme Committee
> >
> 
> Here are a few topic ideas:
> 
> (1)  The first topic that might span IO & FS tracks (or just pull in
> device
> mapper people to an FS track) could be adding new commands that would
> allow
> users to grow/shrink/etc file systems in a generic way.  The thought I
> had was
> that we have a reasonable model that we could reuse for these new
> commands like
> mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> road, it
> could be nice to identify exactly what common operations users want to
> do and
> agree on how to implement them. Alasdair pointed out in the upstream
> thread that
> we had a prototype here in fsadm.
> 
> (2) Very high speed, low latency SSD devices and testing. Have we
> settled on the
> need for these devices to all have block level drivers? For S-ATA or
> SAS
> devices, are there known performance issues that require enhancements
> in
> somewhere in the stack?
> 
> (3) The union mount versus overlayfs debate - pros and cons. What each
> do well,
> what needs doing. Do we want/need both upstream? (Maybe this can get 10
> minutes
> in Al's VFS session?)
> 
> Thanks!
> 
> Ric

A few others that I think may span across I/O, Block fs..layers.

1) Dm-thinp target vs File system thin profile vs block map based thin/trim profile. Facilitate I/O throttling for thin/trimmable storage. Online and Offline profil.
2) Interfaces for SCSI, Ethernet/*transport configuration parameters floating around in sysfs, procfs. Architecting guidelines for accepting patches for hybrid devices.
3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for all and they have to help each other
4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your subsystem and there are many non-cooperating B/W control constructs in each subsystem.

-Shyam

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 17:20   ` Shyam_Iyer
@ 2011-03-29 17:33     ` Vivek Goyal
  2011-03-29 18:10       ` Shyam_Iyer
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-03-29 17:33 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel,
	linux-scsi

On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote:
> 
> 
> > -----Original Message-----
> > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > Sent: Tuesday, March 29, 2011 7:17 AM
> > To: James Bottomley
> > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > scsi@vger.kernel.org; device-mapper development
> > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > 
> > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > Hi All,
> > >
> > > Since LSF is less than a week away, the programme committee put
> > together
> > > a just in time preliminary agenda for LSF.  As you can see there is
> > > still plenty of empty space, which you can make suggestions (to this
> > > list with appropriate general list cc's) for filling:
> > >
> > >
> > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > >
> > > If you don't make suggestions, the programme committee will feel
> > > empowered to make arbitrary assignments based on your topic and
> > attendee
> > > email requests ...
> > >
> > > We're still not quite sure what rooms we will have at the Kabuki, but
> > > we'll add them to the spreadsheet when we know (they should be close
> > to
> > > each other).
> > >
> > > The spreadsheet above also gives contact information for all the
> > > attendees and the programme committee.
> > >
> > > Yours,
> > >
> > > James Bottomley
> > > on behalf of LSF/MM Programme Committee
> > >
> > 
> > Here are a few topic ideas:
> > 
> > (1)  The first topic that might span IO & FS tracks (or just pull in
> > device
> > mapper people to an FS track) could be adding new commands that would
> > allow
> > users to grow/shrink/etc file systems in a generic way.  The thought I
> > had was
> > that we have a reasonable model that we could reuse for these new
> > commands like
> > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> > road, it
> > could be nice to identify exactly what common operations users want to
> > do and
> > agree on how to implement them. Alasdair pointed out in the upstream
> > thread that
> > we had a prototype here in fsadm.
> > 
> > (2) Very high speed, low latency SSD devices and testing. Have we
> > settled on the
> > need for these devices to all have block level drivers? For S-ATA or
> > SAS
> > devices, are there known performance issues that require enhancements
> > in
> > somewhere in the stack?
> > 
> > (3) The union mount versus overlayfs debate - pros and cons. What each
> > do well,
> > what needs doing. Do we want/need both upstream? (Maybe this can get 10
> > minutes
> > in Al's VFS session?)
> > 
> > Thanks!
> > 
> > Ric
> 
> A few others that I think may span across I/O, Block fs..layers.
> 
> 1) Dm-thinp target vs File system thin profile vs block map based thin/trim profile.

> Facilitate I/O throttling for thin/trimmable storage. Online and Offline profil.

Is above any different from block IO throttling we have got for block
devices?

> 2) Interfaces for SCSI, Ethernet/*transport configuration parameters floating around in sysfs, procfs. Architecting guidelines for accepting patches for hybrid devices.
> 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for all and they have to help each other
> 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your subsystem and there are many non-cooperating B/W control constructs in each subsystem.

Above is pretty generic. Do you have specific needs/ideas/concerns?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 17:33     ` Vivek Goyal
@ 2011-03-29 18:10       ` Shyam_Iyer
  2011-03-29 18:45         ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Shyam_Iyer @ 2011-03-29 18:10 UTC (permalink / raw)
  To: vgoyal; +Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel,
	linux-scsi



> -----Original Message-----
> From: Vivek Goyal [mailto:vgoyal@redhat.com]
> Sent: Tuesday, March 29, 2011 1:34 PM
> To: Iyer, Shyam
> Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> devel@redhat.com; linux-scsi@vger.kernel.org
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote:
> >
> >
> > > -----Original Message-----
> > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > > Sent: Tuesday, March 29, 2011 7:17 AM
> > > To: James Bottomley
> > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > > scsi@vger.kernel.org; device-mapper development
> > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > >
> > > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > > Hi All,
> > > >
> > > > Since LSF is less than a week away, the programme committee put
> > > together
> > > > a just in time preliminary agenda for LSF.  As you can see there
> is
> > > > still plenty of empty space, which you can make suggestions (to
> this
> > > > list with appropriate general list cc's) for filling:
> > > >
> > > >
> > >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > > >
> > > > If you don't make suggestions, the programme committee will feel
> > > > empowered to make arbitrary assignments based on your topic and
> > > attendee
> > > > email requests ...
> > > >
> > > > We're still not quite sure what rooms we will have at the Kabuki,
> but
> > > > we'll add them to the spreadsheet when we know (they should be
> close
> > > to
> > > > each other).
> > > >
> > > > The spreadsheet above also gives contact information for all the
> > > > attendees and the programme committee.
> > > >
> > > > Yours,
> > > >
> > > > James Bottomley
> > > > on behalf of LSF/MM Programme Committee
> > > >
> > >
> > > Here are a few topic ideas:
> > >
> > > (1)  The first topic that might span IO & FS tracks (or just pull
> in
> > > device
> > > mapper people to an FS track) could be adding new commands that
> would
> > > allow
> > > users to grow/shrink/etc file systems in a generic way.  The
> thought I
> > > had was
> > > that we have a reasonable model that we could reuse for these new
> > > commands like
> > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> > > road, it
> > > could be nice to identify exactly what common operations users want
> to
> > > do and
> > > agree on how to implement them. Alasdair pointed out in the
> upstream
> > > thread that
> > > we had a prototype here in fsadm.
> > >
> > > (2) Very high speed, low latency SSD devices and testing. Have we
> > > settled on the
> > > need for these devices to all have block level drivers? For S-ATA
> or
> > > SAS
> > > devices, are there known performance issues that require
> enhancements
> > > in
> > > somewhere in the stack?
> > >
> > > (3) The union mount versus overlayfs debate - pros and cons. What
> each
> > > do well,
> > > what needs doing. Do we want/need both upstream? (Maybe this can
> get 10
> > > minutes
> > > in Al's VFS session?)
> > >
> > > Thanks!
> > >
> > > Ric
> >
> > A few others that I think may span across I/O, Block fs..layers.
> >
> > 1) Dm-thinp target vs File system thin profile vs block map based
> thin/trim profile.
> 
> > Facilitate I/O throttling for thin/trimmable storage. Online and
> Offline profil.
> 
> Is above any different from block IO throttling we have got for block
> devices?
> 
Yes.. so the throttling would be capacity  based.. when the storage array wants us to throttle the I/O. Depending on the event we may keep getting space allocation write protect check conditions for writes until a user intervenes to stop I/O.


> > 2) Interfaces for SCSI, Ethernet/*transport configuration parameters
> floating around in sysfs, procfs. Architecting guidelines for accepting
> patches for hybrid devices.
> > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for
> all and they have to help each other

For instance if you took a DM snapshot and the storage sent a check condition to the original dm device I am not sure if the DM snapshot would get one too..

If you had a scenario of taking H/W snapshot of an entire pool and decide to delete the individual DM snapshots the H/W snapshot would be inconsistent.

The blocks being managed by a DM-device would have moved (SCSI referrals). I believe Hannes is working on the referrals piece.. 

> > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your
> subsystem and there are many non-cooperating B/W control constructs in
> each subsystem.
> 
> Above is pretty generic. Do you have specific needs/ideas/concerns?
> 
> Thanks
> Vivek
Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O b/w via cgroups. Such bandwidth manipulations are network switch driven and cgroups never take care of these events from the Ethernet driver.

The TC classes route the network I/O to multiqueue groups and so theoretically you could have block queues 1:1 with the number of network multiqueues..

-Shyam

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 18:10       ` Shyam_Iyer
@ 2011-03-29 18:45         ` Vivek Goyal
  2011-03-29 19:13           ` Shyam_Iyer
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-03-29 18:45 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel,
	linux-scsi

On Tue, Mar 29, 2011 at 11:10:18AM -0700, Shyam_Iyer@Dell.com wrote:
> 
> 
> > -----Original Message-----
> > From: Vivek Goyal [mailto:vgoyal@redhat.com]
> > Sent: Tuesday, March 29, 2011 1:34 PM
> > To: Iyer, Shyam
> > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> > devel@redhat.com; linux-scsi@vger.kernel.org
> > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > 
> > On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > > > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > > > Sent: Tuesday, March 29, 2011 7:17 AM
> > > > To: James Bottomley
> > > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > > > scsi@vger.kernel.org; device-mapper development
> > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > > >
> > > > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > > > Hi All,
> > > > >
> > > > > Since LSF is less than a week away, the programme committee put
> > > > together
> > > > > a just in time preliminary agenda for LSF.  As you can see there
> > is
> > > > > still plenty of empty space, which you can make suggestions (to
> > this
> > > > > list with appropriate general list cc's) for filling:
> > > > >
> > > > >
> > > >
> > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > > > >
> > > > > If you don't make suggestions, the programme committee will feel
> > > > > empowered to make arbitrary assignments based on your topic and
> > > > attendee
> > > > > email requests ...
> > > > >
> > > > > We're still not quite sure what rooms we will have at the Kabuki,
> > but
> > > > > we'll add them to the spreadsheet when we know (they should be
> > close
> > > > to
> > > > > each other).
> > > > >
> > > > > The spreadsheet above also gives contact information for all the
> > > > > attendees and the programme committee.
> > > > >
> > > > > Yours,
> > > > >
> > > > > James Bottomley
> > > > > on behalf of LSF/MM Programme Committee
> > > > >
> > > >
> > > > Here are a few topic ideas:
> > > >
> > > > (1)  The first topic that might span IO & FS tracks (or just pull
> > in
> > > > device
> > > > mapper people to an FS track) could be adding new commands that
> > would
> > > > allow
> > > > users to grow/shrink/etc file systems in a generic way.  The
> > thought I
> > > > had was
> > > > that we have a reasonable model that we could reuse for these new
> > > > commands like
> > > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the
> > > > road, it
> > > > could be nice to identify exactly what common operations users want
> > to
> > > > do and
> > > > agree on how to implement them. Alasdair pointed out in the
> > upstream
> > > > thread that
> > > > we had a prototype here in fsadm.
> > > >
> > > > (2) Very high speed, low latency SSD devices and testing. Have we
> > > > settled on the
> > > > need for these devices to all have block level drivers? For S-ATA
> > or
> > > > SAS
> > > > devices, are there known performance issues that require
> > enhancements
> > > > in
> > > > somewhere in the stack?
> > > >
> > > > (3) The union mount versus overlayfs debate - pros and cons. What
> > each
> > > > do well,
> > > > what needs doing. Do we want/need both upstream? (Maybe this can
> > get 10
> > > > minutes
> > > > in Al's VFS session?)
> > > >
> > > > Thanks!
> > > >
> > > > Ric
> > >
> > > A few others that I think may span across I/O, Block fs..layers.
> > >
> > > 1) Dm-thinp target vs File system thin profile vs block map based
> > thin/trim profile.
> > 
> > > Facilitate I/O throttling for thin/trimmable storage. Online and
> > Offline profil.
> > 
> > Is above any different from block IO throttling we have got for block
> > devices?
> > 
> Yes.. so the throttling would be capacity  based.. when the storage array wants us to throttle the I/O. Depending on the event we may keep getting space allocation write protect check conditions for writes until a user intervenes to stop I/O.
> 

Sounds like some user space daemon listening for these events and then
modifying cgroup throttling limits dynamically?

> 
> > > 2) Interfaces for SCSI, Ethernet/*transport configuration parameters
> > floating around in sysfs, procfs. Architecting guidelines for accepting
> > patches for hybrid devices.
> > > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for
> > all and they have to help each other
> 
> For instance if you took a DM snapshot and the storage sent a check condition to the original dm device I am not sure if the DM snapshot would get one too..
> 
> If you had a scenario of taking H/W snapshot of an entire pool and decide to delete the individual DM snapshots the H/W snapshot would be inconsistent.
> 
> The blocks being managed by a DM-device would have moved (SCSI referrals). I believe Hannes is working on the referrals piece.. 
> 
> > > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your
> > subsystem and there are many non-cooperating B/W control constructs in
> > each subsystem.
> > 
> > Above is pretty generic. Do you have specific needs/ideas/concerns?
> > 
> > Thanks
> > Vivek
> Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O b/w via cgroups. Such bandwidth manipulations are network switch driven and cgroups never take care of these events from the Ethernet driver.

So if IO is going over network and actual bandwidth control is taking
place by throttling ethernet traffic then one does not have to specify
block cgroup throttling policy and hence no need for cgroups to be worried
about ethernet driver events?

I think I am missing something here.

Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 18:45         ` Vivek Goyal
@ 2011-03-29 19:13           ` Shyam_Iyer
  2011-03-29 19:57             ` Vivek Goyal
  2011-03-29 19:59             ` Mike Snitzer
  0 siblings, 2 replies; 43+ messages in thread
From: Shyam_Iyer @ 2011-03-29 19:13 UTC (permalink / raw)
  To: vgoyal; +Cc: lsf, linux-scsi, dm-devel, linux-fsdevel, rwheeler



> -----Original Message-----
> From: Vivek Goyal [mailto:vgoyal@redhat.com]
> Sent: Tuesday, March 29, 2011 2:45 PM
> To: Iyer, Shyam
> Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> devel@redhat.com; linux-scsi@vger.kernel.org
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29, 2011 at 11:10:18AM -0700, Shyam_Iyer@Dell.com wrote:
> >
> >
> > > -----Original Message-----
> > > From: Vivek Goyal [mailto:vgoyal@redhat.com]
> > > Sent: Tuesday, March 29, 2011 1:34 PM
> > > To: Iyer, Shyam
> > > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com;
> > > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm-
> > > devel@redhat.com; linux-scsi@vger.kernel.org
> > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > >
> > > On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com
> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > > > > owner@vger.kernel.org] On Behalf Of Ric Wheeler
> > > > > Sent: Tuesday, March 29, 2011 7:17 AM
> > > > > To: James Bottomley
> > > > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux-
> > > > > scsi@vger.kernel.org; device-mapper development
> > > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> > > > >
> > > > > On 03/29/2011 12:36 AM, James Bottomley wrote:
> > > > > > Hi All,
> > > > > >
> > > > > > Since LSF is less than a week away, the programme committee
> put
> > > > > together
> > > > > > a just in time preliminary agenda for LSF.  As you can see
> there
> > > is
> > > > > > still plenty of empty space, which you can make suggestions
> (to
> > > this
> > > > > > list with appropriate general list cc's) for filling:
> > > > > >
> > > > > >
> > > > >
> > >
> https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz
> > > > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> > > > > >
> > > > > > If you don't make suggestions, the programme committee will
> feel
> > > > > > empowered to make arbitrary assignments based on your topic
> and
> > > > > attendee
> > > > > > email requests ...
> > > > > >
> > > > > > We're still not quite sure what rooms we will have at the
> Kabuki,
> > > but
> > > > > > we'll add them to the spreadsheet when we know (they should
> be
> > > close
> > > > > to
> > > > > > each other).
> > > > > >
> > > > > > The spreadsheet above also gives contact information for all
> the
> > > > > > attendees and the programme committee.
> > > > > >
> > > > > > Yours,
> > > > > >
> > > > > > James Bottomley
> > > > > > on behalf of LSF/MM Programme Committee
> > > > > >
> > > > >
> > > > > Here are a few topic ideas:
> > > > >
> > > > > (1)  The first topic that might span IO & FS tracks (or just
> pull
> > > in
> > > > > device
> > > > > mapper people to an FS track) could be adding new commands that
> > > would
> > > > > allow
> > > > > users to grow/shrink/etc file systems in a generic way.  The
> > > thought I
> > > > > had was
> > > > > that we have a reasonable model that we could reuse for these
> new
> > > > > commands like
> > > > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down
> the
> > > > > road, it
> > > > > could be nice to identify exactly what common operations users
> want
> > > to
> > > > > do and
> > > > > agree on how to implement them. Alasdair pointed out in the
> > > upstream
> > > > > thread that
> > > > > we had a prototype here in fsadm.
> > > > >
> > > > > (2) Very high speed, low latency SSD devices and testing. Have
> we
> > > > > settled on the
> > > > > need for these devices to all have block level drivers? For S-
> ATA
> > > or
> > > > > SAS
> > > > > devices, are there known performance issues that require
> > > enhancements
> > > > > in
> > > > > somewhere in the stack?
> > > > >
> > > > > (3) The union mount versus overlayfs debate - pros and cons.
> What
> > > each
> > > > > do well,
> > > > > what needs doing. Do we want/need both upstream? (Maybe this
> can
> > > get 10
> > > > > minutes
> > > > > in Al's VFS session?)
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Ric
> > > >
> > > > A few others that I think may span across I/O, Block fs..layers.
> > > >
> > > > 1) Dm-thinp target vs File system thin profile vs block map based
> > > thin/trim profile.
> > >
> > > > Facilitate I/O throttling for thin/trimmable storage. Online and
> > > Offline profil.
> > >
> > > Is above any different from block IO throttling we have got for
> block
> > > devices?
> > >
> > Yes.. so the throttling would be capacity  based.. when the storage
> array wants us to throttle the I/O. Depending on the event we may keep
> getting space allocation write protect check conditions for writes
> until a user intervenes to stop I/O.
> >
> 
> Sounds like some user space daemon listening for these events and then
> modifying cgroup throttling limits dynamically?

But we have dm-targets in the horizon like dm-thinp setting soft limits on capacity.. we could extend the concept to H/W imposed soft/hard limits.

The user space could throttle the I/O but it had have to go about finding all processes running I/O on the LUN.. In some cases it could be an I/O process running within a VM.. 

That would require a passthrough interface to inform it.. I doubt if we would be able to accomplish that any sooner with the multiple operating systems involved. Or requiring each application to register with the userland process. Doable but cumbersome and buggy..

The dm-thinp target can help in this scenario by setting a blanket storage limit. We could go about extending the limit dynamically based on hints/commands from the userland daemon listening to such events.

This approach will probably not take care of scenarios where VM storage is over say NFS or clustered filesystem..
> 
> >
> > > > 2) Interfaces for SCSI, Ethernet/*transport configuration
> parameters
> > > floating around in sysfs, procfs. Architecting guidelines for
> accepting
> > > patches for hybrid devices.
> > > > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room
> for
> > > all and they have to help each other
> >
> > For instance if you took a DM snapshot and the storage sent a check
> condition to the original dm device I am not sure if the DM snapshot
> would get one too..
> >
> > If you had a scenario of taking H/W snapshot of an entire pool and
> decide to delete the individual DM snapshots the H/W snapshot would be
> inconsistent.
> >
> > The blocks being managed by a DM-device would have moved (SCSI
> referrals). I believe Hannes is working on the referrals piece..
> >
> > > > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick
> your
> > > subsystem and there are many non-cooperating B/W control constructs
> in
> > > each subsystem.
> > >
> > > Above is pretty generic. Do you have specific needs/ideas/concerns?
> > >
> > > Thanks
> > > Vivek
> > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O
> b/w via cgroups. Such bandwidth manipulations are network switch driven
> and cgroups never take care of these events from the Ethernet driver.
> 
> So if IO is going over network and actual bandwidth control is taking
> place by throttling ethernet traffic then one does not have to specify
> block cgroup throttling policy and hence no need for cgroups to be
> worried
> about ethernet driver events?
> 
> I think I am missing something here.
> 
> Vivek
Well.. here is the catch.. example scenario..

- Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1  multipathed together. Let us say round-robin policy.

- The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1

The computation that the bandwidth configured is 40% of the available bandwidth is false in this case.  What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. 

Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint..

Policies are usually decided at different levels, SLAs and sometimes logistics determine these decisions etc. Sometimes the bandwidth lowering by the switch is traffic dependent but user level policies remain in tact. Typical case of network administrator not talking to the system administrator.

-Shyam

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` [Lsf] Preliminary Agenda and Activities for LSF Ric Wheeler
  2011-03-29 11:22   ` Matthew Wilcox
  2011-03-29 17:20   ` Shyam_Iyer
@ 2011-03-29 19:47   ` Nicholas A. Bellinger
  2011-03-29 20:29   ` Jan Kara
  2011-03-30  0:33   ` Mingming Cao
  4 siblings, 0 replies; 43+ messages in thread
From: Nicholas A. Bellinger @ 2011-03-29 19:47 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org,
	device-mapper development

On Tue, 2011-03-29 at 07:16 -0400, Ric Wheeler wrote:
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> > Hi All,
> >
> > Since LSF is less than a week away, the programme committee put together
> > a just in time preliminary agenda for LSF.  As you can see there is
> > still plenty of empty space, which you can make suggestions (to this
> > list with appropriate general list cc's) for filling:
> >
> > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> >
> > If you don't make suggestions, the programme committee will feel
> > empowered to make arbitrary assignments based on your topic and attendee
> > email requests ...
> >
> > We're still not quite sure what rooms we will have at the Kabuki, but
> > we'll add them to the spreadsheet when we know (they should be close to
> > each other).
> >
> > The spreadsheet above also gives contact information for all the
> > attendees and the programme committee.
> >
> > Yours,
> >
> > James Bottomley
> > on behalf of LSF/MM Programme Committee
> >
> 
> Here are a few topic ideas:
> 
> (1)  The first topic that might span IO & FS tracks (or just pull in device 
> mapper people to an FS track) could be adding new commands that would allow 
> users to grow/shrink/etc file systems in a generic way.  The thought I had was 
> that we have a reasonable model that we could reuse for these new commands like 
> mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it 
> could be nice to identify exactly what common operations users want to do and 
> agree on how to implement them. Alasdair pointed out in the upstream thread that 
> we had a prototype here in fsadm.
> 
> (2) Very high speed, low latency SSD devices and testing. Have we settled on the 
> need for these devices to all have block level drivers? For S-ATA or SAS 
> devices, are there known performance issues that require enhancements in 
> somewhere in the stack?
> 
> (3) The union mount versus overlayfs debate - pros and cons. What each do well, 
> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes 
> in Al's VFS session?)
> 

Hi Ric, James and LSF-PC chairs,

Beyond my original LSF topic proposal for the next-generation QEMU/KVM
Virtio-SCSI target driver here:

http://marc.info/?l=linux-scsi&m=129706545408966&w=2

The following target mode related topics would be useful for the current
attendees with interest in /drivers/target/ code if there is extra room
available for local attendance within the IO/storage track.

(4) Enabling mixed Target/Initiator mode in existing mainline SCSI LLDs
that support HW target mode, and come to an consensus determination for
how best to make the SCSI LLD / target fabric driver split when enabling
mainline target infrastructure support into existing SCSI LLDs.  This
code is currently in flight for qla2xxx / tcm_qla2xxx for .40  (Hannes,
Christoph, Mike, Qlogic and other LLD maintainers)

(5) Driving target configfs group creation from kernel-space via a
userspace passthrough using some form of portable / acceptable mainline
interface.  This is a topic that has been raised on the scsi list for
the ibmvscsis target driver for .40, and is going to be useful for other
in-flight HW target driver as well. (Tomo-san, Hannes, Mike, James,
Joel)

Thank you!

--nab


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 19:13           ` Shyam_Iyer
@ 2011-03-29 19:57             ` Vivek Goyal
  2011-03-29 19:59             ` Mike Snitzer
  1 sibling, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2011-03-29 19:57 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel,
	linux-scsi

On Tue, Mar 29, 2011 at 12:13:41PM -0700, Shyam_Iyer@Dell.com wrote:

[..]
> > 
> > Sounds like some user space daemon listening for these events and then
> > modifying cgroup throttling limits dynamically?
> 
> But we have dm-targets in the horizon like dm-thinp setting soft limits on capacity.. we could extend the concept to H/W imposed soft/hard limits.
> 
> The user space could throttle the I/O but it had have to go about finding all processes running I/O on the LUN.. In some cases it could be an I/O process running within a VM.. 

Well, if there is one cgroup (root cgroup), then daemon does not have to
find anything. This is one global space and there is provision to set
per device limit. So daemon can just go and adjust device limits
dynamically and that gets applicable for all processes.

The problem will happen if there are more cgroups created and limits are
per cgroup, per device. (For creating service differentiation). I would
say in that case daemon needs to be more sophisticated and reduce the
limit in each group by same % as required by thinly provisioned target.

That way a higher rate group will still get higher IO rate on a thinly
provisioned device which is imposing its own throttling. Otherwise we
again run into issues where there is no service differentiation between
faster group or slower group.

IOW, if we are throttling thinly povisioned devices, I think throttling
these using a user space daemon might be better as it will reuse the
kernel throttling infrastructure as well as throttling will be cgroup
aware.
 
> 
> That would require a passthrough interface to inform it.. I doubt if we would be able to accomplish that any sooner with the multiple operating systems involved. Or requiring each application to register with the userland process. Doable but cumbersome and buggy..
> 
> The dm-thinp target can help in this scenario by setting a blanket storage limit. We could go about extending the limit dynamically based on hints/commands from the userland daemon listening to such events.
> 
> This approach will probably not take care of scenarios where VM storage is over say NFS or clustered filesystem..

Even current blkio throttling does not work over NFS. This is one of the
issues I wanted to discuss at LSF.

[..]
> Well.. here is the catch.. example scenario..
> 
> - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1  multipathed together. Let us say round-robin policy.
> 
> - The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1
> 
> The computation that the bandwidth configured is 40% of the available bandwidth is false in this case.  What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. 
> 
> Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint..
> 

So we have multipathed two paths in a round robin manner and one path is
faster and other is slower. I am not sure what multipath does in those
scenarios but trying to send more IO on faster path sounds like right
thing to do.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Preliminary Agenda and Activities for LSF
  2011-03-29 19:13           ` Shyam_Iyer
  2011-03-29 19:57             ` Vivek Goyal
@ 2011-03-29 19:59             ` Mike Snitzer
  2011-03-29 20:12               ` Shyam_Iyer
  1 sibling, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2011-03-29 19:59 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: vgoyal, lsf, linux-scsi, linux-fsdevel, rwheeler,
	device-mapper development

On Tue, Mar 29 2011 at  3:13pm -0400,
Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:

> > > > Above is pretty generic. Do you have specific needs/ideas/concerns?
> > > >
> > > > Thanks
> > > > Vivek
> > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O
> > b/w via cgroups. Such bandwidth manipulations are network switch driven
> > and cgroups never take care of these events from the Ethernet driver.
> > 
> > So if IO is going over network and actual bandwidth control is taking
> > place by throttling ethernet traffic then one does not have to specify
> > block cgroup throttling policy and hence no need for cgroups to be
> > worried
> > about ethernet driver events?
> > 
> > I think I am missing something here.
> > 
> > Vivek
> Well.. here is the catch.. example scenario..
> 
> - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1  multipathed together. Let us say round-robin policy.
> 
> - The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1
> 
> The computation that the bandwidth configured is 40% of the available bandwidth is false in this case.  What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. 
> 
> Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint..

No hint should be needed.  Just use one of the newer multipath path
selectors that are dynamic by design: "queue-length" or "service-time".

This scenario is exactly what those path selectors are meant to address.

Mike

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: Preliminary Agenda and Activities for LSF
  2011-03-29 19:59             ` Mike Snitzer
@ 2011-03-29 20:12               ` Shyam_Iyer
  2011-03-29 20:23                 ` Mike Snitzer
  0 siblings, 1 reply; 43+ messages in thread
From: Shyam_Iyer @ 2011-03-29 20:12 UTC (permalink / raw)
  To: snitzer; +Cc: vgoyal, lsf, linux-scsi, linux-fsdevel, rwheeler, dm-devel



> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Tuesday, March 29, 2011 4:00 PM
> To: Iyer, Shyam
> Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux-
> scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> rwheeler@redhat.com; device-mapper development
> Subject: Re: Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29 2011 at  3:13pm -0400,
> Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> 
> > > > > Above is pretty generic. Do you have specific
> needs/ideas/concerns?
> > > > >
> > > > > Thanks
> > > > > Vivek
> > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit
> I/O
> > > b/w via cgroups. Such bandwidth manipulations are network switch
> driven
> > > and cgroups never take care of these events from the Ethernet
> driver.
> > >
> > > So if IO is going over network and actual bandwidth control is
> taking
> > > place by throttling ethernet traffic then one does not have to
> specify
> > > block cgroup throttling policy and hence no need for cgroups to be
> > > worried
> > > about ethernet driver events?
> > >
> > > I think I am missing something here.
> > >
> > > Vivek
> > Well.. here is the catch.. example scenario..
> >
> > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1
> multipathed together. Let us say round-robin policy.
> >
> > - The cgroup profile is to limit I/O bandwidth to 40% of the
> multipathed I/O bandwidth. But the switch may have limited the I/O
> bandwidth to 40% for the corresponding vlan associated with one of the
> eth interface say eth1
> >
> > The computation that the bandwidth configured is 40% of the available
> bandwidth is false in this case.  What we need to do is possibly push
> more I/O through eth0 as it is allowed to run at 100% of bandwidth by
> the switch.
> >
> > Now this is a dynamic decision and multipathing layer should take
> care of it.. but it would need a hint..
> 
> No hint should be needed.  Just use one of the newer multipath path
> selectors that are dynamic by design: "queue-length" or "service-time".
> 
> This scenario is exactly what those path selectors are meant to
> address.
> 
> Mike

Since iSCSI multipaths are essentially sessions one could configure more than one session through the same ethX interface. The sessions need not be going to the same LUN and hence not governed by the same multipath selector but the bandwidth policy group would be for a group of resources.

-Shyam





^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Preliminary Agenda and Activities for LSF
  2011-03-29 20:12               ` Shyam_Iyer
@ 2011-03-29 20:23                 ` Mike Snitzer
  2011-03-29 23:09                   ` Shyam_Iyer
  0 siblings, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2011-03-29 20:23 UTC (permalink / raw)
  To: Shyam_Iyer
  Cc: linux-scsi, lsf, linux-fsdevel, rwheeler, vgoyal,
	device-mapper development

On Tue, Mar 29 2011 at  4:12pm -0400,
Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:

> 
> 
> > -----Original Message-----
> > From: Mike Snitzer [mailto:snitzer@redhat.com]
> > Sent: Tuesday, March 29, 2011 4:00 PM
> > To: Iyer, Shyam
> > Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux-
> > scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> > rwheeler@redhat.com; device-mapper development
> > Subject: Re: Preliminary Agenda and Activities for LSF
> > 
> > On Tue, Mar 29 2011 at  3:13pm -0400,
> > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> > 
> > > > > > Above is pretty generic. Do you have specific
> > needs/ideas/concerns?
> > > > > >
> > > > > > Thanks
> > > > > > Vivek
> > > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit
> > I/O
> > > > b/w via cgroups. Such bandwidth manipulations are network switch
> > driven
> > > > and cgroups never take care of these events from the Ethernet
> > driver.
> > > >
> > > > So if IO is going over network and actual bandwidth control is
> > taking
> > > > place by throttling ethernet traffic then one does not have to
> > specify
> > > > block cgroup throttling policy and hence no need for cgroups to be
> > > > worried
> > > > about ethernet driver events?
> > > >
> > > > I think I am missing something here.
> > > >
> > > > Vivek
> > > Well.. here is the catch.. example scenario..
> > >
> > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1
> > multipathed together. Let us say round-robin policy.
> > >
> > > - The cgroup profile is to limit I/O bandwidth to 40% of the
> > multipathed I/O bandwidth. But the switch may have limited the I/O
> > bandwidth to 40% for the corresponding vlan associated with one of the
> > eth interface say eth1
> > >
> > > The computation that the bandwidth configured is 40% of the available
> > bandwidth is false in this case.  What we need to do is possibly push
> > more I/O through eth0 as it is allowed to run at 100% of bandwidth by
> > the switch.
> > >
> > > Now this is a dynamic decision and multipathing layer should take
> > care of it.. but it would need a hint..
> > 
> > No hint should be needed.  Just use one of the newer multipath path
> > selectors that are dynamic by design: "queue-length" or "service-time".
> > 
> > This scenario is exactly what those path selectors are meant to
> > address.
> > 
> > Mike
> 
> Since iSCSI multipaths are essentially sessions one could configure
> more than one session through the same ethX interface. The sessions
> need not be going to the same LUN and hence not governed by the same
> multipath selector but the bandwidth policy group would be for a group
> of resources.

Then the sessions don't correspond to the same backend LUN (and by
definition aren't part of the same mpath device).  You're really all
over the map with your talking points.

I'm having a hard time following you.

Mike

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` [Lsf] Preliminary Agenda and Activities for LSF Ric Wheeler
                     ` (2 preceding siblings ...)
  2011-03-29 19:47   ` Nicholas A. Bellinger
@ 2011-03-29 20:29   ` Jan Kara
  2011-03-29 20:31     ` Ric Wheeler
  2011-03-30  0:33   ` Mingming Cao
  4 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2011-03-29 20:29 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, lsf, linux-fsdevel, device-mapper development,
	linux-scsi@vger.kernel.org

On Tue 29-03-11 07:16:32, Ric Wheeler wrote:
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> (3) The union mount versus overlayfs debate - pros and cons. What each do well, 
> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes 
> in Al's VFS session?)
  It might be interesting but neither Miklos nor Val seems to be attending
so I'm not sure how deep discussion we can have :).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 20:29   ` Jan Kara
@ 2011-03-29 20:31     ` Ric Wheeler
  0 siblings, 0 replies; 43+ messages in thread
From: Ric Wheeler @ 2011-03-29 20:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ric Wheeler, James Bottomley, lsf, device-mapper development,
	linux-fsdevel, linux-scsi@vger.kernel.org

On 03/29/2011 04:29 PM, Jan Kara wrote:
> On Tue 29-03-11 07:16:32, Ric Wheeler wrote:
>> On 03/29/2011 12:36 AM, James Bottomley wrote:
>> (3) The union mount versus overlayfs debate - pros and cons. What each do well,
>> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes
>> in Al's VFS session?)
>    It might be interesting but neither Miklos nor Val seems to be attending
> so I'm not sure how deep discussion we can have :).
>
> 								Honza

Very true - probably best to keep that discussion focused upstream (but that 
seems to have quieted down as well)...

Ric


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: Preliminary Agenda and Activities for LSF
  2011-03-29 20:23                 ` Mike Snitzer
@ 2011-03-29 23:09                   ` Shyam_Iyer
  2011-03-30  5:58                     ` [Lsf] " Hannes Reinecke
  0 siblings, 1 reply; 43+ messages in thread
From: Shyam_Iyer @ 2011-03-29 23:09 UTC (permalink / raw)
  To: snitzer; +Cc: linux-scsi, lsf, linux-fsdevel, rwheeler, vgoyal, dm-devel



> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Tuesday, March 29, 2011 4:24 PM
> To: Iyer, Shyam
> Cc: linux-scsi@vger.kernel.org; lsf@lists.linux-foundation.org; linux-
> fsdevel@vger.kernel.org; rwheeler@redhat.com; vgoyal@redhat.com;
> device-mapper development
> Subject: Re: Preliminary Agenda and Activities for LSF
> 
> On Tue, Mar 29 2011 at  4:12pm -0400,
> Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> 
> >
> >
> > > -----Original Message-----
> > > From: Mike Snitzer [mailto:snitzer@redhat.com]
> > > Sent: Tuesday, March 29, 2011 4:00 PM
> > > To: Iyer, Shyam
> > > Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux-
> > > scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> > > rwheeler@redhat.com; device-mapper development
> > > Subject: Re: Preliminary Agenda and Activities for LSF
> > >
> > > On Tue, Mar 29 2011 at  3:13pm -0400,
> > > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote:
> > >
> > > > > > > Above is pretty generic. Do you have specific
> > > needs/ideas/concerns?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vivek
> > > > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to
> limit
> > > I/O
> > > > > b/w via cgroups. Such bandwidth manipulations are network
> switch
> > > driven
> > > > > and cgroups never take care of these events from the Ethernet
> > > driver.
> > > > >
> > > > > So if IO is going over network and actual bandwidth control is
> > > taking
> > > > > place by throttling ethernet traffic then one does not have to
> > > specify
> > > > > block cgroup throttling policy and hence no need for cgroups to
> be
> > > > > worried
> > > > > about ethernet driver events?
> > > > >
> > > > > I think I am missing something here.
> > > > >
> > > > > Vivek
> > > > Well.. here is the catch.. example scenario..
> > > >
> > > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1
> > > multipathed together. Let us say round-robin policy.
> > > >
> > > > - The cgroup profile is to limit I/O bandwidth to 40% of the
> > > multipathed I/O bandwidth. But the switch may have limited the I/O
> > > bandwidth to 40% for the corresponding vlan associated with one of
> the
> > > eth interface say eth1
> > > >
> > > > The computation that the bandwidth configured is 40% of the
> available
> > > bandwidth is false in this case.  What we need to do is possibly
> push
> > > more I/O through eth0 as it is allowed to run at 100% of bandwidth
> by
> > > the switch.
> > > >
> > > > Now this is a dynamic decision and multipathing layer should take
> > > care of it.. but it would need a hint..
> > >
> > > No hint should be needed.  Just use one of the newer multipath path
> > > selectors that are dynamic by design: "queue-length" or "service-
> time".
> > >
> > > This scenario is exactly what those path selectors are meant to
> > > address.
> > >
> > > Mike
> >
> > Since iSCSI multipaths are essentially sessions one could configure
> > more than one session through the same ethX interface. The sessions
> > need not be going to the same LUN and hence not governed by the same
> > multipath selector but the bandwidth policy group would be for a
> group
> > of resources.
> 
> Then the sessions don't correspond to the same backend LUN (and by
> definition aren't part of the same mpath device).  You're really all
> over the map with your talking points.
> 
> I'm having a hard time following you.
> 
> Mike

Let me back up here.. this has to be thought in not only the traditional Ethernet sense but also in a Data Centre Bridged environment. I shouldn't have wandered into the multipath constructs..

I think the statement on not going to the same LUN was a little erroneous. I meant different /dev/sdXs.. and hence different block I/O queues.

Each I/O queue could be thought of as a bandwidth queue class being serviced through a corresponding network adapter's queue(assuming a multiqueue capable adapter)

Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group corresponding to a weightage of 20% of the I/O bandwidth the user has configured this weight thinking that this will correspond to say 200Mb of bandwidth.

Let us say the network bandwidth on the corresponding network queues corresponding was reduced by the DCB capable switch...
We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.

In such a scenario the option is to move I/O to a different bandwidth priority queue in the network adapter. This could be moving I/O to a new network queue in eth0 or another queue in eth1 .. 

This requires mapping the block queue to the new network queue.

One way of solving this is what is getting into the open-iscsi world i.e. creating a session tagged to the relevant DCB priority and thus the session gets mapped to the relevant tc queue which ultimately maps to one of the network adapters multiqueue..

But when multipath fails over to the different session path then the DCB bandwidth priority will not move with it..

Ok one could argue that is a user mistake to have configured bandwidth priorities differently but it may so happen that the bandwidth priority was just dynamically changed by the switch for the particular queue.

Although I gave an example of a DCB environment but we could definitely look at doing a 1:n map of block queues to network adapter queues for non-DCB environments too..


-Shyam


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 11:16 ` [Lsf] Preliminary Agenda and Activities for LSF Ric Wheeler
                     ` (3 preceding siblings ...)
  2011-03-29 20:29   ` Jan Kara
@ 2011-03-30  0:33   ` Mingming Cao
  2011-03-30  2:17     ` Dave Chinner
  4 siblings, 1 reply; 43+ messages in thread
From: Mingming Cao @ 2011-03-30  0:33 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org,
	device-mapper development

On Tue, 2011-03-29 at 07:16 -0400, Ric Wheeler wrote:
> On 03/29/2011 12:36 AM, James Bottomley wrote:
> > Hi All,
> >
> > Since LSF is less than a week away, the programme committee put together
> > a just in time preliminary agenda for LSF.  As you can see there is
> > still plenty of empty space, which you can make suggestions (to this
> > list with appropriate general list cc's) for filling:
> >
> > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html
> >
> > If you don't make suggestions, the programme committee will feel
> > empowered to make arbitrary assignments based on your topic and attendee
> > email requests ...
> >
> > We're still not quite sure what rooms we will have at the Kabuki, but
> > we'll add them to the spreadsheet when we know (they should be close to
> > each other).
> >
> > The spreadsheet above also gives contact information for all the
> > attendees and the programme committee.
> >
> > Yours,
> >
> > James Bottomley
> > on behalf of LSF/MM Programme Committee
> >
> 
> Here are a few topic ideas:
> 
> (1)  The first topic that might span IO & FS tracks (or just pull in device 
> mapper people to an FS track) could be adding new commands that would allow 
> users to grow/shrink/etc file systems in a generic way.  The thought I had was 
> that we have a reasonable model that we could reuse for these new commands like 
> mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it 
> could be nice to identify exactly what common operations users want to do and 
> agree on how to implement them. Alasdair pointed out in the upstream thread that 
> we had a prototype here in fsadm.
> 
> (2) Very high speed, low latency SSD devices and testing. Have we settled on the 
> need for these devices to all have block level drivers? For S-ATA or SAS 
> devices, are there known performance issues that require enhancements in 
> somewhere in the stack?
> 
> (3) The union mount versus overlayfs debate - pros and cons. What each do well, 
> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes 
> in Al's VFS session?)
> 

Ric,

May I propose some discussion about concurrent direct IO support for
ext4?

Direct IO write are serialized by the single i_mutex lock.  This lock
contention becomes significant when running database or direct IO heavy
workload on guest, where  the host pass a file image to guest as a block
device. All the parallel IOs in guests are being serialized by the
i_mutex lock on the host disk image file. This greatly penalize the data
base application performance in KVM. 

I am looking for some discussion about removing the i_mutex lock in the
direct IO write code path for ext4, when multiple threads are
direct write to different offset of the same file. This would require
some way to track the in-fly DIO IO range, either done at ext4 level or
above th vfs layer. 


Thanks,



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30  0:33   ` Mingming Cao
@ 2011-03-30  2:17     ` Dave Chinner
  2011-03-30 11:13       ` Theodore Tso
  2011-03-30 21:49       ` Mingming Cao
  0 siblings, 2 replies; 43+ messages in thread
From: Dave Chinner @ 2011-03-30  2:17 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi@vger.kernel.org, device-mapper development

On Tue, Mar 29, 2011 at 05:33:30PM -0700, Mingming Cao wrote:
> Ric,
> 
> May I propose some discussion about concurrent direct IO support for
> ext4?

Just look at the way XFS does it and copy that?  i.e. it has a
filesytem level IO lock and an inode lock both with shared/exclusive
semantics. These lie below the i_mutex (i.e. locking order is
i_mutex, i_iolock, i_ilock), and effectively result in the i_mutex
only being used for VFS level synchronisation and as such is rarely
used inside XFS itself.

Inode attribute operations are protected by the inode lock, while IO
operations and truncation synchronisation is provided by the IO
lock.

So for buffered IO, the IO lock is used in shared mode for reads
and exclusive mode for writes. This gives normal POSIX buffered IO
semantics and holding the IO lock exclusive allows sycnhronisation
against new IO of any kind for truncate.

For direct IO, the IO lock is always taken in shared mode, so we can
have concurrent read and write operations taking place at once
regardless of the offset into the file.

> I am looking for some discussion about removing the i_mutex lock in the
> direct IO write code path for ext4, when multiple threads are
> direct write to different offset of the same file. This would require
> some way to track the in-fly DIO IO range, either done at ext4 level or
> above th vfs layer. 

Direct IO semantics have always been that the application is allowed
to overlap IO to the same range if it wants to. The result is
undefined (just like issuing overlapping reads and writes to a disk
at the same time) so it's the application's responsibility to avoid
overlapping IO if it is a problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-29 23:09                   ` Shyam_Iyer
@ 2011-03-30  5:58                     ` Hannes Reinecke
  2011-03-30 14:02                       ` James Bottomley
  0 siblings, 1 reply; 43+ messages in thread
From: Hannes Reinecke @ 2011-03-30  5:58 UTC (permalink / raw)
  To: Shyam_Iyer; +Cc: snitzer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote:
>
> Let me back up here.. this has to be thought in not only the traditional Ethernet
 > sense but also in a Data Centre Bridged environment. I shouldn't 
have wandered
 > into the multipath constructs..
>
> I think the statement on not going to the same LUN was a little erroneous. I meant
 > different /dev/sdXs.. and hence different block I/O queues.
>
> Each I/O queue could be thought of as a bandwidth queue class being serviced through
 > a corresponding network adapter's queue(assuming a multiqueue 
capable adapter)
>
> Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group
 > corresponding to a weightage of 20% of the I/O bandwidth the user 
has configured
 > this weight thinking that this will correspond to say 200Mb of 
bandwidth.
>
> Let us say the network bandwidth on the corresponding network queues corresponding
 > was reduced by the DCB capable switch...
> We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.
>
> In such a scenario the option is to move I/O to a different bandwidth priority queue
 > in the network adapter. This could be moving I/O to a new network 
queue in eth0 or
 > another queue in eth1 ..
>
> This requires mapping the block queue to the new network queue.
>
> One way of solving this is what is getting into the open-iscsi world i.e. creating
 > a session tagged to the relevant DCB priority and thus the 
session gets mapped
 > to the relevant tc queue which ultimately maps to one of the 
network adapters multiqueue..
>
> But when multipath fails over to the different session path then the DCB bandwidth
 > priority will not move with it..
>
> Ok one could argue that is a user mistake to have configured bandwidth priorities
 > differently but it may so happen that the bandwidth priority was 
just dynamically
 > changed by the switch for the particular queue.
>
> Although I gave an example of a DCB environment but we could definitely look at
 > doing a 1:n map of block queues to network adapter queues for 
non-DCB environments too..
>
That sounds quite convoluted enough to warrant it's own slot :-)

No, seriously. I think it would be good to have a separate slot 
discussing DCB (be it FCoE or iSCSI) and cgroups.
And how to best align these things.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30  2:17     ` Dave Chinner
@ 2011-03-30 11:13       ` Theodore Tso
  2011-03-30 11:28         ` Ric Wheeler
  2011-03-30 21:49       ` Mingming Cao
  1 sibling, 1 reply; 43+ messages in thread
From: Theodore Tso @ 2011-03-30 11:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mingming Cao, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi@vger.kernel.org, device-mapper development


On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote:

> Direct IO semantics have always been that the application is allowed
> to overlap IO to the same range if it wants to. The result is
> undefined (just like issuing overlapping reads and writes to a disk
> at the same time) so it's the application's responsibility to avoid
> overlapping IO if it is a problem.

Even if the overlapping read/writes are taking place in different processes?

DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors.  The lack of formal specifications of what applications are guaranteed to receive is unfortunate....

-- Ted


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 11:13       ` Theodore Tso
@ 2011-03-30 11:28         ` Ric Wheeler
  2011-03-30 14:07           ` Chris Mason
  2011-04-01 15:19           ` Ted Ts'o
  0 siblings, 2 replies; 43+ messages in thread
From: Ric Wheeler @ 2011-03-30 11:28 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Dave Chinner, lsf, linux-scsi@vger.kernel.org, James Bottomley,
	device-mapper development, linux-fsdevel, Ric Wheeler

On 03/30/2011 07:13 AM, Theodore Tso wrote:
> On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote:
>
>> Direct IO semantics have always been that the application is allowed
>> to overlap IO to the same range if it wants to. The result is
>> undefined (just like issuing overlapping reads and writes to a disk
>> at the same time) so it's the application's responsibility to avoid
>> overlapping IO if it is a problem.
> Even if the overlapping read/writes are taking place in different processes?
>
> DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors.  The lack of formal specifications of what applications are guaranteed to receive is unfortunate....
>
> -- Ted

What possible semantics could you have?

If you ever write concurrently from multiple processes without locking, you 
clearly are at the mercy of the scheduler and the underlying storage which could 
fragment a single write into multiple IO's sent to the backend device.

I would agree with Dave, let's not make it overly complicated or try to give 
people "atomic" unbounded size writes just because they set the O_DIRECT flag :)

Ric


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30  5:58                     ` [Lsf] " Hannes Reinecke
@ 2011-03-30 14:02                       ` James Bottomley
  2011-03-30 14:10                         ` Hannes Reinecke
  0 siblings, 1 reply; 43+ messages in thread
From: James Bottomley @ 2011-03-30 14:02 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote:
> >
> > Let me back up here.. this has to be thought in not only the traditional Ethernet
>  > sense but also in a Data Centre Bridged environment. I shouldn't 
> have wandered
>  > into the multipath constructs..
> >
> > I think the statement on not going to the same LUN was a little erroneous. I meant
>  > different /dev/sdXs.. and hence different block I/O queues.
> >
> > Each I/O queue could be thought of as a bandwidth queue class being serviced through
>  > a corresponding network adapter's queue(assuming a multiqueue 
> capable adapter)
> >
> > Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group
>  > corresponding to a weightage of 20% of the I/O bandwidth the user 
> has configured
>  > this weight thinking that this will correspond to say 200Mb of 
> bandwidth.
> >
> > Let us say the network bandwidth on the corresponding network queues corresponding
>  > was reduced by the DCB capable switch...
> > We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.
> >
> > In such a scenario the option is to move I/O to a different bandwidth priority queue
>  > in the network adapter. This could be moving I/O to a new network 
> queue in eth0 or
>  > another queue in eth1 ..
> >
> > This requires mapping the block queue to the new network queue.
> >
> > One way of solving this is what is getting into the open-iscsi world i.e. creating
>  > a session tagged to the relevant DCB priority and thus the 
> session gets mapped
>  > to the relevant tc queue which ultimately maps to one of the 
> network adapters multiqueue..
> >
> > But when multipath fails over to the different session path then the DCB bandwidth
>  > priority will not move with it..
> >
> > Ok one could argue that is a user mistake to have configured bandwidth priorities
>  > differently but it may so happen that the bandwidth priority was 
> just dynamically
>  > changed by the switch for the particular queue.
> >
> > Although I gave an example of a DCB environment but we could definitely look at
>  > doing a 1:n map of block queues to network adapter queues for 
> non-DCB environments too..
> >
> That sounds quite convoluted enough to warrant it's own slot :-)
> 
> No, seriously. I think it would be good to have a separate slot 
> discussing DCB (be it FCoE or iSCSI) and cgroups.
> And how to best align these things.

OK, I'll go for that ... Data Centre Bridging; experiences, technologies
and needs ... something like that.  What about virtualisation and open
vSwitch?

James



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 11:28         ` Ric Wheeler
@ 2011-03-30 14:07           ` Chris Mason
  2011-04-01 15:19           ` Ted Ts'o
  1 sibling, 0 replies; 43+ messages in thread
From: Chris Mason @ 2011-03-30 14:07 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Dave Chinner, lsf, linux-scsi@vger.kernel.org,
	James Bottomley, device-mapper development, linux-fsdevel,
	Ric Wheeler

Excerpts from Ric Wheeler's message of 2011-03-30 07:28:34 -0400:
> On 03/30/2011 07:13 AM, Theodore Tso wrote:
> > On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote:
> >
> >> Direct IO semantics have always been that the application is allowed
> >> to overlap IO to the same range if it wants to. The result is
> >> undefined (just like issuing overlapping reads and writes to a disk
> >> at the same time) so it's the application's responsibility to avoid
> >> overlapping IO if it is a problem.
> > Even if the overlapping read/writes are taking place in different processes?
> >
> > DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors.  The lack of formal specifications of what applications are guaranteed to receive is unfortunate....
> >
> > -- Ted
> 
> What possible semantics could you have?
> 
> If you ever write concurrently from multiple processes without locking, you 
> clearly are at the mercy of the scheduler and the underlying storage which could 
> fragment a single write into multiple IO's sent to the backend device.
> 
> I would agree with Dave, let's not make it overly complicated or try to give 
> people "atomic" unbounded size writes just because they set the O_DIRECT flag :)

We've talked about this with the oracle database people at least, any
concurrent O_DIRECT ios to the same area would be considered a db bug.
As long as it doesn't make the kernel crash or hang, we can return
one of these: http://www.youtube.com/watch?v=rX7wtNOkuHo

IBM might have a different answer, but I don't see how you can have good
results from mixing concurrent IOs.

-chris

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 14:02                       ` James Bottomley
@ 2011-03-30 14:10                         ` Hannes Reinecke
  2011-03-30 14:26                           ` James Bottomley
  0 siblings, 1 reply; 43+ messages in thread
From: Hannes Reinecke @ 2011-03-30 14:10 UTC (permalink / raw)
  To: James Bottomley
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On 03/30/2011 04:02 PM, James Bottomley wrote:
> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
>> On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote:
>>>
>>> Let me back up here.. this has to be thought in not only the traditional Ethernet
>>   >  sense but also in a Data Centre Bridged environment. I shouldn't
>> have wandered
>>   >  into the multipath constructs..
>>>
>>> I think the statement on not going to the same LUN was a little erroneous. I meant
>>   >  different /dev/sdXs.. and hence different block I/O queues.
>>>
>>> Each I/O queue could be thought of as a bandwidth queue class being serviced through
>>   >  a corresponding network adapter's queue(assuming a multiqueue
>> capable adapter)
>>>
>>> Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group
>>   >  corresponding to a weightage of 20% of the I/O bandwidth the user
>> has configured
>>   >  this weight thinking that this will correspond to say 200Mb of
>> bandwidth.
>>>
>>> Let us say the network bandwidth on the corresponding network queues corresponding
>>   >  was reduced by the DCB capable switch...
>>> We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed.
>>>
>>> In such a scenario the option is to move I/O to a different bandwidth priority queue
>>   >  in the network adapter. This could be moving I/O to a new network
>> queue in eth0 or
>>   >  another queue in eth1 ..
>>>
>>> This requires mapping the block queue to the new network queue.
>>>
>>> One way of solving this is what is getting into the open-iscsi world i.e. creating
>>   >  a session tagged to the relevant DCB priority and thus the
>> session gets mapped
>>   >  to the relevant tc queue which ultimately maps to one of the
>> network adapters multiqueue..
>>>
>>> But when multipath fails over to the different session path then the DCB bandwidth
>>   >  priority will not move with it..
>>>
>>> Ok one could argue that is a user mistake to have configured bandwidth priorities
>>   >  differently but it may so happen that the bandwidth priority was
>> just dynamically
>>   >  changed by the switch for the particular queue.
>>>
>>> Although I gave an example of a DCB environment but we could definitely look at
>>   >  doing a 1:n map of block queues to network adapter queues for
>> non-DCB environments too..
>>>
>> That sounds quite convoluted enough to warrant it's own slot :-)
>>
>> No, seriously. I think it would be good to have a separate slot
>> discussing DCB (be it FCoE or iSCSI) and cgroups.
>> And how to best align these things.
>
> OK, I'll go for that ... Data Centre Bridging; experiences, technologies
> and needs ... something like that.  What about virtualisation and open
> vSwitch?
>
Hmm. Not qualified enough to talk about the latter; I was more 
envisioning the storage-related aspects here (multiqueue mapping, 
QoS classes etc). With virtualisation and open vSwitch we're more in
the network side of things; doubt open vSwitch can do DCB.
And even if it could, virtio certainly can't :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 14:10                         ` Hannes Reinecke
@ 2011-03-30 14:26                           ` James Bottomley
  2011-03-30 14:55                             ` Hannes Reinecke
  0 siblings, 1 reply; 43+ messages in thread
From: James Bottomley @ 2011-03-30 14:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
> On 03/30/2011 04:02 PM, James Bottomley wrote:
> > On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> >> No, seriously. I think it would be good to have a separate slot
> >> discussing DCB (be it FCoE or iSCSI) and cgroups.
> >> And how to best align these things.
> >
> > OK, I'll go for that ... Data Centre Bridging; experiences, technologies
> > and needs ... something like that.  What about virtualisation and open
> > vSwitch?
> >
> Hmm. Not qualified enough to talk about the latter; I was more 
> envisioning the storage-related aspects here (multiqueue mapping, 
> QoS classes etc). With virtualisation and open vSwitch we're more in
> the network side of things; doubt open vSwitch can do DCB.
> And even if it could, virtio certainly can't :-)

Technically, the topic DCB is about Data Centre Ethernet enhancements
and converged networks ... that's why it's naturally allied to virtual
switching.

I was thinking we might put up a panel of vendors to get us all an
education on the topic ...

James



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 14:26                           ` James Bottomley
@ 2011-03-30 14:55                             ` Hannes Reinecke
  2011-03-30 15:33                               ` James Bottomley
  0 siblings, 1 reply; 43+ messages in thread
From: Hannes Reinecke @ 2011-03-30 14:55 UTC (permalink / raw)
  To: James Bottomley
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On 03/30/2011 04:26 PM, James Bottomley wrote:
> On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
>> On 03/30/2011 04:02 PM, James Bottomley wrote:
>>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
>>>> No, seriously. I think it would be good to have a separate slot
>>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
>>>> And how to best align these things.
>>>
>>> OK, I'll go for that ... Data Centre Bridging; experiences, technologies
>>> and needs ... something like that.  What about virtualisation and open
>>> vSwitch?
>>>
>> Hmm. Not qualified enough to talk about the latter; I was more
>> envisioning the storage-related aspects here (multiqueue mapping,
>> QoS classes etc). With virtualisation and open vSwitch we're more in
>> the network side of things; doubt open vSwitch can do DCB.
>> And even if it could, virtio certainly can't :-)
>
> Technically, the topic DCB is about Data Centre Ethernet enhancements
> and converged networks ... that's why it's naturally allied to virtual
> switching.
>
> I was thinking we might put up a panel of vendors to get us all an
> education on the topic ...
>
Oh, but gladly.
Didn't know we had some at the LSF.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 14:55                             ` Hannes Reinecke
@ 2011-03-30 15:33                               ` James Bottomley
  2011-03-30 15:46                                 ` Shyam_Iyer
  2011-03-30 20:32                                 ` Giridhar Malavali
  0 siblings, 2 replies; 43+ messages in thread
From: James Bottomley @ 2011-03-30 15:33 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler

On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote:
> On 03/30/2011 04:26 PM, James Bottomley wrote:
> > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
> >> On 03/30/2011 04:02 PM, James Bottomley wrote:
> >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> >>>> No, seriously. I think it would be good to have a separate slot
> >>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
> >>>> And how to best align these things.
> >>>
> >>> OK, I'll go for that ... Data Centre Bridging; experiences, technologies
> >>> and needs ... something like that.  What about virtualisation and open
> >>> vSwitch?
> >>>
> >> Hmm. Not qualified enough to talk about the latter; I was more
> >> envisioning the storage-related aspects here (multiqueue mapping,
> >> QoS classes etc). With virtualisation and open vSwitch we're more in
> >> the network side of things; doubt open vSwitch can do DCB.
> >> And even if it could, virtio certainly can't :-)
> >
> > Technically, the topic DCB is about Data Centre Ethernet enhancements
> > and converged networks ... that's why it's naturally allied to virtual
> > switching.
> >
> > I was thinking we might put up a panel of vendors to get us all an
> > education on the topic ...
> >
> Oh, but gladly.
> Didn't know we had some at the LSF.

OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
Emulex (James Smart) but any other attending vendors who want to pitch
in, send me an email and I'll add you.

James



^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 15:33                               ` James Bottomley
@ 2011-03-30 15:46                                 ` Shyam_Iyer
  2011-03-30 20:32                                 ` Giridhar Malavali
  1 sibling, 0 replies; 43+ messages in thread
From: Shyam_Iyer @ 2011-03-30 15:46 UTC (permalink / raw)
  To: James.Bottomley, hare; +Cc: linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler



> -----Original Message-----
> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Sent: Wednesday, March 30, 2011 11:34 AM
> To: Hannes Reinecke
> Cc: Iyer, Shyam; linux-scsi@vger.kernel.org; lsf@lists.linux-
> foundation.org; dm-devel@redhat.com; linux-fsdevel@vger.kernel.org;
> rwheeler@redhat.com
> Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF
> 
> On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote:
> > On 03/30/2011 04:26 PM, James Bottomley wrote:
> > > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
> > >> On 03/30/2011 04:02 PM, James Bottomley wrote:
> > >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
> > >>>> No, seriously. I think it would be good to have a separate slot
> > >>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
> > >>>> And how to best align these things.
> > >>>
> > >>> OK, I'll go for that ... Data Centre Bridging; experiences,
> technologies
> > >>> and needs ... something like that.  What about virtualisation and
> open
> > >>> vSwitch?
> > >>>
> > >> Hmm. Not qualified enough to talk about the latter; I was more
> > >> envisioning the storage-related aspects here (multiqueue mapping,
> > >> QoS classes etc). With virtualisation and open vSwitch we're more
> in
> > >> the network side of things; doubt open vSwitch can do DCB.
> > >> And even if it could, virtio certainly can't :-)
> > >
> > > Technically, the topic DCB is about Data Centre Ethernet
> enhancements
> > > and converged networks ... that's why it's naturally allied to
> virtual
> > > switching.
> > >
> > > I was thinking we might put up a panel of vendors to get us all an
> > > education on the topic ...
> > >
> > Oh, but gladly.
> > Didn't know we had some at the LSF.
> 
> OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
> Emulex (James Smart) but any other attending vendors who want to pitch
> in, send me an email and I'll add you.
> 
> James
> 
Excellent.
I would probably volunteer Giridhar(Qlogic) as well looking at the list of attendees as some of the CNA implementations vary.. 

-Shyam


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 15:33                               ` James Bottomley
  2011-03-30 15:46                                 ` Shyam_Iyer
@ 2011-03-30 20:32                                 ` Giridhar Malavali
  2011-03-30 20:45                                   ` James Bottomley
  1 sibling, 1 reply; 43+ messages in thread
From: Giridhar Malavali @ 2011-03-30 20:32 UTC (permalink / raw)
  To: James Bottomley, Hannes Reinecke
  Cc: Shyam_Iyer@dell.com, linux-scsi@vger.kernel.org,
	lsf@lists.linux-foundation.org, dm-devel@redhat.com,
	linux-fsdevel@vger.kernel.org, rwheeler@redhat.com



>>

>On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote:
>> On 03/30/2011 04:26 PM, James Bottomley wrote:
>> > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote:
>> >> On 03/30/2011 04:02 PM, James Bottomley wrote:
>> >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote:
>> >>>> No, seriously. I think it would be good to have a separate slot
>> >>>> discussing DCB (be it FCoE or iSCSI) and cgroups.
>> >>>> And how to best align these things.
>> >>>
>> >>> OK, I'll go for that ... Data Centre Bridging; experiences,
>>technologies
>> >>> and needs ... something like that.  What about virtualisation and
>>open
>> >>> vSwitch?
>> >>>
>> >> Hmm. Not qualified enough to talk about the latter; I was more
>> >> envisioning the storage-related aspects here (multiqueue mapping,
>> >> QoS classes etc). With virtualisation and open vSwitch we're more in
>> >> the network side of things; doubt open vSwitch can do DCB.
>> >> And even if it could, virtio certainly can't :-)
>> >
>> > Technically, the topic DCB is about Data Centre Ethernet enhancements
>> > and converged networks ... that's why it's naturally allied to virtual
>> > switching.
>> >
>> > I was thinking we might put up a panel of vendors to get us all an
>> > education on the topic ...
>> >
>> Oh, but gladly.
>> Didn't know we had some at the LSF.
>
>OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
>Emulex (James Smart) but any other attending vendors who want to pitch
>in, send me an email and I'll add you.

Can u please add me for this.

-- Giridhar



>
>James
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 20:32                                 ` Giridhar Malavali
@ 2011-03-30 20:45                                   ` James Bottomley
  0 siblings, 0 replies; 43+ messages in thread
From: James Bottomley @ 2011-03-30 20:45 UTC (permalink / raw)
  To: Giridhar Malavali
  Cc: Hannes Reinecke, Shyam_Iyer@dell.com, linux-scsi@vger.kernel.org,
	lsf@lists.linux-foundation.org, dm-devel@redhat.com,
	linux-fsdevel@vger.kernel.org, rwheeler@redhat.com

On Wed, 2011-03-30 at 13:32 -0700, Giridhar Malavali wrote:
> >OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and
> >Emulex (James Smart) but any other attending vendors who want to pitch
> >in, send me an email and I'll add you.
> 
> Can u please add me for this.

I already did.  (The agenda web actually updates about 5 minutes behind
the driving spreadsheet).

James



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30  2:17     ` Dave Chinner
  2011-03-30 11:13       ` Theodore Tso
@ 2011-03-30 21:49       ` Mingming Cao
  2011-03-31  0:05         ` Matthew Wilcox
  2011-03-31  1:00         ` Joel Becker
  1 sibling, 2 replies; 43+ messages in thread
From: Mingming Cao @ 2011-03-30 21:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi@vger.kernel.org, device-mapper development

On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote:
> On Tue, Mar 29, 2011 at 05:33:30PM -0700, Mingming Cao wrote:
> > Ric,
> > 
> > May I propose some discussion about concurrent direct IO support for
> > ext4?
> 
> Just look at the way XFS does it and copy that?  i.e. it has a
> filesytem level IO lock and an inode lock both with shared/exclusive
> semantics. These lie below the i_mutex (i.e. locking order is
> i_mutex, i_iolock, i_ilock), and effectively result in the i_mutex
> only being used for VFS level synchronisation and as such is rarely
> used inside XFS itself.
> 
> Inode attribute operations are protected by the inode lock, while IO
> operations and truncation synchronisation is provided by the IO
> lock.
> 

Right, inode attribute operations should be covered by the i_lock. in
ext4 the i_mutex is used to protect IO and truncation synch, along with
the i_datasem to pretect concurrent access.modification to file's
allocation. 

> So for buffered IO, the IO lock is used in shared mode for reads
> and exclusive mode for writes. This gives normal POSIX buffered IO
> semantics and holding the IO lock exclusive allows sycnhronisation
> against new IO of any kind for truncate.
> 
> For direct IO, the IO lock is always taken in shared mode, so we can
> have concurrent read and write operations taking place at once
> regardless of the offset into the file.
> 

thanks for reminding me,in xfs concurrent direct IO write to the same
offset is allowed.

> > I am looking for some discussion about removing the i_mutex lock in the
> > direct IO write code path for ext4, when multiple threads are
> > direct write to different offset of the same file. This would require
> > some way to track the in-fly DIO IO range, either done at ext4 level or
> > above th vfs layer. 
> 
> Direct IO semantics have always been that the application is allowed
> to overlap IO to the same range if it wants to. The result is
> undefined (just like issuing overlapping reads and writes to a disk
> at the same time) so it's the application's responsibility to avoid
> overlapping IO if it is a problem.
> 


I was thinking along the line to provide finer granularity lock to allow
concurrent direct IO to different offset/range, but to same offset, they
have to be serialized. If it's undefined behavior, i.e. overlapping is
allowed, then concurrent dio implementation is much easier. But not sure
if any apps currently using DIO aware of the ordering has to be done at
the application level. 

> Cheers,
> 
> Dave.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 21:49       ` Mingming Cao
@ 2011-03-31  0:05         ` Matthew Wilcox
  2011-03-31  1:00         ` Joel Becker
  1 sibling, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2011-03-31  0:05 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi@vger.kernel.org, device-mapper development

On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote:
> > Direct IO semantics have always been that the application is allowed
> > to overlap IO to the same range if it wants to. The result is
> > undefined (just like issuing overlapping reads and writes to a disk
> > at the same time) so it's the application's responsibility to avoid
> > overlapping IO if it is a problem.
> 
> I was thinking along the line to provide finer granularity lock to allow
> concurrent direct IO to different offset/range, but to same offset, they
> have to be serialized. If it's undefined behavior, i.e. overlapping is
> allowed, then concurrent dio implementation is much easier. But not sure
> if any apps currently using DIO aware of the ordering has to be done at
> the application level. 

Yes, they're aware of it.  And they consider it a bug if they ever do
concurrent I/O to the same sector.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 21:49       ` Mingming Cao
  2011-03-31  0:05         ` Matthew Wilcox
@ 2011-03-31  1:00         ` Joel Becker
  2011-04-01 21:34           ` Mingming Cao
  1 sibling, 1 reply; 43+ messages in thread
From: Joel Becker @ 2011-03-31  1:00 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi@vger.kernel.org, device-mapper development

On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote:
> On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote:
> > For direct IO, the IO lock is always taken in shared mode, so we can
> > have concurrent read and write operations taking place at once
> > regardless of the offset into the file.
> > 
> 
> thanks for reminding me,in xfs concurrent direct IO write to the same
> offset is allowed.

	ocfs2 as well, with the same sort of strategem (including across
the cluster).

> > Direct IO semantics have always been that the application is allowed
> > to overlap IO to the same range if it wants to. The result is
> > undefined (just like issuing overlapping reads and writes to a disk
> > at the same time) so it's the application's responsibility to avoid
> > overlapping IO if it is a problem.
> > 
> 
> I was thinking along the line to provide finer granularity lock to allow
> concurrent direct IO to different offset/range, but to same offset, they
> have to be serialized. If it's undefined behavior, i.e. overlapping is
> allowed, then concurrent dio implementation is much easier. But not sure
> if any apps currently using DIO aware of the ordering has to be done at
> the application level. 

	Oh dear God no.  One of the major DIO use cases is to tell the
kernel, "I know I won't do that, so don't spend any effort protecting
me."

Joel

-- 

"I don't want to achieve immortality through my work; I want to
 achieve immortality through not dying."
        - Woody Allen

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-30 11:28         ` Ric Wheeler
  2011-03-30 14:07           ` Chris Mason
@ 2011-04-01 15:19           ` Ted Ts'o
  2011-04-01 16:30             ` Amir Goldstein
  2011-04-01 21:43             ` Joel Becker
  1 sibling, 2 replies; 43+ messages in thread
From: Ted Ts'o @ 2011-04-01 15:19 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Dave Chinner, lsf, linux-scsi@vger.kernel.org, James Bottomley,
	device-mapper development, linux-fsdevel, Ric Wheeler

On Wed, Mar 30, 2011 at 07:28:34AM -0400, Ric Wheeler wrote:
> 
> What possible semantics could you have?
> 
> If you ever write concurrently from multiple processes without
> locking, you clearly are at the mercy of the scheduler and the
> underlying storage which could fragment a single write into multiple
> IO's sent to the backend device.
> 
> I would agree with Dave, let's not make it overly complicated or try
> to give people "atomic" unbounded size writes just because they set
> the O_DIRECT flag :)

I just want to have it written down.  After getting burned with ext3's
semantics promising more than what the standard guaranteed, I've just
gotten paranoid about application programmers getting upset when
things change on them --- and in the case of direct I/O, this stuff
isn't even clearly documented anywhere official.

I just think it's best that we document it the fact that concurrent
DIO's to the same region may result in completely arbitrary behaviour,
make sure it's well publicized to likely users (and I'm more worried
about the open source code bases than Oracle DB), and then call it a day.

The closest place that we have to any official documentation about
O_DIRECT semantics is the open(2) man page in the Linux manpages, and
it doesn't say anything about this.  It does give a recommendation
against not mixing buffered and O_DIRECT accesses to the same file,
but it does promise that things will work in that case.  (Even if it
does, do we really want to make the promise that it will always work?)

In any case, adding some text in that paragraph, or just after that
paragraph, to the effect that two concurrent DIO accesses to the same
file block, even by two different processes will result in undefined
behavior would be a good start.

      	       	      	   	      	     - Ted

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 15:19           ` Ted Ts'o
@ 2011-04-01 16:30             ` Amir Goldstein
  2011-04-01 21:46               ` Joel Becker
  2011-04-01 21:43             ` Joel Becker
  1 sibling, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2011-04-01 16:30 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ric Wheeler, Dave Chinner, lsf, linux-scsi@vger.kernel.org,
	James Bottomley, device-mapper development, linux-fsdevel,
	Ric Wheeler, Yongqiang Yang

On Fri, Apr 1, 2011 at 8:19 AM, Ted Ts'o <tytso@mit.edu> wrote:
> On Wed, Mar 30, 2011 at 07:28:34AM -0400, Ric Wheeler wrote:
>>
>> What possible semantics could you have?
>>
>> If you ever write concurrently from multiple processes without
>> locking, you clearly are at the mercy of the scheduler and the
>> underlying storage which could fragment a single write into multiple
>> IO's sent to the backend device.
>>
>> I would agree with Dave, let's not make it overly complicated or try
>> to give people "atomic" unbounded size writes just because they set
>> the O_DIRECT flag :)
>
> I just want to have it written down.  After getting burned with ext3's
> semantics promising more than what the standard guaranteed, I've just
> gotten paranoid about application programmers getting upset when
> things change on them --- and in the case of direct I/O, this stuff
> isn't even clearly documented anywhere official.
>
> I just think it's best that we document it the fact that concurrent
> DIO's to the same region may result in completely arbitrary behaviour,
> make sure it's well publicized to likely users (and I'm more worried
> about the open source code bases than Oracle DB), and then call it a day.
>
> The closest place that we have to any official documentation about
> O_DIRECT semantics is the open(2) man page in the Linux manpages, and
> it doesn't say anything about this.  It does give a recommendation
> against not mixing buffered and O_DIRECT accesses to the same file,
> but it does promise that things will work in that case.  (Even if it
> does, do we really want to make the promise that it will always work?)

when writing DIO to indirect mapped file holes, we fall back to buffered write
(so we won't expose stale data in the case of a crash) concurrent DIO reads
to that file (before data writeback) can expose stale data. right?
do you consider this case mixing buffered and DIO access?
do you consider that as a problem?

the case interests me because I am afraid we may have to use the fallback
trick for extent move on write from DIO (we did so in current
implementation anyway).

of course, if we end up implementing in-memory extent tree, we will probably be
able to cope with DIO MOW without fallback to buffered IO.

>
> In any case, adding some text in that paragraph, or just after that
> paragraph, to the effect that two concurrent DIO accesses to the same
> file block, even by two different processes will result in undefined
> behavior would be a good start.
>
>                                             - Ted
> _______________________________________________
> Lsf mailing list
> Lsf@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/lsf
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-03-31  1:00         ` Joel Becker
@ 2011-04-01 21:34           ` Mingming Cao
  2011-04-01 21:49             ` Joel Becker
  0 siblings, 1 reply; 43+ messages in thread
From: Mingming Cao @ 2011-04-01 21:34 UTC (permalink / raw)
  To: Joel Becker
  Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi@vger.kernel.org, device-mapper development

On Wed, 2011-03-30 at 18:00 -0700, Joel Becker wrote: 
> On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote:
> > On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote:
> > > For direct IO, the IO lock is always taken in shared mode, so we can
> > > have concurrent read and write operations taking place at once
> > > regardless of the offset into the file.
> > > 
> > 
> > thanks for reminding me,in xfs concurrent direct IO write to the same
> > offset is allowed.
> 
> 	ocfs2 as well, with the same sort of strategem (including across
> the cluster).
> 
Thanks for providing view from OCFS2 side. This is good to know.

> > > Direct IO semantics have always been that the application is allowed
> > > to overlap IO to the same range if it wants to. The result is
> > > undefined (just like issuing overlapping reads and writes to a disk
> > > at the same time) so it's the application's responsibility to avoid
> > > overlapping IO if it is a problem.
> > > 
> > 
> > I was thinking along the line to provide finer granularity lock to allow
> > concurrent direct IO to different offset/range, but to same offset, they
> > have to be serialized. If it's undefined behavior, i.e. overlapping is
> > allowed, then concurrent dio implementation is much easier. But not sure
> > if any apps currently using DIO aware of the ordering has to be done at
> > the application level. 
> 
> 	Oh dear God no.  One of the major DIO use cases is to tell the
> kernel, "I know I won't do that, so don't spend any effort protecting
> me."
> 
> Joel
> 

Looks like so -

So I think we could have a mode to turn on/off concurrent dio if the non
heavy duty applications relies on filesystem to take care of the
serialization.

Mingming




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 15:19           ` Ted Ts'o
  2011-04-01 16:30             ` Amir Goldstein
@ 2011-04-01 21:43             ` Joel Becker
  1 sibling, 0 replies; 43+ messages in thread
From: Joel Becker @ 2011-04-01 21:43 UTC (permalink / raw)
  To: Ted Ts'o, Ric Wheeler, Dave Chinner, lsf,
	linux-scsi@vger.kernel.org

On Fri, Apr 01, 2011 at 11:19:07AM -0400, Ted Ts'o wrote:
> The closest place that we have to any official documentation about
> O_DIRECT semantics is the open(2) man page in the Linux manpages, and
> it doesn't say anything about this.  It does give a recommendation
> against not mixing buffered and O_DIRECT accesses to the same file,
> but it does promise that things will work in that case.  (Even if it
> does, do we really want to make the promise that it will always work?)

	No, we do not.  Some OSes will silently turn buffered I/O into
direct I/O if another file already has it opened O_DIRECT.  Some OSes
will fail the write, or the open, or both, if it doesn't match the mode
of an existing fd.  Some just leave O_DIRECT and buffered access
inconsistent.
	I think that Linux should strive to make the mixed
buffered/direct case work; it's the nicest thing we can do.  But we
should not promise it.

Joel

-- 

Life's Little Instruction Book #24

	"Drink champagne for no reason at all."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 16:30             ` Amir Goldstein
@ 2011-04-01 21:46               ` Joel Becker
  2011-04-02  3:26                 ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Joel Becker @ 2011-04-01 21:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Tso, Ric Wheeler, Dave Chinner, lsf,
	linux-scsi@vger.kernel.org, James Bottomley,
	device-mapper development, linux-fsdevel, Ric Wheeler,
	Yongqiang Yang

On Fri, Apr 01, 2011 at 09:30:04AM -0700, Amir Goldstein wrote:
> when writing DIO to indirect mapped file holes, we fall back to buffered write
> (so we won't expose stale data in the case of a crash) concurrent DIO reads
> to that file (before data writeback) can expose stale data. right?
> do you consider this case mixing buffered and DIO access?
> do you consider that as a problem?

	I do not consider this 'mixing', nor do I consider it a problem.
ocfs2 does exactly this for holes, unwritten extents, and CoW.  It does
not violate the user's expectation that the data will be on disk when
the write(2) returns.
	Falling back to buffered on read(2) is a different story; the
caller wants the current state of the disk block, not five minutes ago.
So we can't do that.  But we also don't need to.
	O_DIRECT users that are worried about any possible space usage in
the page cache have already pre-allocated their disk blocks and don't
get here.

Joel

-- 

"Under capitalism, man exploits man.  Under Communism, it's just 
   the opposite."
				 - John Kenneth Galbraith

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 21:34           ` Mingming Cao
@ 2011-04-01 21:49             ` Joel Becker
  0 siblings, 0 replies; 43+ messages in thread
From: Joel Becker @ 2011-04-01 21:49 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel,
	linux-scsi@vger.kernel.org, device-mapper development

On Fri, Apr 01, 2011 at 02:34:26PM -0700, Mingming Cao wrote:
> > > I was thinking along the line to provide finer granularity lock to allow
> > > concurrent direct IO to different offset/range, but to same offset, they
> > > have to be serialized. If it's undefined behavior, i.e. overlapping is
> > > allowed, then concurrent dio implementation is much easier. But not sure
> > > if any apps currently using DIO aware of the ordering has to be done at
> > > the application level. 
> > 
> > 	Oh dear God no.  One of the major DIO use cases is to tell the
> > kernel, "I know I won't do that, so don't spend any effort protecting
> > me."
> > 
> > Joel
> > 
> 
> Looks like so -
> 
> So I think we could have a mode to turn on/off concurrent dio if the non
> heavy duty applications relies on filesystem to take care of the
> serialization.

	I would prefer to leave this complexity out.  If you must have
it, unsafe, concurrent DIO must be the default.  Let the people who
really want it turn on serialized DIO.

Joel

-- 

"Get right to the heart of matters.
 It's the heart that matters more."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf] Preliminary Agenda and Activities for LSF
  2011-04-01 21:46               ` Joel Becker
@ 2011-04-02  3:26                 ` Amir Goldstein
  0 siblings, 0 replies; 43+ messages in thread
From: Amir Goldstein @ 2011-04-02  3:26 UTC (permalink / raw)
  To: Joel Becker
  Cc: Theodore Tso, Ric Wheeler, Dave Chinner, lsf,
	linux-scsi@vger.kernel.org, James Bottomley,
	device-mapper development, linux-fsdevel, Ric Wheeler,
	Yongqiang Yang

On Fri, Apr 1, 2011 at 2:46 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Fri, Apr 01, 2011 at 09:30:04AM -0700, Amir Goldstein wrote:
>> when writing DIO to indirect mapped file holes, we fall back to buffered write
>> (so we won't expose stale data in the case of a crash) concurrent DIO reads
>> to that file (before data writeback) can expose stale data. right?
>> do you consider this case mixing buffered and DIO access?
>> do you consider that as a problem?
>
>        I do not consider this 'mixing', nor do I consider it a problem.
> ocfs2 does exactly this for holes, unwritten extents, and CoW.  It does
> not violate the user's expectation that the data will be on disk when
> the write(2) returns.
>        Falling back to buffered on read(2) is a different story; the
> caller wants the current state of the disk block, not five minutes ago.
> So we can't do that.  But we also don't need to.

the issue is with DIO read exposing uninitialized data on disk
is a security issue.
it's not about giving the read what is expects to see.

>        O_DIRECT users that are worried about any possible space usage in
> the page cache have already pre-allocated their disk blocks and don't
> get here.
>
> Joel
>
> --
>
> "Under capitalism, man exploits man.  Under Communism, it's just
>   the opposite."
>                                 - John Kenneth Galbraith
>
>                        http://www.jlbec.org/
>                        jlbec@evilplan.org
>
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2011-04-02  3:26 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1301373398.2590.20.camel@mulgrave.site>
2011-03-29 11:16 ` [Lsf] Preliminary Agenda and Activities for LSF Ric Wheeler
2011-03-29 11:22   ` Matthew Wilcox
2011-03-29 12:17     ` Jens Axboe
2011-03-29 13:09       ` Martin K. Petersen
2011-03-29 13:12         ` Ric Wheeler
2011-03-29 13:38         ` James Bottomley
2011-03-29 17:20   ` Shyam_Iyer
2011-03-29 17:33     ` Vivek Goyal
2011-03-29 18:10       ` Shyam_Iyer
2011-03-29 18:45         ` Vivek Goyal
2011-03-29 19:13           ` Shyam_Iyer
2011-03-29 19:57             ` Vivek Goyal
2011-03-29 19:59             ` Mike Snitzer
2011-03-29 20:12               ` Shyam_Iyer
2011-03-29 20:23                 ` Mike Snitzer
2011-03-29 23:09                   ` Shyam_Iyer
2011-03-30  5:58                     ` [Lsf] " Hannes Reinecke
2011-03-30 14:02                       ` James Bottomley
2011-03-30 14:10                         ` Hannes Reinecke
2011-03-30 14:26                           ` James Bottomley
2011-03-30 14:55                             ` Hannes Reinecke
2011-03-30 15:33                               ` James Bottomley
2011-03-30 15:46                                 ` Shyam_Iyer
2011-03-30 20:32                                 ` Giridhar Malavali
2011-03-30 20:45                                   ` James Bottomley
2011-03-29 19:47   ` Nicholas A. Bellinger
2011-03-29 20:29   ` Jan Kara
2011-03-29 20:31     ` Ric Wheeler
2011-03-30  0:33   ` Mingming Cao
2011-03-30  2:17     ` Dave Chinner
2011-03-30 11:13       ` Theodore Tso
2011-03-30 11:28         ` Ric Wheeler
2011-03-30 14:07           ` Chris Mason
2011-04-01 15:19           ` Ted Ts'o
2011-04-01 16:30             ` Amir Goldstein
2011-04-01 21:46               ` Joel Becker
2011-04-02  3:26                 ` Amir Goldstein
2011-04-01 21:43             ` Joel Becker
2011-03-30 21:49       ` Mingming Cao
2011-03-31  0:05         ` Matthew Wilcox
2011-03-31  1:00         ` Joel Becker
2011-04-01 21:34           ` Mingming Cao
2011-04-01 21:49             ` Joel Becker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).