Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 9/24] make atomic_read() behave consistently on ia64
From: Geert Uytterhoeven @ 2007-08-13  7:39 UTC (permalink / raw)
  To: Chris Snook
  Cc: Paul Mackerras, Linus Torvalds, Luck, Tony, Andreas Schwab,
	linux-kernel, linux-arch, netdev, akpm, ak, heiko.carstens, davem,
	schwidefsky, wensong, horms, wjiang, cfriesen, zlynx, rpjday,
	jesper.juhl
In-Reply-To: <46BFFA71.4080601@redhat.com>

On Mon, 13 Aug 2007, Chris Snook wrote:
> Paul Mackerras wrote:
> > Chris Snook writes:
> > > I'll do this for the whole patchset.  Stay tuned for the resubmit.
> > 
> > Could you incorporate Segher's patch to turn atomic_{read,set} into
> > asm on powerpc?  Segher claims that using asm is really the only
> > reliable way to ensure that gcc does what we want, and he seems to
> > have a point.
> > 
> > Paul.
> 
> I haven't seen a patch yet.  I'm going to resubmit with inline volatile-cast
> atomic[64]_[read|set] on all architectures as a reference point, and if anyone
> wants to go and implement some of them in assembly, that's between them and
> the relevant arch maintainers.  I have no problem with (someone else) doing it
> in assembly.  I just don't think it's necessary and won't let it hold up the
> effort to get consistent behavior on all architectures.

http://lkml.org/lkml/2007/8/10/470

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply

* Re: Distributed storage.
From: Jens Axboe @ 2007-08-13  9:13 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Evgeniy Polyakov, netdev, linux-kernel, linux-fsdevel,
	Peter Zijlstra
In-Reply-To: <200708130208.57542.phillips@phunq.net>

On Mon, Aug 13 2007, Daniel Phillips wrote:
> On Monday 13 August 2007 00:45, Jens Axboe wrote:
> > On Mon, Aug 13 2007, Jens Axboe wrote:
> > > > You did not comment on the one about putting the bio destructor
> > > > in the ->endio handler, which looks dead simple.  The majority of
> > > > cases just use the default endio handler and the default
> > > > destructor.  Of the remaining cases, where a specialized
> > > > destructor is needed, typically a specialized endio handler is
> > > > too, so combining is free.  There are few if any cases where a
> > > > new specialized endio handler would need to be written.
> > >
> > > We could do that without too much work, I agree.
> >
> > But that idea fails as well, since reference counts and IO completion
> > are two completely seperate entities. So unless end IO just happens
> > to be the last user holding a reference to the bio, you cannot free
> > it.
> 
> That is not a problem.  When bio_put hits zero it calls ->endio instead 
> of the destructor.  The ->endio sees that the count is zero and 
> destroys the bio.

You can't be serious? You'd stall end io completion notification because
someone holds a reference to a bio. Surely you jest.

Needless to say, that will never go in.

-- 
Jens Axboe


^ permalink raw reply

* Re: Distributed storage.
From: Jens Axboe @ 2007-08-13  9:12 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Evgeniy Polyakov, netdev, linux-kernel, linux-fsdevel,
	Peter Zijlstra
In-Reply-To: <200708130159.22407.phillips@phunq.net>

On Mon, Aug 13 2007, Daniel Phillips wrote:
> On Monday 13 August 2007 00:28, Jens Axboe wrote:
> > On Sun, Aug 12 2007, Daniel Phillips wrote:
> > > Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can
> > > derive efficiently from BIO_POOL_IDX() provided the bio was
> > > allocated in the standard way.
> >
> > That would only be feasible, if we ruled that any bio in the system
> > must originate from the standard pools.
> 
> Not at all.
> 
> > > This leaves a little bit of clean up to do for bios not allocated
> > > from a standard pool.
> >
> > Please suggest how to do such a cleanup.
> 
> Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a 
> bio from the standard pool.  Just put the power of two size in the bits 
> and map that number to the standard pool arrangement with a table 
> lookup.

So reserve a bit that tells you how to interpret the (now) 3 remaining
bits. Doesn't sound very pretty, does it?

> > > On the other hand, vm writeout deadlock ranks smack dab at the top
> > > of the list, so that is where the patching effort must go for the
> > > forseeable future.  Without bio throttling, the ddsnap load can go
> > > to 24 MB for struct bio alone.  That definitely moves the needle. 
> > > in short, we save 3,200 times more memory by putting decent
> > > throttling in place than by saving an int in struct bio.
> >
> > Then fix the damn vm writeout. I always thought it was silly to
> > depend on the block layer for any sort of throttling. If it's not a
> > system wide problem, then throttle the io count in the
> > make_request_fn handler of that problematic driver.
> 
> It is a system wide problem.  Every block device needs throttling, 
> otherwise queues expand without limit.  Currently, block devices that 
> use the standard request library get a slipshod form of throttling for 
> free in the form of limiting in-flight request structs.  Because the 
> amount of IO carried by a single request can vary by two orders of 
> magnitude, the system behavior of this approach is far from 
> predictable.

Is it? Consider just 10 standard sata disks. The next kernel revision
will have sg chaining support, so that allows 32MiB per request. Even if
we disregard reads (not so interesting in this discussion) and just look
at potentially pinned dirty data in a single queue, that number comes to
4GiB PER disk. Or 40GiB for 10 disks. Auch.

So I still think that this throttling needs to happen elsewhere, you
cannot rely the block layer throttling globally or for a single device.
It just doesn't make sense.

> > > You did not comment on the one about putting the bio destructor in
> > > the ->endio handler, which looks dead simple.  The majority of
> > > cases just use the default endio handler and the default
> > > destructor.  Of the remaining cases, where a specialized destructor
> > > is needed, typically a specialized endio handler is too, so
> > > combining is free.  There are few if any cases where a new
> > > specialized endio handler would need to be written.
> >
> > We could do that without too much work, I agree.
> 
> OK, we got one and another is close to cracking, enough of that.

No we did not, I already failed this one in the next mail.

> > > As far as code stability goes, current kernels are horribly
> > > unstable in a variety of contexts because of memory deadlock and
> > > slowdowns related to the attempt to fix the problem via dirty
> > > memory limits.  Accurate throttling of bio traffic is one of the
> > > two key requirements to fix this instability, the other other is
> > > accurate writeout path reserve management, which is only partially
> > > addressed by BIO_POOL.
> >
> > Which, as written above and stated many times over the years on lkml,
> > is not a block layer issue imho.
> 
> Whoever stated that was wrong, but this should be no surprise.  There 
> have been many wrong things said about this particular bug over the 
> years.  The one thing that remains constant is, Linux continues to 
> deadlock under a variety of loads both with and without network 
> involvement, making it effectively useless as a storage platform.
> 
> These deadlocks are first and foremost, block layer deficiencies.  Even 
> the network becomes part of the problem only because it lies in the 
> block IO path.

The block layer has NEVER guaranteed throttling, so it can - by
definition - not be a block layer deficiency.

-- 
Jens Axboe


^ permalink raw reply

* Re: Distributed storage.
From: Daniel Phillips @ 2007-08-13  9:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Evgeniy Polyakov, netdev, linux-kernel, linux-fsdevel,
	Peter Zijlstra
In-Reply-To: <20070813074530.GE23758@kernel.dk>

On Monday 13 August 2007 00:45, Jens Axboe wrote:
> On Mon, Aug 13 2007, Jens Axboe wrote:
> > > You did not comment on the one about putting the bio destructor
> > > in the ->endio handler, which looks dead simple.  The majority of
> > > cases just use the default endio handler and the default
> > > destructor.  Of the remaining cases, where a specialized
> > > destructor is needed, typically a specialized endio handler is
> > > too, so combining is free.  There are few if any cases where a
> > > new specialized endio handler would need to be written.
> >
> > We could do that without too much work, I agree.
>
> But that idea fails as well, since reference counts and IO completion
> are two completely seperate entities. So unless end IO just happens
> to be the last user holding a reference to the bio, you cannot free
> it.

That is not a problem.  When bio_put hits zero it calls ->endio instead 
of the destructor.  The ->endio sees that the count is zero and 
destroys the bio.

Regards,

Daniel

^ permalink raw reply

* Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
From: Jan Engelhardt @ 2007-08-13  9:02 UTC (permalink / raw)
  To: david
  Cc: Paul Clements, Al Boldi, linux-kernel, linux-fsdevel, netdev,
	linux-raid
In-Reply-To: <Pine.LNX.4.64.0708122016100.22470@asgard.lang.hm>


On Aug 12 2007 20:21, david@lang.hm wrote:
>
> per the message below MD (or DM) would need to be modified to work
> reasonably well with one of the disk components being over an
> unreliable link (like a network link)

Does not dm-multipath do something like that?

> are the MD/DM maintainers interested in extending their code in this direction?
> or would they prefer to keep it simpler by being able to continue to assume
> that the raid components are connected over a highly reliable connection?

	Jan
-- 

^ permalink raw reply

* Re: Distributed storage.
From: Daniel Phillips @ 2007-08-13  8:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Evgeniy Polyakov, netdev, linux-kernel, linux-fsdevel,
	Peter Zijlstra
In-Reply-To: <20070813072848.GC23758@kernel.dk>

On Monday 13 August 2007 00:28, Jens Axboe wrote:
> On Sun, Aug 12 2007, Daniel Phillips wrote:
> > Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can
> > derive efficiently from BIO_POOL_IDX() provided the bio was
> > allocated in the standard way.
>
> That would only be feasible, if we ruled that any bio in the system
> must originate from the standard pools.

Not at all.

> > This leaves a little bit of clean up to do for bios not allocated
> > from a standard pool.
>
> Please suggest how to do such a cleanup.

Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a 
bio from the standard pool.  Just put the power of two size in the bits 
and map that number to the standard pool arrangement with a table 
lookup.

> > On the other hand, vm writeout deadlock ranks smack dab at the top
> > of the list, so that is where the patching effort must go for the
> > forseeable future.  Without bio throttling, the ddsnap load can go
> > to 24 MB for struct bio alone.  That definitely moves the needle. 
> > in short, we save 3,200 times more memory by putting decent
> > throttling in place than by saving an int in struct bio.
>
> Then fix the damn vm writeout. I always thought it was silly to
> depend on the block layer for any sort of throttling. If it's not a
> system wide problem, then throttle the io count in the
> make_request_fn handler of that problematic driver.

It is a system wide problem.  Every block device needs throttling, 
otherwise queues expand without limit.  Currently, block devices that 
use the standard request library get a slipshod form of throttling for 
free in the form of limiting in-flight request structs.  Because the 
amount of IO carried by a single request can vary by two orders of 
magnitude, the system behavior of this approach is far from 
predictable.

> > You did not comment on the one about putting the bio destructor in
> > the ->endio handler, which looks dead simple.  The majority of
> > cases just use the default endio handler and the default
> > destructor.  Of the remaining cases, where a specialized destructor
> > is needed, typically a specialized endio handler is too, so
> > combining is free.  There are few if any cases where a new
> > specialized endio handler would need to be written.
>
> We could do that without too much work, I agree.

OK, we got one and another is close to cracking, enough of that.

> > As far as code stability goes, current kernels are horribly
> > unstable in a variety of contexts because of memory deadlock and
> > slowdowns related to the attempt to fix the problem via dirty
> > memory limits.  Accurate throttling of bio traffic is one of the
> > two key requirements to fix this instability, the other other is
> > accurate writeout path reserve management, which is only partially
> > addressed by BIO_POOL.
>
> Which, as written above and stated many times over the years on lkml,
> is not a block layer issue imho.

Whoever stated that was wrong, but this should be no surprise.  There 
have been many wrong things said about this particular bug over the 
years.  The one thing that remains constant is, Linux continues to 
deadlock under a variety of loads both with and without network 
involvement, making it effectively useless as a storage platform.

These deadlocks are first and foremost, block layer deficiencies.  Even 
the network becomes part of the problem only because it lies in the 
block IO path.

Regards,

Daniel

^ permalink raw reply

* Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
From: david @ 2007-08-13  8:31 UTC (permalink / raw)
  To: David Greaves
  Cc: Paul Clements, Jan Engelhardt, Al Boldi, linux-kernel,
	linux-fsdevel, netdev, linux-raid
In-Reply-To: <46C01052.4050708@dgreaves.com>

On Mon, 13 Aug 2007, David Greaves wrote:

> david@lang.hm wrote:
>>  per the message below MD (or DM) would need to be modified to work
>>  reasonably well with one of the disk components being over an unreliable
>>  link (like a network link)
>>
>>  are the MD/DM maintainers interested in extending their code in this
>>  direction? or would they prefer to keep it simpler by being able to
>>  continue to assume that the raid components are connected over a highly
>>  reliable connection?
>>
>>  if they are interested in adding (and maintaining) this functionality then
>>  there is a real possibility that NBD+MD/DM could eliminate the need for
>>  DRDB. however if they are not interested in adding all the code to deal
>>  with the network type issues, then the argument that DRDB should not be
>>  merged becouse you can do the same thing with MD/DM + NBD is invalid and
>>  can be dropped/ignored
>>
>>  David Lang
>
> As a user I'd like to see md/nbd be extended to cope with unreliable links.
> I think md could be better in handling link exceptions. My unreliable memory 
> recalls sporadic issues with hot-plug leaving md hanging and certain lower 
> level errors (or even very high latency) causing unsatisfactory behaviour in 
> what is supposed to be a fault 'tolerant' subsystem.
>
>
> Would this just be relevant to network devices or would it improve support 
> for jostled usb and sata hot-plugging I wonder?

good question, I suspect that some of the error handling would be similar 
(for devices that are unreachable not haning the system for example), but 
a lot of the rest would be different (do you really want to try to 
auto-resync to a drive that you _think_ just reappeared, what if it's a 
different drive? how can you be sure?) the error rate of a network is gong 
to be significantly higher then for USB or SATA drives (although I suppose 
iscsi would be limilar)

David Lang

^ permalink raw reply

* Re: Block device throttling [Re: Distributed storage.]
From: Evgeniy Polyakov @ 2007-08-13  8:23 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jens Axboe, netdev, linux-kernel, linux-fsdevel, Peter Zijlstra
In-Reply-To: <200708122236.24096.phillips@phunq.net>

On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips (phillips@phunq.net) wrote:
> (previous incomplete message sent accidentally)
> 
> On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> >
> > So, what did we decide? To bloat bio a bit (add a queue pointer) or
> > to use physical device limits? The latter requires to replace all
> > occurence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> > somthing_new), where queue limits will be appropriately charged. So
> > far I'm testing second case, but I only changed DST for testing, can
> > change all other users if needed though.
> 
> Adding a queue pointer to struct bio and using physical device limits as 
> in your posted patch both suffer from the same problem: you release the 
> throttling on the previous queue when the bio moves to a new one, which 
> is a bug because memory consumption on the previous queue then becomes 
> unbounded, or limited only by the number of struct requests that can be 
> allocated.  In other words, it reverts to the same situation we have 
> now as soon as the IO stack has more than one queue.  (Just a shorter 
> version of my previous post.)

No. Since all requests for virtual device end up in physical devices,
which have limits, this mechanism works. Virtual device will essentially 
call either generic_make_request() for new physical device (and thus
will sleep is limit is over), or will process bios directly, but in that
case it will sleep in generic_make_request() for virutal device.

> 1) One throttle count per submitted bio is too crude a measure.  A bio 
> can carry as few as one page or as many as 256 pages.  If you take only 

It does not matter - we can count bytes, pages, bio vectors or whatever
we like, its just a matter of counter and can be changed without
problem.

> 2) Exposing the per-block device throttle limits via sysfs or similar is 
> really not a good long term solution for system administration.  
> Imagine our help text: "just keep trying smaller numbers until your 
> system deadlocks".  We really need to figure this out internally and 
> get it correct.  I can see putting in a temporary userspace interface 
> just for experimentation, to help determine what really is safe, and 
> what size the numbers should be to approach optimal throughput in a 
> fully loaded memory state.

Well, we already have number of such 'supposed-to-be-automatic'
variables exported to userspace, so this will not change a picture,
frankly I do not care if there will or will not be any sysfs exported
tunable, eventually we can remove it or do not create at all.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [1/1] Block device throttling [Re: Distributed storage.]
From: Evgeniy Polyakov @ 2007-08-13  8:18 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jens Axboe, netdev, linux-kernel, linux-fsdevel, Peter Zijlstra
In-Reply-To: <200708121616.10670.phillips@phunq.net>

Hi Daniel.

On Sun, Aug 12, 2007 at 04:16:10PM -0700, Daniel Phillips (phillips@phunq.net) wrote:
> Your patch is close to the truth, but it needs to throttle at the top 
> (virtual) end of each block device stack instead of the bottom 
> (physical) end.  It does head in the direction of eliminating your own 
> deadlock risk indeed, however there are block devices it does not 
> cover.

I decided to limit physical devices just because any limit on top of
virtual one is not correct. When system recharges bio from virtual
device to physical, and the latter is full, virtual device will not
accept any new blocks for that physical device, but can accept for
another ones. That was created specially to allow fair use for network
and physical storages.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: Block device throttling [Re: Distributed storage.]
From: Evgeniy Polyakov @ 2007-08-13  8:14 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jens Axboe, netdev, linux-kernel, linux-fsdevel, Peter Zijlstra
In-Reply-To: <200708122344.00343.phillips@phunq.net>

On Sun, Aug 12, 2007 at 11:44:00PM -0700, Daniel Phillips (phillips@phunq.net) wrote:
> On Sunday 12 August 2007 22:36, I wrote:
> > Note!  There are two more issues I forgot to mention earlier.
> 
> Oops, and there is also:
> 
> 3) The bio throttle, which is supposed to prevent deadlock, can itself 
> deadlock.  Let me see if I can remember how it goes.
> 
>   * generic_make_request puts a bio in flight
>   * the bio gets past the throttle and initiates network IO
>   * net calls sk_alloc->alloc_pages->shrink_caches
>   * shrink_caches submits a bio recursively to our block device
>   * this bio blocks on the throttle
>   * net may never get the memory it needs, and we are wedged

If system is in such condition, it is already broken - throttle limit
must be lowered (next time) not to allow such situation.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
From: David Greaves @ 2007-08-13  8:03 UTC (permalink / raw)
  To: david
  Cc: Paul Clements, Jan Engelhardt, Al Boldi, linux-kernel,
	linux-fsdevel, netdev, linux-raid
In-Reply-To: <Pine.LNX.4.64.0708122016100.22470@asgard.lang.hm>

david@lang.hm wrote:
> per the message below MD (or DM) would need to be modified to work 
> reasonably well with one of the disk components being over an unreliable 
> link (like a network link)
> 
> are the MD/DM maintainers interested in extending their code in this 
> direction? or would they prefer to keep it simpler by being able to 
> continue to assume that the raid components are connected over a highly 
> reliable connection?
> 
> if they are interested in adding (and maintaining) this functionality 
> then there is a real possibility that NBD+MD/DM could eliminate the need 
> for DRDB. however if they are not interested in adding all the code to 
> deal with the network type issues, then the argument that DRDB should 
> not be merged becouse you can do the same thing with MD/DM + NBD is 
> invalid and can be dropped/ignored
> 
> David Lang

As a user I'd like to see md/nbd be extended to cope with unreliable links.
I think md could be better in handling link exceptions. My unreliable memory 
recalls sporadic issues with hot-plug leaving md hanging and certain lower level 
errors (or even very high latency) causing unsatisfactory behaviour in what is 
supposed to be a fault 'tolerant' subsystem.


Would this just be relevant to network devices or would it improve support for 
jostled usb and sata hot-plugging I wonder?

David


^ permalink raw reply

* [PATCH] e100 module loads 1/2 times
From: Willy Tarreau @ 2007-08-13  7:54 UTC (permalink / raw)
  To: uke Kok, Brandeburg, Jesse; +Cc: netdev

Hi Auke, Jesse,

for a long time, I've been annoyed by version 3.5.17 of the e100 driver
which refuses to load on first time and only loads on second time. Since
I always had the original 2.3.43 driver in kernel 2.4, I did not care
that much. Recently, I encountered real troubles with 2.3.43 in a 802.1q
setup (basically it did not untag incoming frames). So I decided to give
3.5.17 a second try. Same problem, I had to load it twice.

I finally found the problem in e100_init_module(). Up to 3.5.14, the
return of pci_module_init() was returned. This one equals zero if
everything went fine, <0 otherwise, which is compatible with init_module().

With 3.5.17, the result comes from pci_register_driver(), which returns
the number of devices registered. So the problem now makes sense :

- first call: the driver registers itself and returns non-zero, which
  is an error for insmod

- second call: the driver cannot register again and returns zero new
  drivers, which is good for insmod.

Note that e1000 has a related bug : it uses the return from pci_module_init()
to decide whether or not to register a reboot notifier. Fortunately, the
test is performed with ret>=0, which matches ==0 and not <0 (errors), so
this works as a side effect.

The obvious fix for e100 reusing pci_module_init is below. Also, since
e100-2.3.43 does not work at all with vlans in 2.4, I was thinking about
upgrading it to 3.5.17. It would also be the same version as in 2.6,
simplifying its long-term maintenance. What do you think about this ?

Best regards,
Willy


--- e100-3.5.17/src/e100.c.orig	2007-08-13 08:53:18 +0200
+++ e100-3.5.17/src/e100.c	2007-08-13 09:24:56 +0200
@@ -2934,13 +2934,13 @@
 		printk(KERN_INFO PFX "%s\n", DRV_COPYRIGHT);
 	}
 #ifdef E100_USE_REBOOT_NOTIFIER
-	retval = pci_register_driver(&e100_driver);
-	if (retval >= 0)
+	retval = pci_module_init(&e100_driver);
+	if (retval == 0)
 		register_reboot_notifier(&e100_notifier_reboot);
 
 	return retval;
 #else
-	return pci_register_driver(&e100_driver);
+	return pci_module_init(&e100_driver);
 #endif
 }
 



^ permalink raw reply

* Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
From: David Greaves @ 2007-08-13  7:51 UTC (permalink / raw)
  To: Paul Clements
  Cc: Jan Engelhardt, david, Al Boldi, linux-kernel, linux-fsdevel,
	netdev, linux-raid
In-Reply-To: <46BFB6BB.80406@steeleye.com>

Paul Clements wrote:
> Well, if people would like to see a timeout option, I actually coded up 
> a patch a couple of years ago to do just that, but I never got it into 
> mainline because you can do almost as well by doing a check at 
> user-level (I basically ping the nbd connection periodically and if it 
> fails, I kill -9 the nbd-client).


Yes please.

David


^ permalink raw reply

* Re: ATA over ethernet swapping
From: Peter Zijlstra @ 2007-08-13  7:45 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Ed L. Cashin, kernel list, ak, Netdev list
In-Reply-To: <20070809101105.GA7581@elf.ucw.cz>

On Thu, 2007-08-09 at 12:11 +0200, Pavel Machek wrote:
> Hi!
> 
> > I've been working on this for quite some time. And should post again
> > soon. Please see the patches:
> > 
> >   http://programming.kicks-ass.net/kernel-patches/vm_deadlock/current/
> > 
> > For now it requires one uses SLUB, I hope that SLAB will go away (will
> > save me the trouble of adding support) and I guess I ought to do SLOB
> > some time (if that does stay).
> > 
> > You'd need the first 22 patches of that series, and then call
> > sk_set_memalloc(sk) on the proper socket, and do some fiddling with the
> > reconnect logic. See nfs-swapfile.patch for examples.
> 
> What do you use for testing? I set up ata over ethernet... swapping
> over that should deadlock w/o your patches.
> 
> But I'm able to compile kernel (-j 10) on 128MB machine, and I tried
> cat /dev/zero | grep foo to exhaust memory... and could not reproduce
> the deadlock. Should I pingflood? Tweak down ammount of atomic memory
> avaialable to make deadlocks easier to reproduce?

I usually test swap over NFS in the following manner, I setup a regular
inet service on the machine (apache or a bunch of ncat sockets piping to
files or something) and run a heavy workload on the machine (128M):
2*64M file backed thrashers and 2*64M anonymous thrashers. Then I start
clients for the regular inet service, wait for a bit, and shut down the
NFS server.

This makes the machine grind to a halt, I then restart the NFS server,
wait for it to reconnect and the client to come alive again.

Without the last few swap-over-NFS patches this last bit - getting back
out of that situation - never happens.

The basic idea is to make connectivity to the machine where swap traffic
goes very hard (pull a cable, cleanly shut down the server) and to keep
other network traffic pounding the machine.


^ permalink raw reply

* Re: Distributed storage.
From: Jens Axboe @ 2007-08-13  7:45 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Evgeniy Polyakov, netdev, linux-kernel, linux-fsdevel,
	Peter Zijlstra
In-Reply-To: <20070813072848.GC23758@kernel.dk>

On Mon, Aug 13 2007, Jens Axboe wrote:
> > You did not comment on the one about putting the bio destructor in 
> > the ->endio handler, which looks dead simple.  The majority of cases 
> > just use the default endio handler and the default destructor.  Of the 
> > remaining cases, where a specialized destructor is needed, typically a 
> > specialized endio handler is too, so combining is free.  There are few 
> > if any cases where a new specialized endio handler would need to be 
> > written.
> 
> We could do that without too much work, I agree.

But that idea fails as well, since reference counts and IO completion
are two completely seperate entities. So unless end IO just happens to
be the last user holding a reference to the bio, you cannot free it.

-- 
Jens Axboe


^ permalink raw reply

* Re: Distributed storage.
From: Jens Axboe @ 2007-08-13  7:28 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Evgeniy Polyakov, netdev, linux-kernel, linux-fsdevel,
	Peter Zijlstra
In-Reply-To: <200708121637.00144.phillips@phunq.net>

On Sun, Aug 12 2007, Daniel Phillips wrote:
> On Tuesday 07 August 2007 13:55, Jens Axboe wrote:
> > I don't like structure bloat, but I do like nice design. Overloading
> > is a necessary evil sometimes, though. Even today, there isn't enough
> > room to hold bi_rw and bi_flags in the same variable on 32-bit archs,
> > so that concern can be scratched. If you read bio.h, that much is
> > obvious.
> 
> Sixteen bits in bi_rw are consumed by queue priority.  Is there a reason 
> this lives in struct bio instead of struct request?

If you don't, you have to pass them down. You can make that very
statement about basically any member of struct bio, until we end up with
a submit_bio() path and down taking 16 arguments.

> > If you check up on the iommu virtual merging, you'll understand the
> > front and back size members. They may smell dubious to you, but
> > please take the time to understand why it looks the way it does.
> 
> Virtual merging is only needed at the physical device, so why do these 
> fields live in struct bio instead of struct request?

A bio does exist outside of a struct request, and bio buildup also
happens before it gets attached to such.

> > Changing the number of bvecs is integral to how bio buildup current
> > works.
> 
> Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can 
> derive efficiently from BIO_POOL_IDX() provided the bio was allocated 
> in the standard way.

That would only be feasible, if we ruled that any bio in the system must
originate from the standard pools.

> This leaves a little bit of clean up to do for bios not allocated from
> a standard pool.

Please suggest how to do such a cleanup.

> Incidentally, why does the bvl need to be memset to zero on allocation?  
> bi_vcnt already tells you which bvecs are valid and the only field in a 
> bvec that can reasonably default to zero is the offset, which ought to 
> be set set every time a bvec is initialized anyway.

We could probably skip that, but that's an unrelated subject.

> > > bi_destructor could be combined.  I don't see a lot of users of
> > > bi_idx,
> >
> > bi_idx is integral to partial io completions.
> 
> Struct request has a remaining submission sector count so what does 
> bi_idx do that is different?

Struct request has remaining IO count. You still need to know where to
start in the bio.

> > > that looks like a soft target.  See what happened to struct page
> > > when a couple of folks got serious about attacking it, some really
> > > deep hacks were done to pare off a few bytes here and there.  But
> > > struct bio as a space waster is not nearly in the same ballpark.
> >
> > So show some concrete patches and examples, hand waving and
> > assumptions is just a waste of everyones time.
> 
> Average struct bio memory footprint ranks near the bottom of the list of 
> things that suck most about Linux storage.  At idle I see 8K in use 
> (reserves); during updatedb it spikes occasionally to 50K; under a 
> heavy  load generated by ddsnap on a storage box it sometimes goes to 
> 100K with bio throttling in place.  Really not moving the needle.

Then, again, stop wasting time on this subject. Just because struct bio
isn't a huge bloat is absolutely no justification for adding extra
members to it. It's not just about system wide bloat.

> On the other hand, vm writeout deadlock ranks smack dab at the top of 
> the list, so that is where the patching effort must go for the 
> forseeable future.  Without bio throttling, the ddsnap load can go to 
> 24 MB for struct bio alone.  That definitely moves the needle.  in 
> short, we save 3,200 times more memory by putting decent throttling in 
> place than by saving an int in struct bio.

Then fix the damn vm writeout. I always thought it was silly to depend
on the block layer for any sort of throttling. If it's not a system wide
problem, then throttle the io count in the make_request_fn handler of
that problematic driver.

> That said, I did a little analysis to get an idea of where the soft 
> targets are in struct bio, and to get to know the bio layer a little 
> better.  Maybe these few hints will get somebody interested enough to 
> look further.
> 
> > > It would be interesting to see if bi_bdev could be made read only.
> > > Generally, each stage in the block device stack knows what the next
> > > stage is going to be, so why do we have to write that in the bio? 
> > > For error reporting from interrupt context?  Anyway, if Evgeniy
> > > wants to do the patch, I will happily unload the task of convincing
> > > you that random fields are/are not needed in struct bio :-)
> >
> > It's a trade off, otherwise you'd have to pass the block device
> > around a lot.
> 
> Which costs very little, probably less than trashing an extra field's 
> worth of cache.

Again, you can make that argument for most of the members. It's a
non-starter.

> > And it's, again, a design issue. A bio contains 
> > destination information, that means device/offset/size information.
> > I'm all for shaving structure bytes where it matters, but not for the
> > sake of sacrificing code stability or design. I consider struct bio
> > quite lean and have worked hard to keep it that way. In fact, iirc,
> > the only addition to struct bio since 2001 is the iommu front/back
> > size members. And I resisted those for quite a while.
> 
> You did not comment on the one about putting the bio destructor in 
> the ->endio handler, which looks dead simple.  The majority of cases 
> just use the default endio handler and the default destructor.  Of the 
> remaining cases, where a specialized destructor is needed, typically a 
> specialized endio handler is too, so combining is free.  There are few 
> if any cases where a new specialized endio handler would need to be 
> written.

We could do that without too much work, I agree.

> As far as code stability goes, current kernels are horribly unstable in 
> a variety of contexts because of memory deadlock and slowdowns related 
> to the attempt to fix the problem via dirty memory limits.  Accurate 
> throttling of bio traffic is one of the two key requirements to fix 
> this instability, the other other is accurate writeout path reserve 
> management, which is only partially addressed by BIO_POOL.

Which, as written above and stated many times over the years on lkml, is
not a block layer issue imho.

> Nice to see you jumping in Jens.  Now it is over to the other side of 
> the thread where Evgeniy has posted a patch that a) grants your wish to 
> add no new field in struct bio and b) does not fix the problem.

This is why it's impossible for me to have any sort of constructive
conversation with you.

-- 
Jens Axboe


^ permalink raw reply

* Re: dm9000: add set_mac_address()
From: Michael Trimarchi @ 2007-08-13  7:24 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: netdev
In-Reply-To: <46BF9304.9080709@pobox.com>


>
>     probe time:
>     read MAC address from SROM
>
>     dev open (interface up):
>     write dev->dev_addr[] to RX filter (or identity) registers
>
> EEPROM update support is available separately, via an ethtool sub-ioctl.
>
>     Jeff
>
ok

regards
michael


^ permalink raw reply

* Re: [PATCH] [374/2many] MAINTAINERS - PCNET32 NETWORK DRIVER
From: Joe Perches @ 2007-08-13  7:22 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, pcnet32, netdev, linux-kernel, akpm
In-Reply-To: <20070813.001825.22019983.davem@davemloft.net>

On Mon, 2007-08-13 at 00:18 -0700, David Miller wrote:
> The posting limit is 400K for linux-kernel, netdev, and one
> or two of the other lists.

Apologies.  Posted it twice over 2 days.
Anyway, I supposed you could kill the spool entries if you want.

cheers,  Joe


^ permalink raw reply

* Re: [RFC PATCH v0.1] net driver: mpc52xx fec
From: Domen Puncer @ 2007-08-13  7:21 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, linuxppc-embedded, netdev
In-Reply-To: <20070810130225.GE3375@ghostprotocols.net>

On 10/08/07 10:02 -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Aug 10, 2007 at 11:51:53AM +0200, Domen Puncer escreveu:
> > +static u8 null_mac[6];
> 
> const

OK.

...
> > +static void fec_set_paddr(struct net_device *dev, u8 *mac)
> > +{
> > +	struct fec_priv *priv = netdev_priv(dev);
> > +	struct mpc52xx_fec __iomem *fec = priv->fec;
> > +
> > +	out_be32(&fec->paddr1, *(u32*)(&mac[0]));
> > +	out_be32(&fec->paddr2, (*(u16*)(&mac[4]) << 16) | FEC_PADDR2_TYPE);
> 
> spaces after the types on casts to pointers

I assume you mean: out_be32(&fec->paddr1, *(u32 *)(&mac[0]));

> 
> > +}
> > +
> > +static void fec_get_paddr(struct net_device *dev, u8 *mac)
> > +{
> > +	struct fec_priv *priv = netdev_priv(dev);
> > +	struct mpc52xx_fec __iomem *fec = priv->fec;
> > +
> > +	*(u32*)(&mac[0]) = in_be32(&fec->paddr1);
> > +	*(u16*)(&mac[4]) = in_be32(&fec->paddr2) >> 16;
> 
> ditto

OK.

> 
> > +}
> > +
> > +static int fec_set_mac_address(struct net_device *dev, void *addr)
> > +{
> > +	struct sockaddr *sock = (struct sockaddr *)addr;
> 
> no need for a cast, addr is a void pointer

Right, missed this one.

> 
> > +
> > +	memcpy(dev->dev_addr, sock->sa_data, dev->addr_len);
> > +
> > +	fec_set_paddr(dev, sock->sa_data);
> > +	return 0;
> 
> Why always return 0? make it void

Because struct net_device->set_mac_address's prototype is like that.

> 
> > +}
> > +
> > +static void fec_free_rx_buffers(struct bcom_task *s)
> > +{
> > +	struct sk_buff *skb;
> > +
> > +	while (!bcom_queue_empty(s)) {
> > +		skb = bcom_retrieve_buffer(s, NULL, NULL);
> > +		kfree_skb(skb);
> > +	}
> > +}
> > +
> > +static int fec_alloc_rx_buffers(struct bcom_task *rxtsk)
> > +{
> > +	while (!bcom_queue_full(rxtsk)) {
> > +		struct sk_buff *skb;
> > +		struct bcom_fec_bd *bd;
> > +
> > +		skb = dev_alloc_skb(FEC_RX_BUFFER_SIZE);
> > +		if (skb == 0)
> 
> Test against NULL

Right. I also fixed other stuff sparse reports.

> 
> > +			return -EAGAIN;
> > +

...
> > +
> > +	if (new_state && netif_msg_link(priv)) {
> > +		phy_print_status(phydev);
> > +	}
> 
> No need for {}, this if has only one statement

OK.

...
> > +static int __init
> > +mpc52xx_fec_init(void)
> > +{
> > +	int ret;
> > +	if ((ret = fec_mdio_init())) {
> 
> Why not:
> 
> 	int ret = fec_mdio_init();
> 	if (ret) {
> 
> Less parenthesis, looks more clear

I prefer it like that too.


Thanks Arnaldo!


Any other comments on code functionality, design?


	Domen

^ permalink raw reply

* Re: [PATCH] [374/2many] MAINTAINERS - PCNET32 NETWORK DRIVER
From: David Miller @ 2007-08-13  7:18 UTC (permalink / raw)
  To: joe; +Cc: torvalds, pcnet32, netdev, linux-kernel, akpm
In-Reply-To: <1186987609.10249.22.camel@localhost>

From: Joe Perches <joe@perches.com>
Date: Sun, 12 Aug 2007 23:46:49 -0700

> I tried to send 1 patch over the last couple of days.
> Unfortunately, it's > 100KB and disappears into the void.

The posting limit is 400K for linux-kernel, netdev, and one
or two of the other lists.

^ permalink raw reply

* Re: [PATCH] [374/2many] MAINTAINERS - PCNET32 NETWORK DRIVER
From: Al Viro @ 2007-08-13  7:18 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, torvalds, pcnet32, netdev, linux-kernel, akpm
In-Reply-To: <1186987609.10249.22.camel@localhost>

On Sun, Aug 12, 2007 at 11:46:49PM -0700, Joe Perches wrote:
> On Sun, 2007-08-12 at 23:36 -0700, David Miller wrote:
> > Ok, 374 patches is just rediculious.
> > 
> > So many patches eats up an enormous amount of mailing list resources,
> > and for these patches in particular there are few reasons to split
> > them up at all.  The fact that the split up landed you at 300+ patches
> > is a very good indication of that.
> 
> More than ridiculous.  Completely agree.
> 
> I tried to send 1 patch over the last couple of days.
> Unfortunately, it's > 100KB and disappears into the void.

So put it on anonftp/webpage/git tree and post the URI, damnit.
Of all ridiculous reasons for a mailbomb...

^ permalink raw reply

* Re: 2.6.23-rc2: WARNING: at kernel/irq/resend.c:70 check_irq_resend()
From: Marcin Ślusarz @ 2007-08-13  7:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, John Stoffel, linux-kernel,
	shemminger, vignaud, torvalds, akpm, alan, linux-net, netdev
In-Reply-To: <20070810101641.GA31396@elte.hu>

2007/8/10, Ingo Molnar <mingo@elte.hu>:
> Index: linux/arch/i386/kernel/io_apic.c
> ===================================================================
> --- linux.orig/arch/i386/kernel/io_apic.c
> +++ linux/arch/i386/kernel/io_apic.c
> @@ -735,7 +735,8 @@ void fastcall send_IPI_self(int vector)
>          * Wait for idle.
>          */
>         apic_wait_icr_idle();
> -       cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL;
> +       cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL |
> +               APIC_INT_LEVELTRIG;
>         /*
>          * Send the IPI. The write to APIC_ICR fires this off.
>          */
> Index: linux/arch/x86_64/kernel/genapic.c
> ===================================================================
> --- linux.orig/arch/x86_64/kernel/genapic.c
> +++ linux/arch/x86_64/kernel/genapic.c
> @@ -62,5 +62,6 @@ void __init setup_apic_routing(void)
>
>  void send_IPI_self(int vector)
>  {
> -       __send_IPI_shortcut(APIC_DEST_SELF, vector, APIC_DEST_PHYSICAL);
> +       __send_IPI_shortcut(APIC_DEST_SELF, vector, APIC_DEST_PHYSICAL |
> +                           APIC_INT_LEVELTRIG);
>  }
>
network card timed out as usual ;)

Marcin

^ permalink raw reply

* Re: [PATCH 6/24] make atomic_read() behave consistently on frv
From: Chris Snook @ 2007-08-13  6:44 UTC (permalink / raw)
  To: David Howells
  Cc: linux-kernel, linux-arch, torvalds, netdev, akpm, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl
In-Reply-To: <7680.1186822071@redhat.com>

David Howells wrote:
> Chris Snook <csnook@redhat.com> wrote:
> 
>> cpu_relax() contains a barrier, so it should do the right thing.  For non-smp
>> architectures, I'm concerned about interacting with interrupt handlers.  Some
>> drivers do use atomic_* operations.
> 
> I'm not sure that actually answers my question.  Why not smp_rmb()?
> 
> David

I would assume because we want to waste time efficiently even on non-smp 
architectures, rather than frying the CPU or draining the battery.  Certain 
looping execution patterns can cause the CPU to operate above thermal design 
power.  I have fans on my workstation that only ever come on when running 
LINPACK, and that's generally memory bandwidth-bound.  Just imagine what happens 
when you're executing the same few non-serializing instructions in a tight loop 
without ever stalling on memory fetches, or being scheduled out.

If there's another reason, I'd like to hear it too, because I'm just guessing here.

	-- Chris

^ permalink raw reply

* Re: [PATCH] [374/2many] MAINTAINERS - PCNET32 NETWORK DRIVER
From: Joe Perches @ 2007-08-13  6:46 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, pcnet32, netdev, linux-kernel, akpm
In-Reply-To: <20070812.233630.00455011.davem@davemloft.net>

On Sun, 2007-08-12 at 23:36 -0700, David Miller wrote:
> Ok, 374 patches is just rediculious.
> 
> So many patches eats up an enormous amount of mailing list resources,
> and for these patches in particular there are few reasons to split
> them up at all.  The fact that the split up landed you at 300+ patches
> is a very good indication of that.

More than ridiculous.  Completely agree.

I tried to send 1 patch over the last couple of days.
Unfortunately, it's > 100KB and disappears into the void.

How about 10 patches or so?

What about maintainer sign offs?

Personally, I don't think it's necessary, but for
the subsystem maintainers like you, if or when the
get_maintainers.pl get integrated into patch
generation mechanisms, you might get more emails.
Perhaps more than you want.

Suggestions?

One good thing by emailing all the listed maintainers,
I've gotten several bounces for invalid addresses.


^ permalink raw reply

* Re: Block device throttling [Re: Distributed storage.]
From: Daniel Phillips @ 2007-08-13  6:44 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jens Axboe, netdev, linux-kernel, linux-fsdevel, Peter Zijlstra
In-Reply-To: <200708122236.24096.phillips@phunq.net>

On Sunday 12 August 2007 22:36, I wrote:
> Note!  There are two more issues I forgot to mention earlier.

Oops, and there is also:

3) The bio throttle, which is supposed to prevent deadlock, can itself 
deadlock.  Let me see if I can remember how it goes.

  * generic_make_request puts a bio in flight
  * the bio gets past the throttle and initiates network IO
  * net calls sk_alloc->alloc_pages->shrink_caches
  * shrink_caches submits a bio recursively to our block device
  * this bio blocks on the throttle
  * net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right, however you 
can see the danger.  In ddsnap we kludge around this problem by not 
throttling any bio submitted in PF_MEMALLOC mode, which effectively 
increases our reserve requirement by the amount of IO that mm will 
submit to a given block device before deciding the device is congested 
and should be left alone.  This works, but is sloppy and disgusting.

The right thing to do is to make sure than the mm knows about our 
throttle accounting in backing_dev_info so it will not push IO to our 
device when it knows that the IO will just block on congestion.  
Instead, shrink_caches will find some other less congested block device 
or give up, causing alloc_pages to draw from the memalloc reserve to 
satisfy the sk_alloc request.

The mm already uses backing_dev_info this way, we just need to set the 
right bits in the backing_dev_info state flags.  I think Peter posted a 
patch set that included this feature at some point.

Regards,

Daniel

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox