* Re: Distributed storage.
From: Peter Zijlstra @ 2007-08-03 12:27 UTC
To: Evgeniy Polyakov
Cc: Daniel Phillips, netdev, linux-kernel, linux-fsdevel,
Arnaldo Carvalho de Melo
In-Reply-To: <20070803105747.GE10089@2ka.mipt.ru>
On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote:
> For receiving, the situation is worse, since the system does not know in
> advance which socket a given packet will belong to, so it must allocate
> from a global pool (and thus there must be an independent global reserve),
> and then exchange part of the socket's reserve for the global one (or just
> copy the packet to a new one allocated from the socket's reserve if it was
> set up, or drop it otherwise). A global independent reserve is what I
> proposed when I stopped advertising the network allocator, but it seems
> that it was not taken into account, and in Peter's patches the reserve was
> always allocated only when the system was under serious memory pressure,
> without any provision for per-socket reservation.
This is not true. I have a global reserve which is set up a priori. You
cannot allocate a reserve when under pressure; that does not make sense.
Let me explain my approach once again.
At swapon(8) time we allocate a global reserve and associate the needed
sockets with it. The size of this global reserve is made up of two
parts:
- TX
- RX
The RX pool is the most interesting part. It again is made up of two
parts:
- skb
- auxiliary data
The skb part is scaled such that it can overflow the IP fragment
reassembly, the aux pool such that it can overflow the route cache (that
was the largest other allocator in the RX path).
All (reserve) RX skb allocations are accounted, so as to never allocate
more than we reserved.
All packets are received (given the limit) and are processed up to
socket demux. At that point all packets not targeted at an associated
socket are dropped and the skb memory freed - ready for another packet.
All packets targeted for associated sockets get processed. This requires
that this packet processing happens in-kernel. Since we are swapping,
user-space might be waiting for this data, and we'd deadlock.
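The accounting described above can be modelled in a few lines of userspace C.
This is only a sketch under stated assumptions, not code from the patches: the
names rx_reserve, reserve_charge, reserve_uncharge and demux_keep are
hypothetical, and it counts bytes where the real code deals with skbs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the accounted RX reserve; not kernel code. */
struct rx_reserve {
	size_t limit;	/* bytes set aside up front (at swapon time) */
	size_t used;	/* bytes currently charged to in-flight packets */
};

/* Charge an allocation to the reserve; refuse once the limit is reached. */
static bool reserve_charge(struct rx_reserve *r, size_t bytes)
{
	if (bytes > r->limit - r->used)
		return false;
	r->used += bytes;
	return true;
}

/* Give memory back to the reserve when a packet is freed. */
static void reserve_uncharge(struct rx_reserve *r, size_t bytes)
{
	assert(bytes <= r->used);
	r->used -= bytes;
}

/*
 * Socket demux under pressure: a packet for an associated socket is kept
 * for in-kernel processing (its charge stays until that finishes); any
 * other packet is dropped and its memory is ready for the next packet.
 */
static bool demux_keep(struct rx_reserve *r, size_t bytes, bool associated)
{
	if (associated)
		return true;
	reserve_uncharge(r, bytes);
	return false;
}
```

The point of the model is that reserve_charge() can never hand out more than
was set aside, and that dropping at demux is what keeps the reserve cycling.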
I'm not quite sure why you need per socket reservations.
* Re: [patch] genirq: fix simple and fasteoi irq handlers
From: Jarek Poplawski @ 2007-08-03 12:26 UTC
To: Marcin Ślusarz
Cc: Ingo Molnar, Gabriel C, Linus Torvalds, Thomas Gleixner,
Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
netdev, Andrew Morton, Alan Cox
In-Reply-To: <4bacf17f0708030457i8d5c32xcd1ec4a0640f3822@mail.gmail.com>
On Fri, Aug 03, 2007 at 01:57:00PM +0200, Marcin Ślusarz wrote:
...
> I'll test this patch tomorrow (and confirm that the last one from Ingo
> works fine) and report results on Monday (sorry, no internet at home
> since I moved out of the city :|).
So, you are a lucky guy! I have only no internet at home.
...and time for dreaming about moving out of a city...
Cheers,
Jarek P.
* Re: strange tcp behavior
From: Evgeniy Polyakov @ 2007-08-03 12:09 UTC
To: Simon Arlott; +Cc: john, netdev, David Miller
In-Reply-To: <60580.simon.1186142626@5ec7c279.invalid>
On Fri, Aug 03, 2007 at 01:03:46PM +0100, Simon Arlott (simon@fire.lp0.eu) wrote:
> On Fri, August 3, 2007 12:56, Evgeniy Polyakov wrote:
> > On Fri, Aug 03, 2007 at 12:21:46PM +0100, Simon Arlott (simon@fire.lp0.eu) wrote:
> >> Since the connection is considered closed, couldn't another socket re-use it?
> >>
> >> Socket A: Recv data (unread)
> >> Socket A: Recv RST
> >> Socket B: Reuses connection (same IPs/ports)
> >> Socket A: Close
> >>
> >> Wouldn't that disrupt socket B's use of the connection?
> >
> > Then it will drop our data, since there was no appropriate handshake.
>
> Couldn't the sequence numbers be close enough to make the RST valid?
It does not matter: if the connection is not in a synchronized state, all
unrelated data is dropped, so the remote side is allowed to accept only a
segment with the SYN flag; anything else must be dropped. If the remote
side does not do that, it violates the RFC.
> --
> Simon Arlott
--
Evgeniy Polyakov
* Re: strange tcp behavior
From: Simon Arlott @ 2007-08-03 12:03 UTC
To: Evgeniy Polyakov; +Cc: john, netdev, David Miller
In-Reply-To: <20070803115624.GD5727@2ka.mipt.ru>
On Fri, August 3, 2007 12:56, Evgeniy Polyakov wrote:
> On Fri, Aug 03, 2007 at 12:21:46PM +0100, Simon Arlott (simon@fire.lp0.eu) wrote:
>> Since the connection is considered closed, couldn't another socket re-use it?
>>
>> Socket A: Recv data (unread)
>> Socket A: Recv RST
>> Socket B: Reuses connection (same IPs/ports)
>> Socket A: Close
>>
>> Wouldn't that disrupt socket B's use of the connection?
>
> Then it will drop our data, since there was no appropriate handshake.
Couldn't the sequence numbers be close enough to make the RST valid?
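For context, the validity test in question is the RFC 793 rule that, in
synchronized states, a RST is acceptable only if its sequence number lies in
the receive window. A minimal sketch of that test with wrap-around arithmetic
follows; the function names are hypothetical and this is not the Linux
implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a comes before b" on 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * RFC 793 window test for a zero-length segment such as a bare RST:
 * rcv_nxt <= seq < rcv_nxt + rcv_wnd (mod 2^32), or seq == rcv_nxt
 * when the window is zero.
 */
static bool rst_seq_in_window(uint32_t seq, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	if (rcv_wnd == 0)
		return seq == rcv_nxt;
	return !seq_before(seq, rcv_nxt) && seq_before(seq, rcv_nxt + rcv_wnd);
}
```

With a receive window of a few tens of kilobytes, only a small slice of the
32-bit sequence space passes this test.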
--
Simon Arlott
* Re: strange tcp behavior
From: Evgeniy Polyakov @ 2007-08-03 11:56 UTC
To: Simon Arlott; +Cc: john, netdev, David Miller
In-Reply-To: <46956.simon.1186140106@5ec7c279.invalid>
On Fri, Aug 03, 2007 at 12:21:46PM +0100, Simon Arlott (simon@fire.lp0.eu) wrote:
> Since the connection is considered closed, couldn't another socket re-use it?
>
> Socket A: Recv data (unread)
> Socket A: Recv RST
> Socket B: Reuses connection (same IPs/ports)
> Socket A: Close
>
> Wouldn't that disrupt socket B's use of the connection?
Then it will drop our data, since there was no appropriate handshake.
> --
> Simon Arlott
--
Evgeniy Polyakov
* Re: [patch] genirq: fix simple and fasteoi irq handlers
From: Marcin Ślusarz @ 2007-08-03 11:57 UTC
To: Jarek Poplawski
Cc: Ingo Molnar, Gabriel C, Linus Torvalds, Thomas Gleixner,
Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
netdev, Andrew Morton, Alan Cox
In-Reply-To: <20070803091008.GC1582@ff.dom.local>
2007/8/3, Jarek Poplawski <jarkao2@o2.pl>:
> On Fri, Aug 03, 2007 at 10:04:08AM +0200, Ingo Molnar wrote:
> >
> > * Jarek Poplawski <jarkao2@o2.pl> wrote:
> >
> > > I can't guarantee this is all needed to fix this bug, but I think this
> > > patch is necessary here.
> >
> > hmmm ... very interesting! Now _this_ is something we'd like to see
> > tested. Could you send a patch to Marcin that also undoes the workaround
> > we have in place now, so that he could check whether ne2k-pci works fine
> > with your fix alone?
>
> I'm not sure this is needed... Marcin got this patch, I hope, and I
> don't have another possibility to contact with him. Since he managed
> with this bisection and all the previous patches I don't think there
> could be any problems, so:
>
> Marcin! I'd be very glad if you could test this patch alone; this
> should apply without any problems to 2.6.21 (with some offset) and
> later "vanilla" versions (or try to revert Ingo's last patch with
> patch -p1 -R). Please, contact me on any problems (alas not during
> the weekend...).
I'll test this patch tomorrow (and confirm that the last one from Ingo
works fine) and report results on Monday (sorry, no internet at home
since I moved out of the city :|).
Marcin
* [patch]support for USB autosuspend in the asix driver
From: Oliver Neukum @ 2007-08-03 11:52 UTC
To: jgarzik, netdev, David Brownell
Hi,
this implements support for USB autosuspend in the asix USB ethernet
driver.
Regards
Oliver
Signed-off-by: Oliver Neukum <oneukum@suse.de>
---
--- a/drivers/net/usb/asix.c 2007-08-03 13:16:31.000000000 +0200
+++ b/drivers/net/usb/asix.c 2007-08-03 13:17:05.000000000 +0200
@@ -1474,6 +1474,7 @@ static struct usb_driver asix_driver = {
.suspend = usbnet_suspend,
.resume = usbnet_resume,
.disconnect = usbnet_disconnect,
+ .supports_autosuspend = 1,
};
static int __init asix_init(void)
--- a/drivers/net/usb/usbnet.c 2007-08-03 13:16:53.000000000 +0200
+++ b/drivers/net/usb/usbnet.c 2007-08-03 13:19:31.000000000 +0200
@@ -588,6 +588,7 @@ static int usbnet_stop (struct net_devic
dev->flags = 0;
del_timer_sync (&dev->delay);
tasklet_kill (&dev->bh);
+ usb_autopm_put_interface(dev->intf);
return 0;
}
@@ -601,9 +602,19 @@ static int usbnet_stop (struct net_devic
static int usbnet_open (struct net_device *net)
{
struct usbnet *dev = netdev_priv(net);
- int retval = 0;
+ int retval;
struct driver_info *info = dev->driver_info;
+ if ((retval = usb_autopm_get_interface(dev->intf)) < 0) {
+ if (netif_msg_ifup (dev))
+ devinfo (dev,
+ "resumption fail (%d) usbnet usb-%s-%s, %s",
+ retval,
+ dev->udev->bus->bus_name, dev->udev->devpath,
+ info->description);
+ goto done_nopm;
+ }
+
// put into "known safe" state
if (info->reset && (retval = info->reset (dev)) < 0) {
if (netif_msg_ifup (dev))
@@ -657,7 +668,10 @@ static int usbnet_open (struct net_devic
// delay posting reads until we're fully open
tasklet_schedule (&dev->bh);
+ return retval;
done:
+ usb_autopm_put_interface(dev->intf);
+done_nopm:
return retval;
}
@@ -1141,6 +1155,7 @@ usbnet_probe (struct usb_interface *udev
dev = netdev_priv(net);
dev->udev = xdev;
+ dev->intf = udev;
dev->driver_info = info;
dev->driver_name = name;
dev->msg_enable = netif_msg_init (msg_level, NETIF_MSG_DRV
@@ -1265,12 +1280,18 @@ int usbnet_suspend (struct usb_interface
struct usbnet *dev = usb_get_intfdata(intf);
if (!dev->suspend_count++) {
- /* accelerate emptying of the rx and queues, to avoid
+ /*
+ * accelerate emptying of the rx and queues, to avoid
* having everything error out.
*/
netif_device_detach (dev->net);
(void) unlink_urbs (dev, &dev->rxq);
(void) unlink_urbs (dev, &dev->txq);
+ /*
+ * reattach so runtime management can use and
+ * wake the device
+ */
+ netif_device_attach (dev->net);
}
return 0;
}
@@ -1280,10 +1301,9 @@ int usbnet_resume (struct usb_interface
{
struct usbnet *dev = usb_get_intfdata(intf);
- if (!--dev->suspend_count) {
- netif_device_attach (dev->net);
+ if (!--dev->suspend_count)
tasklet_schedule (&dev->bh);
- }
+
return 0;
}
EXPORT_SYMBOL_GPL(usbnet_resume);
--- a/drivers/net/usb/usbnet.h 2007-08-03 13:16:44.000000000 +0200
+++ b/drivers/net/usb/usbnet.h 2007-08-03 13:17:05.000000000 +0200
@@ -28,6 +28,7 @@
struct usbnet {
/* housekeeping */
struct usb_device *udev;
+ struct usb_interface *intf;
struct driver_info *driver_info;
const char *driver_name;
wait_queue_head_t *wait;
* Re: [patch 0/5][RFC] Update network drivers to use devres
From: Tejun Heo @ 2007-08-03 11:33 UTC
To: Stephen Hemminger; +Cc: Brandon Philips, netdev, teheo
In-Reply-To: <20070803120745.2d89c221@oldman.hamilton.local>
Hello,
Stephen Hemminger wrote:
>> Skimming through drivers... via-rhine doesn't disable PCI device on
>> init failure path but does so on removal. sky2 doesn't free
>> consistent memory if sky2_init() fails. acenic calls iounmap() with
>> NULL parameter which I'm not sure whether it's safe or not. natsemi
>> doesn't disable PCI device on failure or removal.
>
> Did you report these to the developers?
Just skimmed through. I'm pretty sure Brandon will pick those up later.
>> Devres makes low level drivers simpler, easier to get right and
>> maintain. Writing new drivers becomes easier too. So, why not?
>>
>>> Network devices seem to work fine thanks, and the resource requirements
>>> are different. If ain't broke, don't fix it.
>> Care to enlighten me on how the resource requirements are different
>> from ATA drivers?
>
> I was thinking of the hot remove (no mod ref counts) and lingering
> /sys open issues. ATA drivers use ref counts.
I guess the hot removing is done by severing netdev from the actual
device, right? I don't see how that affects usage of devres on network
drivers. Am I missing something?
On a separate note, can you explain lingering /sys open issue to me a
bit? With recent sysfs changes, sysfs nodes are disconnected
immediately on deletion. Would that make any difference to netdevs?
> My take on devres is that it is similar to talloc() for device drivers.
> Not a bad idea in itself, but the real advantage of hierarchical allocation
> is that it makes exception handling easier if things are layered deeply.
Yeah, devres made layering easier in libata, especially SFF stuff.
Dunno how much of that is applicable to netdev but, with or without
layering, it'll be a nice cleanup and I don't see much negative side.
Conversion would take some work and bugs might be introduced in the
process as with any changes but the good thing about devres is that
you're very likely to get failure/release paths right if you get the
init path right, and if you get the init path wrong, it will stand out
like a sore thumb - easy to spot, easy to fix.
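To illustrate why the failure path comes for free, here is a small userspace
model of the idea: every acquisition registers its own undo function, and one
call releases everything in reverse order. It is a sketch only; dev_model,
res_add, res_release_all and probe_model are hypothetical names and not the
devres API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RES 8

typedef void (*release_fn)(void *data);

/* Hypothetical stand-in for a device's list of managed resources. */
struct dev_model {
	release_fn release[MAX_RES];
	void *data[MAX_RES];
	size_t nr;
};

/* Register a resource together with the function that undoes it. */
static int res_add(struct dev_model *d, release_fn fn, void *data)
{
	if (d->nr == MAX_RES)
		return -1;
	d->release[d->nr] = fn;
	d->data[d->nr] = data;
	d->nr++;
	return 0;
}

/* Undo everything in reverse order of acquisition. */
static void res_release_all(struct dev_model *d)
{
	while (d->nr > 0) {
		d->nr--;
		d->release[d->nr](d->data[d->nr]);
	}
}

/* Example release callback: record which resource was released. */
static char release_log[MAX_RES];
static size_t release_log_len;

static void log_release(void *data)
{
	assert(release_log_len < MAX_RES);
	release_log[release_log_len++] = *(char *)data;
}

/* Init path: acquire A, then B; any failure is cleaned up by one call. */
static int probe_model(struct dev_model *d, int fail_late)
{
	static char a = 'A', b = 'B';

	if (res_add(d, log_release, &a) ||
	    res_add(d, log_release, &b) || fail_late) {
		res_release_all(d);
		return -1;
	}
	return 0;
}
```

The init path is the only place that lists resources; the failure and removal
paths are both just res_release_all(), so they cannot drift out of sync.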
So, I think using devres on net drivers is a good idea, well, for that
matter, for any driver, but me being the devres writer, that isn't
really surprising, is it?
Thanks.
--
tejun
* Re: strange tcp behavior
From: Simon Arlott @ 2007-08-03 11:21 UTC
To: Evgeniy Polyakov; +Cc: john, netdev, David Miller
In-Reply-To: <20070803082517.GB25582@2ka.mipt.ru>
On Fri, August 3, 2007 09:25, Evgeniy Polyakov wrote:
> On Thu, Aug 02, 2007 at 07:58:03PM +0100, Simon Arlott (simon@fire.lp0.eu) wrote:
>> 19:24:32.897071 IP 192.168.7.4.50000 > 192.168.7.8.2500: S 705362199:705362199(0) win 1500
>> 19:24:32.897211 IP 192.168.7.8.2500 > 192.168.7.4.50000: S 4159455228:4159455228(0) ack 705362200 win 14360 <mss 7180>
>> 19:24:32.920784 IP 192.168.7.4.50000 > 192.168.7.8.2500: . ack 1 win 1500
>> 19:24:32.921732 IP 192.168.7.4.50000 > 192.168.7.8.2500: P 1:17(16) ack 1 win 1500
>> 19:24:32.921795 IP 192.168.7.8.2500 > 192.168.7.4.50000: . ack 17 win 14360
>> 19:24:32.922881 IP 192.168.7.4.50000 > 192.168.7.8.2500: R 705362216:705362216(0) win 1500
>> 19:24:34.927717 IP 192.168.7.8.2500 > 192.168.7.4.50000: R 1:1(0) ack 17 win 14360
>>
>> According to RFC 793, the RST from .4 means that the connection
>> is CLOSED.
>
> RFC 2525 - common tcp problems, says we should send RST in this case,
> although it does not specify should we send it if socket is in CLOSED
> state or not. Well, we send :)
> Even if tcp_send_active_reset() will check if socket is in CLOSED state
> and will not send data, but is still there, it will not be easily
> triggered though, but it can be possible.
Since the connection is considered closed, couldn't another socket re-use it?
Socket A: Recv data (unread)
Socket A: Recv RST
Socket B: Reuses connection (same IPs/ports)
Socket A: Close
Wouldn't that disrupt socket B's use of the connection?
--
Simon Arlott
* Re: [patch 0/5][RFC] Update network drivers to use devres
From: Stephen Hemminger @ 2007-08-03 11:07 UTC
To: Tejun Heo; +Cc: Brandon Philips, netdev, teheo
In-Reply-To: <20070803102645.GO13674@htj.dyndns.org>
On Fri, 3 Aug 2007 19:26:45 +0900
Tejun Heo <htejun@gmail.com> wrote:
> On Fri, Aug 03, 2007 at 09:58:57AM +0100, Stephen Hemminger wrote:
> > On Thu, 2 Aug 2007 15:42:06 -0700
> > Brandon Philips <brandon@ifup.org> wrote:
> >
> > > This patch set adds support for devres in the net core and converts the
> > > e100 and e1000 drivers to devres. Devres is a simple resource manager
> > > for device drivers, see Documentation/driver-model/devres.txt for more
> > > information.
> > >
> > > The use of devres will remain optional for drivers with this patch set.
> > > Drivers can be converted when it makes sense.
> >
> > Just because devres exists is not sufficient motivation to change.
> >
> > It seems that devres was a band-aid rather than fixing storage drivers
> > to have proper DMA lifetimes.
>
> I don't really get what you mean by "having proper DMA lifetimes" but
> please don't write devres off too fast. devres doesn't solve any
> problem that you can't fix without it but it does make the 'solving'
> much easier.
>
> IMHO, libata drivers generally have been well maintained and reviewed
> but I could still find quite a few bugs (resource leaks or
> occasionally double free) in init failure and removal paths. Init
> failure paths are especially prone to bugs because they don't get
> > exercised often. It's just very easy to make a mistake and fail to
> notice and low level drivers don't always get sufficient amount of
> review or testing.
>
> Skimming through drivers... via-rhine doesn't disable PCI device on
> init failure path but does so on removal. sky2 doesn't free
> consistent memory if sky2_init() fails. acenic calls iounmap() with
> NULL parameter which I'm not sure whether it's safe or not. natsemi
> doesn't disable PCI device on failure or removal.
Did you report these to the developers?
> Devres makes low level drivers simpler, easier to get right and
> maintain. Writing new drivers becomes easier too. So, why not?
>
> > Network devices seem to work fine thanks, and the resource requirements
> > are different. If ain't broke, don't fix it.
>
> Care to enlighten me on how the resource requirements are different
> from ATA drivers?
I was thinking of the hot remove (no mod ref counts) and lingering
/sys open issues. ATA drivers use ref counts.
My take on devres is that it is similar to talloc() for device drivers.
Not a bad idea in itself, but the real advantage of hierarchical allocation
is that it makes exception handling easier if things are layered deeply.
* Re: Distributed storage.
From: Evgeniy Polyakov @ 2007-08-03 10:57 UTC
To: Daniel Phillips; +Cc: netdev, linux-kernel, linux-fsdevel, Peter Zijlstra
In-Reply-To: <20070803102629.GB10089@2ka.mipt.ru>
On Fri, Aug 03, 2007 at 02:26:29PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > Memory deadlock is a concern of course. From a cursory glance through,
> > it looks like this code is pretty vm-friendly and you have thought
> > quite a lot about it, however I respectfully invite peterz
> > (obsessive/compulsive memory deadlock hunter) to help give it a good
> > going over with me.
Another major issue is network allocations.
Your initial work and the subsequent releases made by Peter were originally
opposed on my side, but now I think the right way is to combine the
positive aspects of your approach with a specialized allocator.
Essentially, what I proposed (in the blog only, though) is to bind an
independent reserve to any socket. Such a reserve can be stolen from the
socket buffer itself (each socket has a limited socket buffer from which
packets are allocated; it accounts for both data and control (skb)
lengths), so when the main allocation via the common path fails, it would
be possible to get data from the socket's own reserve. This allows sending
sockets to make progress in case of deadlock.
For receiving, the situation is worse, since the system does not know in
advance which socket a given packet will belong to, so it must allocate
from a global pool (and thus there must be an independent global reserve),
and then exchange part of the socket's reserve for the global one (or just
copy the packet to a new one allocated from the socket's reserve if it was
set up, or drop it otherwise). A global independent reserve is what I
proposed when I stopped advertising the network allocator, but it seems
that it was not taken into account, and in Peter's patches the reserve was
always allocated only when the system was under serious memory pressure,
without any provision for per-socket reservation.
It allows sockets to be separated and effectively makes them fair: a
system administrator or programmer can limit a socket's buffer a bit and
request a reserve for special communication channels, which will then be
guaranteed to make both sending and receiving progress, no matter how many
of them were set up. And it does not require any changes beyond the network
side.
--
Evgeniy Polyakov
* Re: Distributed storage.
From: Evgeniy Polyakov @ 2007-08-03 10:44 UTC
To: Manu Abraham; +Cc: netdev, linux-kernel, linux-fsdevel
In-Reply-To: <1a297b360708022204u4fc7603pb6baebe2bdf28618@mail.gmail.com>
Hi.
On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham (abraham.manu@gmail.com) wrote:
> On 7/31/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > TODO list currently includes following main items:
> > * redundancy algorithm (drop me a request of your own, but it is highly
> > unlikley that Reed-Solomon based will ever be used - it is too slow
> > for distributed RAID, I consider WEAVER codes)
>
>
> LDPC codes[1][2] have been replacing Turbo code[3] with regards to
> communication links and we have been seeing that transition. (maybe
> helpful, came to mind seeing the mention of Turbo code) Don't know how
> weaver compares to LDPC, though found some comparisons [4][5] But
> looking at fault tolerance figures, i guess Weaver is much better.
>
> [1] http://www.ldpc-codes.com/
> [2] http://portal.acm.org/citation.cfm?id=1240497
> [3] http://en.wikipedia.org/wiki/Turbo_code
> [4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf
> [5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf
Many thanks for these links; I will definitely study them.
--
Evgeniy Polyakov
* Re: Distributed storage.
From: Evgeniy Polyakov @ 2007-08-03 10:42 UTC
To: Mike Snitzer; +Cc: netdev, linux-kernel, linux-fsdevel, Daniel Phillips
In-Reply-To: <170fa0d20708022109s60ebb85aqe68ec1033634ef27@mail.gmail.com>
Hi Mike.
On Fri, Aug 03, 2007 at 12:09:02AM -0400, Mike Snitzer (snitzer@gmail.com) wrote:
> > * storage can be formed on top of remote nodes and be exported
> > simultaneously (iSCSI is peer-to-peer only, NBD requires device
> > mapper and is synchronous)
>
> Having the in-kernel export is a great improvement over NBD's
> userspace nbd-server (extra copy, etc).
>
> But NBD's synchronous nature is actually an asset when coupled with MD
> raid1 as it provides guarantees that the data has _really_ been
> mirrored remotely.
I believe that the right answer to this is a barrier rather than synchronous
sending/receiving, which might slow things down noticeably. A barrier must
wait until the remote side has received the data and sent back a notice.
Until the acknowledgement is received, no one can say whether the data has
been mirrored, or even received, by the remote node.
> > TODO list currently includes following main items:
> > * redundancy algorithm (drop me a request of your own, but it is highly
> > unlikley that Reed-Solomon based will ever be used - it is too slow
> > for distributed RAID, I consider WEAVER codes)
>
> I'd like to better understand where you see DST heading in the area of
> redundancy. Based on your blog entries:
> http://tservice.net.ru/~s0mbre/blog/devel/dst/2007_07_24_1.html
> http://tservice.net.ru/~s0mbre/blog/devel/dst/2007_07_31_2.html
> (and your todo above) implementing a mirroring algorithm appears to be
> a near-term goal for you. Can you comment on how your intended
> implementation would compare, in terms of correctness and efficiency,
> to say MD (raid1) + NBD? MD raid1 has a write intent bitmap that is
> useful to speed resyncs; what if any mechanisms do you see DST
> embracing to provide similar and/or better reconstruction
> infrastructure? Do you intend to embrace any exisiting MD or DM
> infrastructure?
It depends on which algorithm will be preferred. I do not want mirroring:
it is _too_ wasteful in terms of used storage, but it is the simplest.
Right now I still consider WEAVER codes the fastest in a distributed
environment from what I checked before, but they are quite complex and the
spec is (at least for me) not clear in all aspects right now. I have not
even started a userspace implementation of those codes. (Hint: spec sucks,
kidding :)
For simple mirroring each node must be split into chunks, each of which has
a representation bit in the main node mask; when a chunk is dirty, the full
chunk is resynced. The chunk size varies depending on node size and amount
of memory. Setup is performed during node initialization. Having a checksum
for each chunk is a good step.
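The chunk mask for simple mirroring could look roughly like the following
userspace sketch. Everything here is an assumption made for illustration: the
1 MiB chunk size, the 64-chunk node, and the names node_mask, mark_dirty,
chunk_dirty and resync do not come from the DST code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SHIFT	20	/* assumption: 1 MiB chunks */
#define NR_CHUNKS	64	/* assumption: a node of 64 chunks */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS	((NR_CHUNKS + WORD_BITS - 1) / WORD_BITS)

/* One bit per chunk; a set bit means the chunk needs a resync. */
struct node_mask {
	unsigned long dirty[NR_WORDS];
};

static bool chunk_dirty(const struct node_mask *m, size_t c)
{
	return (m->dirty[c / WORD_BITS] >> (c % WORD_BITS)) & 1UL;
}

/* A write of len bytes at offset dirties every chunk it touches. */
static void mark_dirty(struct node_mask *m, uint64_t offset, uint64_t len)
{
	assert(len > 0);
	size_t first = (size_t)(offset >> CHUNK_SHIFT);
	size_t last = (size_t)((offset + len - 1) >> CHUNK_SHIFT);

	assert(last < NR_CHUNKS);
	for (size_t c = first; c <= last; c++)
		m->dirty[c / WORD_BITS] |= 1UL << (c % WORD_BITS);
}

/*
 * Resync every dirty chunk in full and clear its bit. copy_chunk may be
 * NULL, in which case the chunks are only counted. Returns how many
 * chunks were resynced.
 */
static size_t resync(struct node_mask *m, void (*copy_chunk)(size_t chunk))
{
	size_t done = 0;

	for (size_t c = 0; c < NR_CHUNKS; c++) {
		if (!chunk_dirty(m, c))
			continue;
		if (copy_chunk)
			copy_chunk(c);
		m->dirty[c / WORD_BITS] &= ~(1UL << (c % WORD_BITS));
		done++;
	}
	return done;
}
```

A larger chunk size shrinks the mask but makes each resync copy more data,
which is the trade-off between node size and memory mentioned above.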
All interfaces are already there, although they require cleanup and moving
from place to place, but I decided to make the initial release small.
> BTW, you have definitely published some very compelling work and its
> sad that you're predisposed to think DST won't be recieved well if you
> pushed for inclusion (for others, as much was said in the 7.31.2007
> blog post I referenced above). Clearly others need to embrace DST to
> help inclusion become a reality. To that end, its great to see that
> Daniel Phillips and the other zumastor folks will be putting DST
> through its paces.
In that blog entry I misspelled Zen as Xen; that's an error. As for the
prognosis, time will judge :)
> regards,
> Mike
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Evgeniy Polyakov
* Re: [patch 0/5][RFC] Update network drivers to use devres
From: Tejun Heo @ 2007-08-03 10:26 UTC
To: Stephen Hemminger; +Cc: Brandon Philips, netdev, teheo
In-Reply-To: <20070803095857.5fb3f368@oldman.hamilton.local>
On Fri, Aug 03, 2007 at 09:58:57AM +0100, Stephen Hemminger wrote:
> On Thu, 2 Aug 2007 15:42:06 -0700
> Brandon Philips <brandon@ifup.org> wrote:
>
> > This patch set adds support for devres in the net core and converts the
> > e100 and e1000 drivers to devres. Devres is a simple resource manager
> > for device drivers, see Documentation/driver-model/devres.txt for more
> > information.
> >
> > The use of devres will remain optional for drivers with this patch set.
> > Drivers can be converted when it makes sense.
>
> Just because devres exists is not sufficient motivation to change.
>
> It seems that devres was a band-aid rather than fixing storage drivers
> to have proper DMA lifetimes.
I don't really get what you mean by "having proper DMA lifetimes" but
please don't write devres off too fast. devres doesn't solve any
problem that you can't fix without it but it does make the 'solving'
much easier.
IMHO, libata drivers generally have been well maintained and reviewed
but I could still find quite a few bugs (resource leaks or
occasionally double free) in init failure and removal paths. Init
failure paths are especially prone to bugs because they don't get
exercised often. It's just very easy to make a mistake and fail to
notice and low level drivers don't always get sufficient amount of
review or testing.
Skimming through drivers... via-rhine doesn't disable PCI device on
init failure path but does so on removal. sky2 doesn't free
consistent memory if sky2_init() fails. acenic calls iounmap() with
NULL parameter which I'm not sure whether it's safe or not. natsemi
doesn't disable PCI device on failure or removal.
Devres makes low level drivers simpler, easier to get right and
maintain. Writing new drivers becomes easier too. So, why not?
> Network devices seem to work fine thanks, and the resource requirements
> are different. If ain't broke, don't fix it.
Care to enlighten me on how the resource requirements are different
from ATA drivers?
Thanks.
--
tejun
* Re: Distributed storage.
From: Evgeniy Polyakov @ 2007-08-03 10:26 UTC
To: Daniel Phillips; +Cc: netdev, linux-kernel, linux-fsdevel, Peter Zijlstra
In-Reply-To: <200708021408.24876.phillips@phunq.net>
On Thu, Aug 02, 2007 at 02:08:24PM -0700, Daniel Phillips (phillips@phunq.net) wrote:
> On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> > Hi.
> >
> > I'm pleased to announce first release of the distributed storage
> > subsystem, which allows to form a storage on top of remote and local
> > nodes, which in turn can be exported to another storage as a node to
> > form tree-like storages.
>
> Excellent! This is precisely what the doctor ordered for the
> OCFS2-based distributed storage system I have been mumbling about for
> some time. In fact the dd in ddsnap and ddraid stands for "distributed
> data". The ddsnap/raid devices do not include an actual network
> transport, that is expected to be provided by a specialized block
> device, which up till now has been NBD. But NBD has various
> deficiencies as you note, in addition to its tendency to deadlock when
> accessed locally. Your new code base may be just the thing we always
> wanted. We (zumastor et al) will take it for a drive and see if
> anything breaks.
That would be great.
> Memory deadlock is a concern of course. From a cursory glance through,
> it looks like this code is pretty vm-friendly and you have thought
> quite a lot about it, however I respectfully invite peterz
> (obsessive/compulsive memory deadlock hunter) to help give it a good
> going over with me.
>
> I see bits that worry me, e.g.:
>
> + req = mempool_alloc(st->w->req_pool, GFP_NOIO);
>
> which seems to be callable in response to a local request, just the case
> where NBD deadlocks. Your mempool strategy can work reliably only if
> you can prove that the pool allocations of the maximum number of
> requests you can have in flight do not exceed the size of the pool. In
> other words, if you ever take the pool's fallback path to normal
> allocation, you risk deadlock.
The mempool should be sized to cope with the maximum number of in-flight
requests. In my tests I was unable to force the block layer to put more
than 31 pages in a sync, but all in one bio. Each request is essentially
delayed bio processing, so the pool must handle the maximum number of
in-flight bios (if they do not cover multiple nodes; if they do, then each
node requires its own request). Sync has one bio in flight on my machines
(from tiny VIA nodes to low-end amd64), and the number of normal requests
*usually* does not exceed several dozen (always less than a hundred), but
that might be only my small systems, so the request size was made as small
as possible and the number of allocations was decreased to the absolute
minimum.
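The sizing rule under discussion (the pool must cover the maximum number of
in-flight requests, so the fallback allocation is never needed) can be shown
with a toy preallocated pool. This is a userspace sketch with hypothetical
names (req_pool, pool_get, pool_put) and an assumed bound of 4; it is not the
kernel mempool API, which among other things can sleep waiting for an element.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4	/* assumption: upper bound on in-flight requests */

struct request {
	int in_use;
};

/* All requests are preallocated; nothing is allocated under pressure. */
struct req_pool {
	struct request slot[POOL_SIZE];
	size_t in_flight;
};

/* Take a preallocated request; NULL means the caller has to throttle. */
static struct request *pool_get(struct req_pool *p)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (!p->slot[i].in_use) {
			p->slot[i].in_use = 1;
			p->in_flight++;
			return &p->slot[i];
		}
	}
	return NULL;
}

/* Return a completed request so the next one can make progress. */
static void pool_put(struct req_pool *p, struct request *r)
{
	assert(r->in_use && p->in_flight > 0);
	r->in_use = 0;
	p->in_flight--;
}
```

As long as submitters never hold more than POOL_SIZE requests at once,
pool_get() cannot fail, which is the property that has to be proven for the
real pool.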
> Anyway, if this is as grand as it seems then I would think we ought to
> factor out a common transfer core that can be used by all of NBD,
> iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own
> code those things have now.
>
> Regards,
>
> Daniel
Thanks.
--
Evgeniy Polyakov
* [PATCH] TCP: H-TCP maxRTT estimation at startup
From: Stephen Hemminger @ 2007-08-03 9:57 UTC
To: David S. Miller; +Cc: Douglas Leith, netdev
In-Reply-To: <CA61FA76-2C22-4048-B7AC-1D8E9D94868C@nuim.ie>
Small patch to H-TCP from Douglas Leith.
Fix estimation of maxRTT. The original code ignores rtt measurements
during slow start (via the check tp->snd_ssthresh < 0xFFFF) yet this
is probably a good time to try to estimate max rtt as delayed acking
is disabled and slow start will only exit on a loss which presumably
corresponds to a maxrtt measurement. Second, the original code (via
the check htcp_ccount(ca) > 3) ignores rtt data during what it
estimates to be the first 3 round-trip times. This seems like an
unnecessary check now that the RCV timestamps are no longer used
for rtt estimation.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
--- a/net/ipv4/tcp_htcp.c 2007-08-03 10:51:51.000000000 +0100
+++ b/net/ipv4/tcp_htcp.c 2007-08-03 10:51:53.000000000 +0100
@@ -79,7 +79,6 @@ static u32 htcp_cwnd_undo(struct sock *s
static inline void measure_rtt(struct sock *sk, u32 srtt)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
- const struct tcp_sock *tp = tcp_sk(sk);
struct htcp *ca = inet_csk_ca(sk);
/* keep track of minimum RTT seen so far, minRTT is zero at first */
@@ -87,8 +86,7 @@ static inline void measure_rtt(struct so
ca->minRTT = srtt;
/* max RTT */
- if (icsk->icsk_ca_state == TCP_CA_Open
- && tp->snd_ssthresh < 0xFFFF && htcp_ccount(ca) > 3) {
+ if (icsk->icsk_ca_state == TCP_CA_Open) {
if (ca->maxRTT < ca->minRTT)
ca->maxRTT = ca->minRTT;
if (ca->maxRTT < srtt
* Re: [REGRESSION] tg3 dead after s2ram
From: Joachim Deguara @ 2007-08-03 9:47 UTC
To: Michael Chan
Cc: David Miller, akpm, linux-kernel, michal.k.k.piotrowski, netdev,
linux-acpi
In-Reply-To: <1186081829.18322.20.camel@dell>
On Thursday 02 August 2007 21:10:29 Michael Chan wrote:
> Alternatively, we can also fix it by calling pci_enable_device() again
> in tg3_open(). But I think it is better to just always save and restore
> in suspend/resume. bnx2.c will also require the same fix.
>
> Thanks Joachim for helping to debug this problem. Please try this
> patch:
Patch works for me.
-Joachim
^ permalink raw reply
* Re: [patch 2/5][RFC] Update net core to use devres.
From: Brandon Philips @ 2007-08-03 9:39 UTC (permalink / raw)
To: Tejun Heo; +Cc: netdev
In-Reply-To: <46B2F1B9.9010605@suse.de>
On 18:13 Fri 03 Aug 2007, Tejun Heo wrote:
> > + p = devres_alloc(devm_free_netdev, 0, GFP_KERNEL);
>
> s/0/sizeof(*p)/
Oops! It should have read like this:
+static void * register_netdev_devres(struct device *gendev,
+ struct net_device *dev)
+{
+ void *p;
+
+ /* 0 size because we don't need it. The net_device is already alloc'd
+ * in alloc_netdev_mq. We can't use devm_kzalloc in alloc_netdev_mq
+ * because a net_device cannot be free'd directly as it can be a
+ * kobject. See free_netdev.
+ */
+ p = devres_alloc(devm_free_netdev, 0, GFP_KERNEL);
+
+ if (unlikely(!p))
+ return NULL;
+
+ devres_add(gendev, p);
+
+ return dev;
+}
I will send the full, corrected patch.
Thanks,
Brandon
^ permalink raw reply
* Re: [patch 5/5][RFC] Update e1000 driver to use devres.
From: Tejun Heo @ 2007-08-03 9:35 UTC (permalink / raw)
To: Brandon Philips; +Cc: netdev, teheo, Brandon Philips
In-Reply-To: <20070802224552.GG5181@ifup.org>
On Thu, Aug 02, 2007 at 03:45:52PM -0700, Brandon Philips wrote:
> if ((err = pci_request_regions(pdev, e1000_driver_name)))
> - goto err_pci_reg;
> + goto err_dma;
Why not just return? Ditto for all goto err_dma's.
> err = -EIO;
> - adapter->hw.hw_addr = ioremap(mmio_start, mmio_len);
> + adapter->hw.hw_addr = devm_ioremap(&pdev->dev, mmio_start, mmio_len);
This is a correct conversion, but I have no idea why the original code
did a manual ioremap instead of using pci_iomap().
> - adapter->hw.flash_address = ioremap(flash_start, flash_len);
> + adapter->hw.flash_address = devm_ioremap(&pdev->dev,
> + flash_start,
> + flash_len);
Ditto.
> err_dma:
> pci_disable_device(pdev);
> return err;
err_dma can be killed.
Thanks.
--
tejun
^ permalink raw reply
* Re: [patch 4/5][RFC] Implement devm_kcalloc
From: Tejun Heo @ 2007-08-03 9:20 UTC (permalink / raw)
To: Brandon Philips; +Cc: netdev, teheo, Brandon Philips
In-Reply-To: <20070802224545.GF5181@ifup.org>
On Thu, Aug 02, 2007 at 03:45:45PM -0700, Brandon Philips wrote:
> /**
> + * devm_kcalloc - resource-managed kcalloc
> + * @dev: Device to allocate memory for
> + * @n: number of elements.
> + * @size: element size.
> + * @flags: the type of memory to allocate.
> + */
> +inline void * devm_kcalloc(struct device * dev, size_t n, size_t size,
> + gfp_t flags)
> +{
> + if (n != 0 && size > ULONG_MAX / n)
> + return NULL;
> + return devm_kzalloc(dev, n * size, flags);
> +}
> +EXPORT_SYMBOL_GPL(devm_kcalloc);
Please drop inline. It's meaningless.
Other than that, Acked-by: Tejun Heo <htejun@gmail.com>
--
tejun
^ permalink raw reply
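The overflow guard in the quoted devm_kcalloc can be tried in user space. This is a sketch under assumptions: checked_zalloc_array is a hypothetical name, malloc plus memset stands in for devm_kzalloc, and SIZE_MAX replaces ULONG_MAX because the operands here are size_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a zeroed array of n elements of the given size, refusing the
 * request if n * size would wrap around. Same test as the quoted patch:
 * for n != 0, n * size overflows exactly when size > MAX / n. */
static void *checked_zalloc_array(size_t n, size_t size)
{
	void *p;

	if (n != 0 && size > SIZE_MAX / n)
		return NULL;

	p = malloc(n * size);
	if (p)
		memset(p, 0, n * size);
	return p;
}

/* Example: a request whose byte count cannot fit in size_t is rejected
 * before any allocation is attempted. */
static void checked_zalloc_array_example(void)
{
	assert(checked_zalloc_array(SIZE_MAX, 2) == NULL);
}
```

Without the guard, the multiplication would silently wrap and the caller would receive a buffer much smaller than it asked for.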
* Re: [patch 1/5][RFC] NET: Change pci_enable_device to pci_reenable_device to keep device enable balance
From: Tejun Heo @ 2007-08-03 9:00 UTC (permalink / raw)
To: Brandon Philips; +Cc: netdev, Brandon Philips
In-Reply-To: <20070802224423.GC5181@ifup.org>
Brandon Philips wrote:
> On a slot_reset event pci_disable_device() is never called so calling
> pci_enable_device() will unbalance the enable count.
>
> Signed-off-by: Brandon Philips <bphilips@suse.de>
Acked-by: Tejun Heo <htejun@gmail.com>
--
tejun
^ permalink raw reply
* Re: [patch 3/5][RFC] Update e100 driver to use devres.
From: Tejun Heo @ 2007-08-03 9:18 UTC (permalink / raw)
To: Brandon Philips; +Cc: netdev, teheo, Brandon Philips
In-Reply-To: <20070802224537.GE5181@ifup.org>
On Thu, Aug 02, 2007 at 03:45:37PM -0700, Brandon Philips wrote:
> if((err = pci_request_regions(pdev, DRV_NAME))) {
> DPRINTK(PROBE, ERR, "Cannot obtain PCI resources, aborting.\n");
> - goto err_out_disable_pdev;
> + return err;
> }
>
> if((err = pci_set_dma_mask(pdev, DMA_32BIT_MASK))) {
> DPRINTK(PROBE, ERR, "No usable DMA configuration, aborting.\n");
> - goto err_out_free_res;
> + return err;
> }
>
> SET_MODULE_OWNER(netdev);
> @@ -2613,11 +2606,11 @@ static int __devinit e100_probe(struct p
> if (use_io)
> DPRINTK(PROBE, INFO, "using i/o access mode\n");
>
> - nic->csr = pci_iomap(pdev, (use_io ? 1 : 0), sizeof(struct csr));
> + nic->csr = pcim_iomap(pdev, (use_io ? 1 : 0), sizeof(struct csr));
> if(!nic->csr) {
> DPRINTK(PROBE, ERR, "Cannot map device registers, aborting.\n");
> err = -ENOMEM;
> - goto err_out_free_res;
> + return err;
Calls to pci_request_regions() and pcim_iomap() can be merged into
pcim_iomap_regions().
Other than that, Acked-by: Tejun Heo <htejun@gmail.com>
--
tejun
^ permalink raw reply
* Re: [patch 2/5][RFC] Update net core to use devres.
From: Tejun Heo @ 2007-08-03 9:13 UTC (permalink / raw)
To: Brandon Philips; +Cc: netdev, Brandon Philips
In-Reply-To: <20070802224527.GD5181@ifup.org>
> +static inline void * register_netdev_devres(struct device *gendev,
> + struct net_device *dev)
> +{
> + struct net_device **p;
> +
> + /* 0 size because we don't need it. The net_device is already alloc'd
> + * in alloc_netdev_mq. We can't use devm_kzalloc in alloc_netdeev_mq
> + * because a net_device cannot be free'd directly as it can be a
> + * kobject. See free_netdev.
> + */
> + p = devres_alloc(devm_free_netdev, 0, GFP_KERNEL);
s/0/sizeof(*p)/
--
tejun
^ permalink raw reply
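One reading of the s/0/sizeof(*p)/ remark: the size passed to devres_alloc() is the size of the data area handed to the release callback, so a resource that must remember a net_device pointer needs room for that pointer. The toy model below illustrates this in user space. Every toy_* name is hypothetical and the list handling is a simplification, not the kernel's devres implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_device;
typedef void (*toy_release_t)(struct toy_device *dev, void *res);

/* Bookkeeping header followed by the caller's data area. */
struct toy_devres {
	struct toy_devres *next;
	toy_release_t release;
	unsigned char data[];
};

struct toy_device { struct toy_devres *head; };
struct toy_netdev { int freed; };

/* Returns a pointer to a zeroed data area of 'size' bytes. */
static void *toy_devres_alloc(toy_release_t release, size_t size)
{
	struct toy_devres *dr = calloc(1, sizeof(*dr) + size);

	if (!dr)
		return NULL;
	dr->release = release;
	return dr->data;
}

/* Attach a resource to the device; the newest entry goes first. */
static void toy_devres_add(struct toy_device *dev, void *res)
{
	struct toy_devres *dr;

	assert(res != NULL);
	dr = (struct toy_devres *)((unsigned char *)res -
				   offsetof(struct toy_devres, data));
	dr->next = dev->head;
	dev->head = dr;
}

/* Run every release callback, most recently added first. */
static void toy_devres_release_all(struct toy_device *dev)
{
	while (dev->head) {
		struct toy_devres *dr = dev->head;

		dev->head = dr->next;
		dr->release(dev, dr->data);
		free(dr);
	}
}

/* The callback finds the netdev through the pointer stored in the data area. */
static void toy_free_netdev(struct toy_device *dev, void *res)
{
	struct toy_netdev *nd = *(struct toy_netdev **)res;

	(void)dev;
	nd->freed = 1;
}

static int toy_register_netdev(struct toy_device *dev, struct toy_netdev *nd)
{
	/* sizeof(*p): room for exactly one pointer in the data area. */
	struct toy_netdev **p = toy_devres_alloc(toy_free_netdev, sizeof(*p));

	if (!p)
		return -1;
	*p = nd;
	toy_devres_add(dev, p);
	return 0;
}
```

With a size of 0 there would be no data area in which to store the pointer, so the callback in this model would have no way to locate the netdev it is supposed to free.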
* Re: [patch] genirq: fix simple and fasteoi irq handlers
From: Jarek Poplawski @ 2007-08-03 9:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Gabriel C, Linus Torvalds, Thomas Gleixner, Jean-Baptiste Vignaud,
linux-kernel, shemminger, linux-net, netdev, Andrew Morton,
Alan Cox, marcin.slusarz
In-Reply-To: <20070803080408.GA12222@elte.hu>
On Fri, Aug 03, 2007 at 10:04:08AM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <jarkao2@o2.pl> wrote:
>
> > I can't guarantee this is all needed to fix this bug, but I think this
> > patch is necessary here.
>
> hmmm ... very interesting! Now _this_ is something we'd like to see
> tested. Could you send a patch to Marcin that also undoes the workaround
> we have in place now, so that he could check whether ne2k-pci works fine
> with your fix alone?
I'm not sure this is needed... Marcin got this patch, I hope, and I
have no other way to contact him. Since he managed the bisection and
all the previous patches, I don't think there will be any problems,
so:
Marcin! I'd be very glad if you could test this patch alone; it
should apply without any problems to 2.6.21 (with some offset) and
later "vanilla" versions (or try to revert Ingo's last patch with
patch -p1 -R). Please contact me if there are any problems (alas, not
during the weekend...).
Thanks,
Jarek P.
PS: of course, I'm very curious about this testing too, but, on the
other hand, as I've written earlier, I think this patch is needed for
logical reasons only, and it really doesn't look like it could do any
damage here.
^ permalink raw reply
* Re: [patch 0/5][RFC] Update network drivers to use devres
From: Stephen Hemminger @ 2007-08-03 8:58 UTC (permalink / raw)
To: Brandon Philips; +Cc: netdev, teheo
In-Reply-To: <20070802224206.GB5181@ifup.org>
On Thu, 2 Aug 2007 15:42:06 -0700
Brandon Philips <brandon@ifup.org> wrote:
> This patch set adds support for devres in the net core and converts the
> e100 and e1000 drivers to devres. Devres is a simple resource manager
> for device drivers, see Documentation/driver-model/devres.txt for more
> information.
>
> The use of devres will remain optional for drivers with this patch set.
> Drivers can be converted when it makes sense.
The mere existence of devres is not sufficient motivation to change.
It seems that devres was a band-aid rather than fixing storage drivers
to have proper DMA lifetimes.
Network devices seem to work fine, thanks, and the resource requirements
are different. If it ain't broke, don't fix it.
^ permalink raw reply