Kernel Traffic #328 For 19 Sep 2005

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 2152 posts in 13MB. See the Full Statistics.

There were 665 different contributors. 260 posted more than once. The average length of each message was 106 lines.

The top posters of the week were:

85 posts in 2MB by luben tuikov
71 posts in 278KB by andi kleen
61 posts in 248KB by [email protected]
58 posts in 329KB by sam ravnborg
47 posts in 229KB by john w. linville

The top subjects of the week were:

100 posts in 414KB for "rfc: i386: kill !4kstacks"
61 posts in 373KB for "[patch 1/3] dynticks - implement no idle hz for x86"
56 posts in 227KB for "gfs, what's remaining"
44 posts in 192KB for "[linux-cluster] re: gfs, what's remaining"
39 posts in 245KB for "[patch 2.6.13 5/14] sas-class: sas_discover.c discover process"

These stats generated by mboxstats version 2.8

1. Status Of Merging GFS2 Into Mainline

1 Sep 2005 - 14 Sep 2005 (110 posts) Archive Link: "GFS, what's remaining"

Topics: Disk Arrays: LVM, FS: NFS, FS: NTFS, FS: ReiserFS, FS: XFS, FS: ext2, FS: ext3, FS: sysfs, Ioctls, POSIX

People: David Teigland, Daniel Phillips, Andrew Morton, Alan Cox, Christoph Hellwig, Lars Marowsky-Bree, Andi Kleen, Mark Fasheh, Joel Becker, Kurt C. Hackel, Patrick Caulfield, Arjan van de Ven, Pekka Enberg

David Teigland said:

this is the latest set of gfs patches, it includes some minor munging since the previous set. Andrew, could this be added to -mm? there's not much in the way of pending changes.

I'd like to get a list of specific things remaining for merging. I believe we've responded to everything from earlier reviews, they were very helpful and more would be excellent. The list begins with one item from before that's still pending:

Arjan van de Ven offered concrete criticisms on the patches themselves, pointing out races and other problems; and David and others discussed these. Elsewhere, Pekka Enberg pointed out that the requirement to walk VMA lists wasn't just a case of "some not liking it", but would actually prevent GFS from working properly with other clustered filesystems. And Daniel Phillips also brought some perspective to the whole prospect of a GFS merge into mainline, saying:

Where are the benchmarks and stability analysis? How many hours does it survive cerberus running on all nodes simultaneously? Where are the testimonials from users? How long has there been a gfs2 filesystem? Note that Reiser4 is still not in mainline a year after it was first offered; why do you think gfs2 should be in mainline after one month?

So far, all catches are surface things like bogus spinlocks. Substantive issues have not even begun to be addressed. Patience please, this is going to take a while.

Andrew Morton also asked for answers to a few basic questions:

I don't recall seeing much discussion or exposition of

Alan Cox pointed out what he felt was a simple answer to all these questions: "people actively use it and have been for some years. Same reason we have NTFS, HPFS, and all the others. On that alone it makes sense to include." But Christoph Hellwig remarked, "That's GFS. The submission is about a GFS2 that's on-disk incompatible to GFS." Alan replied:

Just like, say, reiserfs3 and reiserfs4, or ext and ext2, or ext2 and ext3 then. I think the main point still stands - we have always taken multiple file systems on board and we have benefitted enormously from having the competition between them instead of a diktat from the kernel kremlin that 'foofs is the one true way'.

Competition will decide if OCFS or GFS is better, or indeed if someone comes along with another contender that is better still. And competition will probably get the answer right.

The only thing that is important is we don't end up with each cluster fs wanting different core VFS interfaces added.

Lars Marowsky-Bree was not as trusting in the virtues of competition, pointing out that "Competition will come up with the same situation like reiserfs and ext3 and XFS, namely that they'll all be maintained going forward because of, uhm, political constraints ;-)" But he also affirmed, "as long as they _are_ maintained and play along nicely with each other (which, btw, is needed already so that at least data can be migrated...), I don't really see a problem of having two or three." He also agreed that requiring different core VFS interfaces would be unacceptable.

Andrew reiterated his question, saying he was looking for technical reasons in favor of inclusion. David offered:

GFS is an established fs, it's not going away, you'd be hard pressed to find a more widely used cluster fs on Linux. GFS is about 10 years old and has been in use by customers in production environments for about 5 years. It is a mature, stable file system with many features that have been technically refined over years of experience and customer/user feedback. The latest development cycle (GFS2) has focussed on improving performance, it's not a new file system -- the "2" indicates that it's not ondisk compatible with earlier versions.

OCFS2 is a new file system. I expect they'll want to optimize for their own unique goals. When OCFS appeared everyone I know accepted it would coexist with GFS, each in their niche like every other fs. That's good, OCFS and GFS help each other technically even though they may eventually compete in some areas (which can also be good.)

Here's a random summary of technical features:

Arjan short-circuited any discussion of these particular features, pointing out that David's description referred to GFS, not to GFS2 which, as others had already pointed out, was not compatible. David replied:

Just a new version, not a big difference. The ondisk format changed a little making it incompatible with the previous versions. We'd been holding out on the format change for a long time and thought now would be a sensible time to finally do it.

This is also about timing things conveniently. Each GFS version coincides with a development cycle and we decided to wait for this version/cycle to move code upstream. So, we have new version, format change, and code upstream all together, but it's still the same GFS to us.

As with _any_ new version (involving ondisk formats or not) we need to thoroughly test everything to fix the inevitable bugs and regressions that are introduced, there's nothing new or surprising about that.

About the name -- we need to support customers running both versions for a long time. The "2" was added to make that process a little easier and clearer for people, that's all. If the 2 is really distressing we could rip it off, but there seem to be as many file systems ending in digits as not these days...

Daniel asked what the on-disk format change was all about, but there was no reply to that post. Elsewhere, various folks made serious efforts to answer his request for technical reasons for or against inclusion. Andi Kleen kicked off that branch of discussion, saying to Andrew:

There seems to be clearly a need for a shared-storage fs of some sort for HA clusters and virtualized usage (multiple guests sharing a partition). Shared storage can be more efficient than network file systems like NFS because the storage access is often more efficient than network access and it is more reliable because it doesn't have a single point of failure in the form of the NFS server.

It's also a logical extension of the "failover on failure" clusters many people run now - instead of only failing over the shared fs at failure and keeping one machine idle the load can be balanced between multiple machines at any time.

One argument to merge both might be that nobody really knows yet which shared-storage file system (GFS or OCFS2) is better. The only way to find out would be to let the user base try out both, and that's most practical when they're merged.

Personally I think ocfs2 has nicer & cleaner code than GFS. It seems to be more or less a 64bit ext3 with cluster support, while GFS seems to reinvent a lot more things and has somewhat uglier code. On the other hand GFS' cluster support seems to be more aimed at being a universal cluster service open for other usages too, which might be a good thing. OCFS2's cluster support seems to be more aimed at only serving the file system.

But which one works better in practice is really an open question.

The only thing that should be probably resolved is a common API for at least the clustered lock manager. Having multiple incompatible user space APIs for that would be sad.

Andi's term "clustered lock manager" is more commonly known as "distributed lock manager" or DLM. This was the term taken up for the rest of the discussion, and it became the primary focus as well. In this light, Daniel Phillips replied to Andi:

The only current users of dlms are cluster filesystems. There are zero users of the userspace dlm api. Therefore, the (g)dlm userspace interface actually has nothing to do with the needs of gfs. It should be taken out of the gfs patch and merged later, when or if user space applications emerge that need it. Maybe in the meantime it will be possible to come up with a userspace dlm api that isn't completely repulsive.

Also, note that the only reason the two current dlms are in-kernel is because it supposedly cuts down on userspace-kernel communication with the cluster filesystems. Then why should a userspace application bother with an awkward interface to an in-kernel dlm? This is obviously suboptimal. Why not have a userspace dlm for userspace apps, if indeed there are any userspace apps that would need to use dlm-style synchronization instead of more typical socket-based synchronization, or Posix locking, which is already exposed via a standard api?

There is actually nothing wrong with having multiple, completely different dlms active at the same time. There is no urgent need to merge them into the one true dlm. It would be a lot better to let them evolve separately and pick the winner a year or two from now. Just think of the dlm as part of the cfs until then.

What does have to be resolved is a common API for node management. It is not just cluster filesystems and their lock managers that have to interface to node management. Below the filesystem layer, cluster block devices and cluster volume management need to be coordinated by the same system, and above the filesystem layer, applications also need to be hooked into it. This work is, in a word, incomplete.

Close by, Mark Fasheh also said to Andi, "As far as userspace dlm apis go, dlmfs already abstracts away a large part of the dlm interaction, so writing a module against another dlm looks like it wouldn't be too bad (startup of a lockspace is probably the most difficult part there)." Daniel asked why SysFS would not work just as well for this, and Wim Coekaerts replied cryptically that the two were totally different. Daniel replied:

You create a dlm domain when a directory is created. You create a lock resource when a file of that name is opened. You lock the resource when the file is opened. You access the lvb by read/writing the file. Why doesn't that fit the configfs-nee-sysfs model? If it does, the payoff will be about 500 lines saved.

This little dlm fs is very slick, but grossly inefficient. Maybe efficiency doesn't matter here since it is just your slow-path userspace tools taking these locks. Please do not even think of proposing this as a way to export a kernel-based dlm for general purpose use!

Your userdlm.c file has some hidden gold in it. You have factored the dlm calls far more attractively than the bad old bazillion-parameter Vaxcluster legacy. You are almost in system call zone there. (But note my earlier comment on dlms in general: until there are dlm-based applications, merging a general-purpose dlm API is pointless and has nothing to do with getting your filesystem merged.)

Andrew agreed that "Daniel is asking a legitimate question." He went on, "If there's duplicated code in there then we should seek to either make the code multi-purpose or place the common or reusable parts into a library somewhere. If neither approach is applicable or practical for *every single function* then fine, please explain why. AFAIR that has not been done." Joel Becker replied:

Regarding sysfs and configfs, that's a whole 'nother conversation. I've not yet come up with a function involved that is identical, but that's a response here for another email.

Understanding that Daniel is talking about dlmfs, dlmfs is far more similar to devptsfs, tmpfs, and even sockfs and pipefs than it is to sysfs. I don't see him proposing that sockfs and devptsfs be folded into sysfs.

dlmfs is *tiny*. The VFS interface is less than his claimed 500 lines of savings. The few VFS callbacks do nothing but call DLM functions. You'd have to replace this VFS glue with sysfs glue, and probably save very few lines of code.

In addition, sysfs cannot support the dlmfs model. In dlmfs, mkdir(2) creates a directory representing a DLM domain and mknod(2) creates the user representation of a lock. sysfs doesn't support mkdir(2) or mknod(2) at all.

More than mkdir() and mknod(), however, dlmfs uses open(2) to acquire locks from userspace. O_RDONLY acquires a shared read lock (PR in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock is released via close(2). If a process dies, close(2) happens. In other words, ->release() handles all the cleanup for normal and abnormal termination.

sysfs does not allow hooking into ->open() or ->release(). So this model, and the inherent lifetiming that comes with it, cannot be used. If dlmfs was changed to use a less intuitive model that fits sysfs, all the handling of lifetimes and cleanup would have to be added. This would make it more complex, not less complex. It would give it a larger code size, not a smaller one. In the end, it would be harder to maintain, less intuitive to use, and larger.
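The open(2)-flags-to-lock-mode mapping Joel describes can be written out explicitly. In this sketch the helper name is invented for illustration, and the mode names (PR for shared/protected read, EX for exclusive) follow the VMS parlance used in the quote; it is not part of any real dlmfs API:

```python
import os

def dlm_mode_for_open(flags):
    """Map open(2) flags to a (lock mode, trylock) pair, per the
    dlmfs scheme: O_RDWR -> exclusive, otherwise shared read;
    O_NONBLOCK turns the acquisition into a trylock."""
    trylock = bool(flags & os.O_NONBLOCK)
    if flags & os.O_RDWR:
        return ("EX", trylock)   # exclusive lock ("X" in the quote)
    return ("PR", trylock)       # protected (shared) read lock

# O_RDONLY is 0, so a plain open requests a shared lock.
assert dlm_mode_for_open(os.O_RDONLY) == ("PR", False)
assert dlm_mode_for_open(os.O_RDWR | os.O_NONBLOCK) == ("EX", True)
```

The corresponding close(2) is the unlock, which is what gives dlmfs its automatic cleanup on process death.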

The DLM debate and its relationship to GFS acceptance became very technical, with many tendrils of discussion, that did not lead to any clear conclusion, in spite of the fact that Andrew was a very active participant in leading the discussion. The closest thing to a decision that came out of the discussion came when David, who'd opened the whole discussion, said that GFS depended on the full DLM API, and would find it impractical to rely on anything else. He said, "We export our full dlm API through read/write/poll on a misc device. All user space apps use the dlm through a library as you'd expect. The library communicates with the dlm_device kernel module through read/write/poll and the dlm_device module talks with the actual dlm: linux/drivers/dlm/device.c If there's a better way to do this, via a pseudo fs or not, we'd be pleased to try it." Andrew replied, "inotify did that for a while, but we ended up going with a straight syscall interface. How fat is the dlm interface? ie: how many syscalls would it take?" David replied that only 4 functions would be needed: create_lockspace(), release_lockspace(), lock(), and unlock(). Kurt C. Hackel from Oracle replied:

FWIW, it looks like we can agree on the core interface. ocfs2_dlm exports essentially the same functions:


I also implemented dlm_migrate_lockres() to explicitly remaster a lock on another node, but this isn't used by any callers today (except for debugging purposes). There is also some wiring between the fs and the dlm (eviction callbacks) to deal with some ordering issues between the two layers, but these could go if we get stronger membership.

There are quite a few other functions in the "full" spec(1) that we didn't even attempt, either because we didn't require direct user<->kernel access or we just didn't need the function. As for the rather thick set of parameters expected in dlm calls, we managed to get dlmlock down to *ahem* eight, and the rest are fairly slim.

Looking at the misc device that gfs uses, it seems like there is a pretty much complete interface to the same calls you have in kernel, validated on the write() calls to the misc device. With dlmfs, we were seeking to lock down and simplify user access by using standard ast/bast/unlockast calls, using a file descriptor as an opaque token for a single lock, letting the vfs lifetime on this fd help with abnormal termination, etc. I think both the misc device and dlmfs are helpful and not necessarily mutually exclusive, and probably both are better approaches than exporting everything via loads of syscalls (which seems to be the VMS/opendlm model).

Andrew liked the 4 syscall requirement, saying, "Neat. I'd be inclined to make them syscalls then. I don't suppose anyone is likely to object if we reserve those slots." Daniel cautioned that the function parameters might be a bit ugly, but David said it was likely there would be no more than 2 or 3 for any of them. But Alan Cox spoke out vehemently against this whole course of action. He said:

If the locks are not file descriptors then answer the following:

and that's for starters...

Every so often someone decides that a deeply un-unix interface with new syscalls is a good idea. Every time history proves them totally bonkers. There are cases for new system calls but this doesn't seem one of them.

Look at system 5 shared memory, look at system 5 ipc, and so on. You can't use common interfaces on them, you can't select on them, you can't sanely pass them by fd passing.

All our existing locking uses the following behaviour

        fd = open(namespace, options)
        fcntl(.. lock ...)
        fcntl(.. unlock ...)

Unfortunately some people here seem to have forgotten WHY we do things this way.

  1. The semantics of file descriptors are well understood by users and by programs. That makes programming easier and keeps code size down
  2. Everyone knows how close() works including across fork
  3. FD passing is an obscure art but understood and just works
  4. Poll() is a standard understood interface
  5. Ownership of files is a standard model
  6. FD passing across fork/exec is controlled in a standard way
  7. The semantics for threaded applications are defined
  8. Permissions are a standard model
  9. Audit just works with the same tools
  10. SELinux just works with the same tools
  11. I don't need specialist applications to see the system state (the whole point of sysfs yet someone wants to break it all again)
  12. fcntl fd locking is a posix standard interface with precisely defined semantics. Our extensions including leases are very powerful
  13. And yes - fcntl fd locking supports mandatory locking too. That also is standards based with precise semantics.

Everyone understands how to use the existing locking operations. So if you use the existing interfaces, with some small extensions if necessary, everyone understands how to use cluster locks. Isn't that neat....
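The open/fcntl pattern Alan sketches is directly runnable today with POSIX record locks. A minimal sketch, using Python's fcntl module as a thin wrapper over the same fcntl(2) calls (the file is a throwaway stand-in for a lock namespace):

```python
import fcntl
import tempfile

# fd = open(namespace, options)
with tempfile.NamedTemporaryFile() as f:
    # fcntl(.. lock ...): exclusive lock, non-blocking (a trylock);
    # this raises OSError if another process already holds the lock.
    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    locked = True

    # ... critical section: the lock is tied to the open descriptor ...

    # fcntl(.. unlock ...); close(2) would also drop it, which is the
    # automatic-cleanup-on-exit behaviour Alan's list relies on.
    fcntl.lockf(f, fcntl.LOCK_UN)
```

Because the lock rides on the file descriptor, every property in Alan's list (close-on-death, fd passing, poll, ownership, permissions) comes for free.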

Andrew disagreed that the new syscalls would be such grave violations. He pointed out, "David said that 'We export our full dlm API through read/write/poll on a misc device.' That miscdevice will simply give us an fd. Hence my suggestion that the miscdevice be done away with in favour of a dedicated syscall which returns an fd." Alan didn't reply.

At right around this point, Patrick Caulfield got home from vacation, and threw out his take on things:

let me tell you what we do now and why, and let's see what's wrong with it.

Currently the library create_lockspace() call returns an FD upon which all lock operations happen. The FD is onto a misc device, one per lockspace, so if you want lockspace protection it can happen at that level. There is no protection applied to locks within a lockspace nor do I think it's helpful to do so to be honest. Using a misc device limits you to <255 lockspaces depending on the other uses of misc but this is just for userland-visible lockspace - it does not affect GFS filesystems for instance.

Lock/convert/unlock operations are done using write calls on that lockspace FD. Callbacks are implemented using poll and read on the FD, read will return data blocks (one per callback) as long as there are active callbacks to process. The current read functionality behaves more like a SOCK_PACKET than a data stream which some may not like but then you're going to need to know what you're reading from the device anyway.

ioctl/fcntl isn't really useful for DLM locks because you can't do asynchronous operations on them - the lock has to succeed or fail in the one operation - if you want a callback for completion (or blocking notification) you have to poll the lockspace FD anyway and then you might as well go back to using read and write because at least they are something of a matched pair. Something similar applies, I think, to a syscall interface.

Another reason the existing fcntl interface isn't appropriate is that it's not locking the same kind of thing. Current Unix fcntl calls lock byte ranges. DLM locks arbitrary names and has a much richer list of lock modes. Adding another fcntl just runs into the problems mentioned above.

The other reason we use read for callbacks is that there is information to be passed back: lock status, value block and (possibly) query information.

While having an FD per lock sounds like a nice unixy idea, I don't think it would work very well in practice. Applications with hundreds or thousands of locks (such as databases) would end up with huge pollfd structs to manage, and while it helps the refcounting (currently the nastiest bit of the current dlm_device code), it removes the possibility of having persistent locks that exist after the process exits - a handy feature that some people do use, though I don't think it's in the currently submitted DLM code. One FD per lock also gives each lock two handles, the lock ID used internally by the DLM and the FD used externally by the application, which I think is a little confusing.

I don't think a dlmfs is useful, personally. The features you can export from it are either minimal compared to the full DLM functionality (so you have to export the rest by some other means anyway) or are going to be so un-filesystemlike as to be very awkward to use. Doing lock operations in shell scripts is all very cool but how often do you /really/ need to do that?

I'm not saying that what we have is perfect - far from it - but we have thought about how this works and what we came up with seems like a good compromise between providing full DLM functionality to userspace using unix features. But we're very happy to listen to other ideas - and have been doing I hope.
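The poll-and-read callback scheme Patrick describes can be mimicked with any pollable descriptor. In this sketch a pipe stands in for the per-lockspace misc device, and the callback record format is invented purely for illustration:

```python
import os
import select

# A pipe substitutes for the lockspace FD of the real dlm_device.
r, w = os.pipe()

# The kernel side would write one discrete record per completed or
# blocked lock operation; this models an AST ("lock granted") event.
os.write(w, b"GRANTED resource-A\0")

p = select.poll()
p.register(r, select.POLLIN)

callbacks = []
for fd, ev in p.poll(1000):          # wait up to 1s for callbacks
    if ev & select.POLLIN:
        # One read per callback record: SOCK_PACKET-like framing, as
        # Patrick notes, rather than a byte stream.
        callbacks.append(os.read(fd, 256))

os.close(r)
os.close(w)
assert callbacks == [b"GRANTED resource-A\0"]
```

This is also why fcntl and plain syscalls fit poorly here: the grant arrives asynchronously, so the application needs a descriptor it can poll regardless of how the request was submitted.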

The discussion ended here, with no certain conclusion, though Andrew's syscall preference may hold sway.

2. Review Period In Preparation For

7 Sep 2005 - 9 Sep 2005 (14 posts) Archive Link: "[PATCH 0/9] -stable review"

Topics: Assembly, Digital Video Broadcasting, Networking, PCI, Power Management: ACPI

People: Chris Wright, Stephen Hemminger, James Bottomley, David S. Miller, Benjamin Herrenschmidt, Alexander Viro, Theodore Ts'o, Linus Torvalds, Randy Dunlap, Alan Cox, Mark Haverkamp, David Woodhouse, Patrick McHardy, Andrew Morton, Zwane Mwaikambo

Chris Wright said:

This is the start of the stable review cycle for the release. There are 9 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let us know. If anyone is a maintainer of the proper subsystem, and wants to add a signed-off-by: line to the patch, please respond with it.

These patches are sent out with a number of different people on the Cc: line. If you wish to be a reviewer, please email [email protected] to add your name to the list. If you want to be off the reviewer list, also email us.

The Cc list contained Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap, Chuck Wolber, Linus Torvalds, Andrew Morton, and Alan Cox, in addition to the linux-kernel mailing list itself.

Each of Chris's replies had a single patch, with these changelog entries:

3. Some Advice For Upgrading From 2.4 To 2.6

8 Sep 2005 - 9 Sep 2005 (5 posts) Archive Link: "How to plan a kernel update ?"

People: Michael Thonke, Jesper Juhl

For his job, Weber Ress had to lead a team of engineers in upgrading the kernel from 2.4 to 2.6 on many servers. He asked for advice. Michael Thonke suggested that "google is your best friend and first source for it," and gave a link to William von Hagen's article on the subject. Jesper Juhl also said:

I do upgrade a lot of kernels, so I'll tell you a little about what I do and what I'd recommend. Then you can do with that info what you like :)

The very first thing you want to do is to ensure that all core utilities/tools are up-to-date to versions that will work with your new kernel.

If you download a copy of the 2.6.13 kernel source, extract it, and look in the file Documentation/Changes you'll see a list of tools and utils along with the minimum required version for them to work properly with that kernel. Ensure those tools are OK.

Once you are sure the core utils are up-to-date you need to go check whatever other important programs you have on the machine(s) and check that those are also able to run OK with the new kernel.

Once you are satisfied that everything is up to a level that'll work with the new kernel you can go build the new 2.6.13 kernel and drop it in place. You don't need to remove your existing kernel first, you can just install the 2.6.13 kernel side by side with the old one and test boot it, then if it doesn't work right you can always reboot back to the old one.

Most likely you can find documentation for your distribution stating what version of it is "2.6 ready" - I use Slackware for example, and Slackware 10.1 is completely 2.6 kernel ready, so on a Slackware 10.1 box there's no hassle at all, I just drop in a 2.6 kernel in place of the 2.4 one it installs by default and everything is good - all tools are already ready to cope.
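Jesper's Documentation/Changes step boils down to comparing installed tool versions against the kernel's stated minimums. A sketch of that check follows; the minimum versions shown are examples from the 2.6-era list and should be re-read from the actual Documentation/Changes file in your kernel source:

```python
def version_tuple(v):
    """Turn a dotted version string into a comparable tuple."""
    return tuple(int(x) for x in v.split("."))

# Example minimums in the style of Documentation/Changes (verify
# against your actual kernel tree before relying on them).
REQUIRED = {
    "gcc": "2.95.3",
    "binutils": "2.12",
    "module-init-tools": "0.9.10",
}

def too_old(installed):
    """Return the tools whose installed version misses the minimum."""
    return [tool for tool, minimum in REQUIRED.items()
            if version_tuple(installed.get(tool, "0")) < version_tuple(minimum)]

stale = too_old({"gcc": "3.3.6", "binutils": "2.15",
                 "module-init-tools": "0.9.9"})
assert stale == ["module-init-tools"]
```

In practice you would fill the `installed` dict from `gcc --version` and friends; the comparison logic is the part worth getting right before booting the new kernel.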

4. Status Of Serial SCSI; Some Dispute Over Direction

9 Sep 2005 - 13 Sep 2005 (6 posts) Archive Link: "[ANNOUNCE 0/2] Serial Attached SCSI (SAS) support for the Linux kernel"

Topics: Disks: SCSI, FS: sysfs, Hot-Plugging, Ioctls, Ottawa Linux Symposium, SMP, Serial ATA

People: Luben Tuikov, Christoph Hellwig, Andrew Patterson, Douglas Gilbert

Luben Tuikov from Adaptec said:

The following announcements and patches introduce Serial Attached SCSI (SAS) support for the Linux kernel. Everything is supported.

The infrastructure is broken into

The SAS LLDD does phy/OOB management, and generates SAS events to the SAS Layer. Those events are *the only way* a SAS LLDD communicates with the SAS Layer. If you can generate 2 types of event, then you can use this infrastructure. The first two are, loosely, "link was severed", "bytes were dmaed". The third kind is "received a primitive", used for domain revalidation.

A SAS LLDD should implement the Execute Command SCSI RPC and at least one SCSI TMF (Task Management Function), in order for the SAS Layer to communicate with the SAS LLDD.

The SAS Layer is concerned with

The SAS Layer uses the Execute Command SCSI RPC, and the TMFs implemented by the SAS LLDD in order to manage the domain and the domain devices.

For details please see drivers/scsi/sas-class/README.

The SAS Layer represents the SAS domain in sysfs. For each object represented, its parent is the physical entity it attaches to in the physical world. So in effect, kobject_get gets the whole chain on which that object depends.

In effect, the sysfs representation of the SAS domain(s) is what you'd see in the physical world.

Hot plugging and hot unplugging of devices, domains and subdomains is supported. Repeated hot plugging and hot unplugging is also supported, naturally.

SAS introduces a new physical entity, an expander. Expanders are _not_ SAS devices, and thus are _not_ SCSI devices. Expanders are part of the Service Delivery Subsystem, in this case SAS.

Expanders are controlled using the Serial Management Protocol (SMP). Complete control is given to user space of all expanders found in the domain, using an "smp_portal". More of this in the second and third email in this series.

A user space program, "expander_conf.c", is also presented to show how one controls expanders in the domain. It is located here: drivers/scsi/sas-class/expanders_conf.c

The second email in this series shows an example of SAS domains and their representation in sysfs.

The third email in this series shows an example of using the "expander_conf.c" program to query all expanders in the domain, showing their attributes, their phys, and their routing tables.

If you have the hardware, please give it a try. If you have expander(s) it would be even more interesting.

Patches of the SAS Layer and of the AIC94XX SAS LLDD follow.

You can also download the patches from
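Luben's point that taking a reference on a sysfs object pins the whole parent chain (disk depends on expander depends on host) can be modelled in a few lines. This is a toy model for illustration only, not the kernel's actual kobject implementation (which takes the parent reference once at creation rather than on every get):

```python
class KObject:
    """Toy refcounted object whose get/put walks the parent chain."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.refcount = 0

    def get(self):
        self.refcount += 1
        if self.parent:
            self.parent.get()   # pin everything this object hangs off
        return self

    def put(self):
        self.refcount -= 1
        if self.parent:
            self.parent.put()

# Mirror of the sysfs layout described: each object's parent is the
# physical entity it attaches to (names are made up for the example).
host = KObject("host0")
expander = KObject("expander-0:0", parent=host)
disk = KObject("disk-0:0:2", parent=expander)

disk.get()
assert (disk.refcount, expander.refcount, host.refcount) == (1, 1, 1)
disk.put()
assert host.refcount == 0
```

The consequence is the hotplug behaviour Luben advertises: an ancestor cannot be torn down while anything below it in the domain is still in use.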

Christoph Hellwig said, "At the core it's some really nice code dealing with host-based SAS implementations. What's not nice is that it's not integrating with the SAS transport class I posted, it's duplicating things like LUN discovery from the SCSI core code, and adding its own sysfs representation that's very different from the way the SCSI core and transport classes do it. Are you willing to work with us to integrate it with the infrastructure we have?" Luben replied, "HP and LSI were aware of my efforts since the beginning of the year. As well, you had a copy of my code July 14 this year, long before starting your work on your SAS class for LSI and HP (so its acceptance is guaranteed), after OLS. We did meet at OLS and we did have the SAS BOF. I'm not sure why you didn't want to work together?" He invited Christoph to base future work on Luben's implementation. Andrew Patterson from Hewlett Packard replied, "This effort started in April. Eric Moore, Mike Miller and I started work on a SAS transport class and then later pulled Luben in at the suggestion of Douglas Gilbert (if I remember correctly). We later mutually agreed that Luben would take over the transport class work as he seemed to have much more experience with this sort of thing. The original idea was to implement a SAS transport class that would allow the LSI and Adaptec drivers to get into (or others at the time) and to find a way to get SDI/CSMI APIs into the kernel without the use of IOCTLs. Luben then went off on his own and came up with his effectively Adaptec-only solution." He also added, regarding the OLS BOF, "If my memory serves correctly, there were 10-12 people at that BOF, representing the SCSI kernel maintainers and all of the vendors currently providing SAS hardware. Virtually everyone disagreed with your implementation (which you indeed emailed shortly before the conference) that would only work with one vendor's card. The suggestion was made that you convert your code to various library layers so that it would work with all vendors. A suggestion which it seems that you continue to reject."

5. SBC8360 Watchdog Driver Heading Into Mainline

9 Sep 2005 - 10 Sep 2005 (3 posts) Archive Link: "[WATCHDOG] Push SBC8360 driver upstream"

People: Ian E. Morgan, Andrew Morton, Wim Van Sebroeck

Ian E. Morgan said:

I would like to ask that the SBC8360 watchdog driver be pushed upstream from -mm in time for the 2.6.14-rc series.

I recognise that this driver, like a lot of the watchdog drivers, is for a piece of hardware that is present in only a very small percentage of machines running Linux. I doubt that being in -mm for a long time will make any significant difference to it being more widely tested. The driver is working perfectly as expected on each of the machines we've tested it on.

As a recap, the driver was submitted to akpm, was included in -mm1 (watchdog-new-sbc8360-driver.patch), offloaded to Wim's linux-2.6-watchdog-mm.git tree (commit 88b1f50923d14195ac1a50840fc4aa4066f067a9), and subsequently included in -mm2 by way of the combined git-watchdog.patch.

Please consider merging this driver into 2.6.14-rc1. Thanks.

Andrew Morton replied, "That's in Wim's tree now. Wim, could you please prepare a pull for Linus within the next couple of days?" Wim Van Sebroeck said, "I'm preparing the tree for linus to pull from. Should be there by the end of the weekend. (Will probably contain 6 drivers + some updates of some other drivers)."

6. DevFS Still On The Chopping Block; Users Still Resistant

9 Sep 2005 - 14 Sep 2005 (32 posts) Archive Link: "[GIT PATCH] Remove devfs from 2.6.13"

Topics: FS: devfs, FS: sysfs, Sound: ALSA

People: Greg KH, Mike Bell, David Lang, Valdis Kletnieks

Greg KH, having been stymied in his effort to remove DevFS in time for the 2.6.12 release, now submitted the identical patches against 2.6.13; he hoped this time they would make it in. He added, "Also, if people _really_ are in love with the idea of an in-kernel devfs, I have posted a patch that does this in about 300 lines of code, called ndevfs. It is available in the archives if anyone wants to use that instead (it is quite easy to maintain that patch outside of the kernel tree, due to it only needing 3 hooks into the main kernel tree.)" Mike Bell replied that NDevFS was "broken by design. It creates yet another incompatible naming scheme for devices, and what's worse the devices it breaks are the ones like ALSA and the input subsystem, whose locations are hard-coded into libraries. Unless sysfs is going to get attributes from which the proper names could be derived, it won't ever work." Greg replied that he knew NDevFS wasn't a nice solution; it was just an alternative. He added, "Anyway, I'm not offering it up for inclusion in the kernel tree at all, but for a proof-of-concept for those who were insisting that it was impossible to keep a devfs-like patchset out of the main kernel tree easily."

Elsewhere, David Lang said it was important to be cautious in removing DevFS, because of the dangers of breaking various systems. Greg replied, "Ok, how long should I wait then?" And David said:

if 2.6.13 removed the devfs config option, then I would say the code itself should stay until 2.6.15 or 2.6.16 (if the release schedule does drop down to ~2 months then it would need to be at least .16). especially with so many people afraid of the 2.6 series you need to wait at least one full release cycle, probably two (and possibly more if they end up being short ones) then rip out the rest of the code for the following release.

remember that the distros don't package every kernel, they skip several between releases and it's not going to be until they go to try them that all the kinks will get worked out.

add to this the fact that many people have gotten confused about kernel releases and think that since 13 is odd 2.6.13 is a testing kernel and 2.6.14 will be a stable one and so won't look at .13

note that all this assumes that the issues that people have about sysfs not yet being able to replace the capabilities that they are using in devfs have been addressed.

Greg said he wasn't aware of any major distribution shipping kernels with DevFS enabled. He and Valdis Kletnieks asked if anyone knew of any that did. Bastian Blank said Debian Unstable did, though as someone else pointed out, no one could confuse Debian Unstable with a shippable distribution. Beyond that, no one was able to come up with even a single distribution shipping DevFS.

The thread ended with no hard conclusions about how long DevFS can expect to live in the kernel.

7. New Stable Kernel Released

9 Sep 2005 (2 posts) Archive Link: "Linux"

Topics: Assembly, Digital Video Broadcasting, PCI, Security

People: Chris Wright, Stephen Hemminger, David S. Miller, Benjamin Herrenschmidt, Ivan Kokshaysky, Mark Haverkamp, David Woodhouse

Chris Wright announced a new stable kernel release, saying:

We (the -stable team) are announcing the release of a new stable kernel.

The diffstat and short summary of the fixes are below.

I'll also be replying to this message with a copy of the patch between 2.6.13 and this release, as it is small enough to do so.

The updated 2.6.13.y git tree can be found in the usual place, and can be browsed with the normal git web browser.

He listed the changes since 2.6.13:

Al Viro:
raw_sendmsg DoS (CAN-2005-2492)

Benjamin Herrenschmidt:
Fix PCI ROM mapping

Chris Wright:

David S. Miller:
Use SA_SHIRQ in sparc specific code.

David Woodhouse:
32bit sendmsg() flaw (CAN-2005-2490)

Herbert Xu:
2.6.13 breaks libpcap (and tcpdump)
Fix boundary check in standard multi-block cipher processors

Ivan Kokshaysky:
x86: pci_assign_unassigned_resources() update

Mark Haverkamp:
aacraid: 2.6.13 aacraid bad BUG_ON fix

Michael Krufky:
Kconfig: saa7134-dvb must select tda1004x

Stephen Hemminger:
Reassembly trim not clearing CHECKSUM_HW

8. Status Of Exposing Certain NUMA Data To Userspace

10 Sep 2005 (7 posts) Archive Link: "NUMA mempolicy /proc code in mainline shouldn't have been merged"

People: Andi Kleen, Andrew Morton, Christoph Lameter

Andi Kleen said:

Just noticed the ugly SGI /proc/*/numa_maps code got merged. I argued several times against it and I very deliberately didn't include a similar facility when I wrote the NUMA policy code because it's a bad idea.

Can the patch please be removed?

Andrew Morton said he queued up a patch reversion that should take care of it. Christoph Lameter felt the patch was quite salvageable, and didn't see why it should be reverted. Andrew replied, "If it's useful to application developers then fine. If it's only useful to kernel developers then the argument is weakened. However there's still quite a lot of development going on in this area, so there's still some argument for having the monitoring ability in the mainline tree." Christoph replied:

I still have a hard time seeing how people can accept the line of reasoning that says:

Users are not allowed to know on which nodes the operating system allocated resources for a process and are also not allowed to see the memory policies in effect for the memory areas

Then the application developers have to guess the effect that the memory policies have on memory allocation. For memory alloc debugging the poor app guys must today simply imagine what the operating system is doing. They can see the amount of total memory allocated on a node via other proc entries and then guess based on that which application has taken it. Then they modify their apps and do another run.

My thinking today is that I'd rather leave /proc/<pid>/numa_stats instead of using smaps because the smaps format is a bit verbose and will make it difficult to see the allocation distribution. If we use smaps then we probably need some tool to parse and present information. numa_stats is directly usable.

I have a new series of patches here that does a gradual thing with the policy layer:

  1. Clean up policy layer to properly use node macros instead of bitmaps. Some comments to explain certain limitations of the policy layer.
  2. Clean up policy layer by doing do_xx and sys_xx separation [optional but this separates the dynamic bitmaps in user space from the static node maps in kernel space which I find very helpful]
  3. Add mpol_to_str to policy layer and make numa_stats use mpol_to_str.
  4. Solve the potential access issue when set_mempolicy is updating task->mempolicy while numa_stats are being displayed by taking a writelock on mmap_sem in set_mempolicy. This is in harmony with vma mempolicy updates that also take a lock on mmap_sem and that are already safe to access since numa_stats always takes an mmap_sem readlock. The patch is essentially inserting two lines.

Then I still have these evil intentions of making it possible to dynamically change memory policies from the outside. The minimum that we all need is to at least be able to see what's going on.

Of course we would be happier if we would also be allowed to change policies to control memory allocation. The argument that the layer is not able to handle these is of course true since attempts to fix the issues have been blocked.
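The serialization described in item 4 of Christoph's list is the classic reader/writer pattern: the policy writer takes mmap_sem exclusively, while the stats display path holds it shared. A minimal userspace sketch of the same pattern, using a Python read/write lock built from `threading` primitives (illustrative only; the names mirror the kernel discussion but this is not kernel code):

```python
import threading

class RWLock:
    """Tiny reader/writer lock: many readers, one exclusive writer."""
    def __init__(self):
        self._readers = 0
        self._lock = threading.Lock()        # protects _readers
        self._writer = threading.Lock()      # held while any access is in flight

    def read_acquire(self):
        with self._lock:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()       # first reader blocks writers

    def read_release(self):
        with self._lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()       # last reader lets writers in

    def write_acquire(self):
        self._writer.acquire()               # exclusive, like down_write()

    def write_release(self):
        self._writer.release()

mmap_sem = RWLock()
task_mempolicy = "default"                   # stands in for task->mempolicy

def set_mempolicy_demo(policy):
    """Writer side: update the policy under the exclusive lock."""
    global task_mempolicy
    mmap_sem.write_acquire()
    try:
        task_mempolicy = policy
    finally:
        mmap_sem.write_release()

def show_numa_stats_demo():
    """Reader side: display path takes the lock shared, like down_read()."""
    mmap_sem.read_acquire()
    try:
        return task_mempolicy
    finally:
        mmap_sem.read_release()

set_mempolicy_demo("interleave")
print(show_numa_stats_demo())                # interleave
```

In the kernel the two roles are played by down_write() and down_read() on mm->mmap_sem, which is why taking the write lock in set_mempolicy suffices to make the already read-locked display path safe.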

Andrew began to be swayed by these arguments. He started to favor keeping the patch in, but the debate did not reach any firm conclusion during the thread.
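The per-node view Christoph argues for boils down to parsing lines of `N<node>=<pages>` fields. A minimal parser sketch (the sample line and field layout here are assumptions modeled on the numa_maps format, not data taken from the thread):

```python
def parse_numa_maps_line(line):
    """Split one numa_maps-style line into (address, policy, per-node page counts)."""
    fields = line.split()
    addr, policy = fields[0], fields[1]
    nodes = {}
    for f in fields[2:]:
        # Fields like "N0=2" report how many pages live on each node;
        # other fields ("anon=3", "dirty=3", ...) are ignored here.
        if f.startswith("N") and "=" in f:
            node, pages = f[1:].split("=", 1)
            if node.isdigit():
                nodes[int(node)] = int(pages)
    return addr, policy, nodes

# Hypothetical sample line in the assumed format:
sample = "7f2a5c3000 default anon=3 dirty=3 N0=2 N1=1"
addr, policy, nodes = parse_numa_maps_line(sample)
print(addr, policy, nodes)   # 7f2a5c3000 default {0: 2, 1: 1}
```

A tool like this is what Christoph means by the format being "directly usable": the allocation distribution across nodes falls straight out of the fields without further kernel help.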

9. Status Of eth1394 And SBP2 Maintainership

10 Sep 2005 - 12 Sep 2005 (7 posts) Archive Link: "eth1394 and sbp2 maintainers"


People: Stefan Richter, Ben Collins, Jody McIntyre

Stefan Richter said, "the MAINTAINERS list of Linus' tree is still listing eth1394 and sbp2 as orphaned. This is certainly not correct for sbp2. Is it for eth1394?" He said to Ben Collins, "Ben, I remember you wanted to have your contact added back in, at least for sbp2. In case this should not be true anymore, I'd volunteer for sbp2 maintenance." Ben replied, "I sent a patch to Linus, but I guess it never got added. Stefan, feel free to send a patch adding you as the maintainer." Regarding eth1394, Jody McIntyre said, "I emailed Steve Kinneberg, the last person to do any serious work on the driver, before I made this change, and he's OK with that. If someone else wants to take it, I suggest they submit a patch."

10. Location Of Stable Kernel Pending Patches

13 Sep 2005 - 14 Sep 2005 (3 posts) Archive Link: "Pending -stable patches"

People: Jean Delvare, Michal Piotrowski

Jean Delvare asked:

Is there a place where pending -stable patches can be seen?

Are mails sent to [email protected] archived somewhere?

There seems to be a need for this. For example, there's a patch I would like to see included, but I wouldn't want to report an already known problem.

Michal Piotrowski gave Jean a link to the stable queue shortlog, and Jean replied, "Exactly what I needed. It's bookmarked now. Thanks!"

11. udev 069 Released

13 Sep 2005 - 14 Sep 2005 (4 posts) Archive Link: "[ANNOUNCE] udev 069 release"

Topics: FS: devfs, FS: sysfs, Hot-Plugging

People: Greg KH

Greg KH said:

I've released the 069 version of udev. It can be found in the usual place.

udev allows users to have a dynamic /dev and provides the ability to have persistent device names. It uses sysfs and /sbin/hotplug and runs entirely in userspace. It requires a 2.6 kernel with CONFIG_HOTPLUG enabled to run. Please see the udev FAQ for any questions about it.

For any udev vs devfs questions anyone might have, please see my udev vs devfs document.

And there is also a general udev web page.

Note, I _really_ recommend anyone running 2.6.13 or newer to upgrade to at least the 068 version of udev due to some very nice speed improvements (not to mention the fact that the 2.6.12 kernel requires at least the 058 version of udev.)

There have been lots of good bugfixes and new features added since the last time I announced a udev release, so see the RELEASE-NOTES file for details, and the changelog below.

udev uses git for its source code control system. The main udev git repo can be found in the usual place, and can be browsed online.

12. In Defense Of DevFS

14 Sep 2005 (7 posts) Archive Link: "devfs vs udev FAQ from the other side"

Topics: FS: devfs, FS: sysfs, Hot-Plugging, Small Systems

People: Mike Bell, Greg KH

Mike Bell said:

devfs vs udev: From the other side

Presuppositions (True of both udev and devfs):

  1. Dynamic /dev is the way of the future, and a Good Thing
  2. A single major/minor combination should have only a single device node, its other names should be symlinks. If you don't do this, you break locking on certain classes of applications, among other things.

The above are uncontentious as far as I know. I believe Greg KH has stated both. If you feel otherwise, please explain why.
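Presupposition 2 can be illustrated in a few lines: one real node per major/minor pair, with every alternative name being just a symlink to it. A sketch using a scratch directory and plain files so it runs unprivileged (on a real system the node itself would be created with mknod as root; the names here are hypothetical):

```python
import os
import tempfile

# Scratch directory standing in for /dev.
dev = tempfile.mkdtemp()

# One real node for the major/minor pair (a plain file stands in for
# what mknod(2) would create on a real system).
open(os.path.join(dev, "hda1"), "w").close()

# Every alternative name is merely a symlink to that single node, so
# tools that lock the device by path all end up locking the same file.
os.symlink("hda1", os.path.join(dev, "disc0-part1"))

print(os.readlink(os.path.join(dev, "disc0-part1")))   # hda1
```

Because both names resolve to the same inode, locks taken through either path collide correctly, which is exactly why multiple independent device nodes for one major/minor pair break locking.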


  1. devfs creates device nodes from kernel space, and creates symlinks for alternative names using a userspace helper. udev handles both tasks from user space, by exporting the information through a different kernel-generated filesystem.

devfs advantages over udev:

  1. devfs is smaller
    Hey, I ran the benchmarks, I have numbers, something Greg never gave. Took an actual devfs system of mine and disabled devfs from the kernel, then enabled hotplug and sysfs for udev to run. make clean and surprise surprise, kernel is much bigger. Enable netlink stuff and it's bigger still. udev is only smaller if like Greg you don't count its kernel components against it, even if they wouldn't otherwise need to be enabled. Difference is to the tune of 604164 on udev and 588466 on devfs. Maybe not a lot in some people's books, but a huge difference from the claims of other people that devfs is actually bigger.

    And that's just the kernel. Then because your root is read-only you need an early userspace, and in regular userspace the udev binary, and its data files, all more wasted flash (you can shave it down by removing stuff you don't need, but that's just more work for the busy coder, and udev STILL loses on size).

    On the system in question (a real-world embedded system) the devfs solution requires no userspace helper except for two symlinks which were simply created manually in init, and could have been done away with if necessary.

  2. devfs is faster
    Despite all the many tricks that can be used to speed up udev (static linking, netlink, etc) devfs is still dramatically faster. On a big, bloated, slow-booting distribution system you may not notice so much, but when even your slowest booting systems are interactive in under 5 seconds using devfs, this is quite significant time loss.
  3. devfs uses less memory
    Check free. sysfs alone does udev in and that's just the kernel stuff that's always there.

    Also, the user space stuff may not have to run at all times in all configurations, but on a system without swap and with long-running apps, all that matters is its PEAK memory usage. If my app takes x MB and my kernel takes y MB it doesn't MATTER that udev is only running for one second, I still need more than x+y MB of memory.

udev advantages over devfs:

  1. udev has all sorts of spiffy features
    Sure, but having device nodes exported directly from the kernel in no way stops you from having those spiffy features. The problem is that udev is doing two separate tasks, and it's easy to confuse the one it should be doing with the one it shouldn't.
  2. udev doesn't have policy in kernel space
    Well, that's a bit of a lie. sysfs has even stricter policy in kernel space. What he MEANS is that udev exchanges hard-coded but symlinkable /dev paths for hardcoded sysfs paths, moving the hard-coded kernel policy from one filesystem to another.

    This argument is really the only architectural reason to go with udev. At all. If you really believe that the ability to name your hard drive /dev/foobarbaz is vital, and absolutely can't live with merely having /dev/foobarbaz be a symlink to the real device node, then you need udev. The devfs way of handling this situation was a stupid, racy misfeature and rightly deserves to die horribly.

    That said, read my comments on why flexible /dev naming is actually a bad thing and think very, very carefully about whether you actually want this "feature" at all. Symlinks are your friend.

  3. devfs is ugly
    Part of this is true, and part of this is just the perspective of certain people (Greg has this fascinating world view where code required for devfs is garbage, and code required for udev is core kernel code and doesn't count against udev, which allows him to say udev is smaller.)

    The legitimate comments about devfs being ugly... well, how many subsystems which have been largely untouched for similar periods of time aren't even uglier? TTY stuff? And it's very hard to find a maintainer for a subsystem when it's "obsolete", patches that change its behaviour aren't accepted, and certain people are so vocally opposed to its very existence. Who wants to throw away their time writing code that won't even be considered, only to be hated for it?

  4. devfs is unsupported, udev isn't
    True that. And even people who've tried to maintain devfs get turned away. So unless this document causes a few people to reexamine the need to remove devfs, you can reasonably assume that udev will be the only way to run a linux system very shortly (static /dev is already on its last legs). Me, I'll be disappointed if this happens, because as the above document indicates, I still think kernel-exported /dev is better (and not because I'm a lazy user-space-hater, Greg. :) ).

There was no real discussion in response to this. It looked as though a huge flamewar would erupt after the first few replies, but the thread petered out immediately and vanished.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.