Kernel Traffic

Kernel Traffic #103 For 19 Jan 2001

By Zack Brown

Kernel Traffic will be trying out a Friday schedule, as of last week. Otherwise it's just too tempting to work through the weekend... ;-)

Mailing List Stats For This Week

We looked at 1640 posts in 7373K.

There were 516 different contributors. 261 posted more than once. 176 posted last week too.

The top posters of the week were:

1. Impact Of Sudden Power Loss On Journalled Filesystems

3 Jan 2001 - 9 Jan 2001 (58 posts) Archive Link: "Journaling: Surviving or allowing unclean shutdown?"

Topics: FS: ReiserFS, FS: ext2, FS: ext3, Web Servers

People: Michael Rothwell, Daniel Phillips, Alex Belits, Stephen C. Tweedie, Andreas Dilger, Alan Cox, Stefan Traby, David Woodhouse, Marc Lehmann, David Lang

Dr. David Gilbert was unsure whether journalling filesystems were intended merely to survive the occasional improper shutdown, or whether users should feel comfortable just powering them down as part of normal operation. Michael Rothwell pointed out that journalling filesystems only guaranteed the consistency of data that had been written prior to shutdown; any buffers left unflushed at power-off would be lost, and any applications not properly exited could also do bad things. "Journaling mostly means not having to run FSCK," he said. Daniel Phillips replied to David at greater length:

Welllllll... crashes tend to produce different effects from sudden power interruptions. In the first case parts of the system keep running, and bizarre results are possible. An even bigger difference is the matter of intent.

Tux2 is explicitly designed to legitimize pulling the plug as a valid way of shutting down. Metadata-only journalling filesystems are not designed to be used this way, and even with full-data journalling you should bear in mind that your on-disk filesystem image remains in an invalid state until the journal recovery program has run successfully. You would not want to upgrade your OS with your filesystem in this state, nor would you want to remove a disk drive that didn't have the journal file on it.

Being able to shut down by hitting the power switch is a little luxury for which I've been willing to invest more than a year of my life to attain. Clueless newbies don't know why it should be any other way, and it's essential for embedded devices.

I don't doubt that if the 'power switch' method of shutdown becomes popular we will discover some applications that have windows where they can be hurt by sudden shutdown, even will full filesystem data state being preserved. Such applications are arguably broken because they will behave badly in the event of accidental shutdown anyway, and we should fix them. Well-designed applications are explicitly 'serially reuseable', in other words, you can interrupt at any point and start again from the beginning with valid and expected results.

Alex Belits strongly disagreed that applications should be considered broken if they mis-handled sudden shutdowns. He said:

All valid ways to shut down the system involve sending SIGTERM to running applications -- only broken ones would live long enough after that to be killed by subsequent SIGKILL.

A lot of applications always rely on their file i/o being done in some manner that has atomic (from the application's point of view) operations other than system calls -- heck, even make(1) does that.

Daniel replied that the 'make' program in Alex's example was a perfect case of a broken application. Alex disagreed, and the subthread petered out.
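The disagreement above turns on how an application can make a file update look atomic when it spans several system calls. One widely used pattern is to write a temporary file, flush it with fsync, and rename it over the target. The following is a minimal Python sketch of that pattern, not something taken from the thread; the function name atomic_write is hypothetical, and whether the rename itself survives a power cut depends on the filesystem (a stricter version would also fsync the containing directory).

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so a reader sees the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be a single atomic step.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to stable storage
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Leave the old file untouched and clean up the partial temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

An interruption before the rename leaves the old file intact plus a stray temporary file; an interruption after it leaves the new file. This is close to what Daniel called 'serially reuseable': the operation can simply be run again from the beginning.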

Elsewhere, Stephen C. Tweedie said in response to Daniel's statement that the on-disk filesystem image of journalling filesystems would remain inconsistent until the journal recovery program had run, "ext3 does the recovery automatically during mount(8), so user space will never see an unrecovered filesystem. (There are filesystem flags set by the journal code to make sure that an unrecovered filesystem never gets mounted by a kernel which doesn't know how to do the appropriate recovery.)" Daniel replied, "Yes, and so long as your journal is not on another partition/disk things will eventually be set right. The combination of a partially updated filesystem and its journal is in some sense a complete, consistent filesystem." But he asked, "I'm curious - how does ext3 handle the possibility of a crash during journal recovery?" Andreas Dilger explained, "Unless Stephen says otherwise, my understanding is that a crash during journal recovery will just mean the journal is replayed again at the next recovery. Because the ext3 journal is just a series of data blocks to be copied into the filesystem (rather than "actions" to be done), it doesn't matter how many times it is done. The recovery flags are not reset until after the journal replay is completed." Alan Cox replied tersely, "Which means an ext3 volume cannot be recovered on a hard disk error." And Stephen replied:

Depends on the error. If the disk has gone hard-readonly, then we need to recover in core, and that's something which is not yet implemented but is a known todo item. Otherwise, it's not worse than an error on ext2: you don't have a guaranteed safe state to return to so you fall back to recovering what you can from the journal and then running an e2fsck pass. e2fsck groks the journal already.

And yes, a badly faulty drive can mess this up, but it can mess it for ext2 just as badly.
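Andreas' point that a journal of block images can be replayed any number of times can be illustrated with a toy model. This is only a sketch in Python under simplifying assumptions: the "disk" is a dictionary of block numbers, the journal is a list of (block number, contents) pairs, and none of the names reflect ext3's real on-disk format or code.

```python
def replay_blocks(disk, journal):
    """Copy every journaled block image to its home location on `disk`."""
    for block_no, contents in journal:
        disk[block_no] = contents
    return disk


def replay_actions(counters, actions):
    """Apply logged actions instead of block images; each run changes state again."""
    for name in actions:
        counters[name] = counters.get(name, 0) + 1
    return counters


disk = {0: b"old-A", 1: b"old-B", 2: b"old-C"}
journal = [(0, b"new-A"), (2, b"new-C")]

# A clean replay with no interruption.
clean = replay_blocks(dict(disk), journal)

# Simulate a crash after the first journal entry was copied, followed by a
# full replay from the start at the next mount.
interrupted = dict(disk)
interrupted[0] = b"new-A"
recovered = replay_blocks(interrupted, journal)

print(recovered == clean)  # True: copying the same blocks again is harmless

# A log of actions is not safe to repeat in the same way.
once = replay_actions({}, ["link_count"])
twice = replay_actions(dict(once), ["link_count"])
print(once == twice)       # False: the second replay incremented again
```

This is why, in the model, it is enough to leave the recovery flag set until the whole replay has finished: a crash in the middle just means starting over.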

Close by in the subthread, Stefan Traby asked, "I did not follow the ext3 development recently, how did you solve the "read-only mount(2) (optionally on write protected media)" issue ? Does the mount fail, or does the code virtually replays (without writing) only ?" Stephen explained:

The code currently checks if the underlying media is write-protected. If it is, it fails the mount; if not, it does the replay (so that mounting a root fs readonly works correctly).

I will be adding support for virtual replay for root filesystems to act as a last-chance way of recovering if you really cannot write to the root, but journaling filesystems really do expect to be able to write to the media so I am not sure whether it makes sense to support this on non-root filesystems too.

Stefan Traby had also added, "an unconditional hidden replay even if "ro" is specified is not nice. This is especially critical on root filesystem, because there is IMHO no way to specify mount arguments to the "/" mount, except ro/rw." Stephen asked, "In what way? A root fs readonly mount is usually designed to prevent the filesystem from being stomped on during the initial boot so that fsck can run without the filesystem being volatile. That's the only reason for the readonly mount: to allow recovery before we enable writes. With ext3, that recovery is done in the kernel, so doing that recovery during mount makes perfect sense even if the user is mounting root readonly." But David Woodhouse pointed out that there were other reasons for mounting a root filesystem readonly. The disk could be so damaged, he explained, that writing anything at all to it would be horribly bad; in which case one would mount it readonly, recover as much as possible from it, and throw it in the parts bin. He added, "You _don't_ want the fs code to ignore your explicit instructions not to write to the medium, and to destroy whatever data were left." But Marc Lehmann dissented:

The problem is: where did you give the explicit instruction? Just that you define "read-only" as "the medium should not be written" does not mean everybody else thinks the same.

actually, I regard "ro" mainly as a "hey kernel, I won't handle writes now, so please don't try it", like for cd-roms or other non-writeale media, and please filesystem stay in a clean state.

That ro means "the medium is never written" is an assumption that does not hold for most disks anyway and is, in the case of journlaing filesystems, often impossible to implement. You simply can't salvage data without a log reply. Sure, you can do virtual log replays, but for example the reiserfs log is currently 32mb. Pinning down that much memory for a virtual log reply is not possible on low-memory machines.

So the first thing would be to precisely define the meaning of the "ro" flag. Before this has happened it is ansolutely senseless to argue about what it means, as it doesn't mean anything at the moment, except (man mount):

ro Mount the file system read-only.

Which it does even with journaling filesystems...

Elsewhere, back on the subject of how to handle sudden shutdowns, and whether simply pulling the plug could be considered a legitimate way to end a typical single-user session, David Lang blurted, "for crying out loud, even windows tells the users they need to shutdown first and gripes at them if they pull the plug. what users are you trying to protect, ones to clueless to even run windows?" David W. replied, "Precisely. Users of embedded devices don't expect to have to treat them like computers." David L. listed in response:

in an enbedded device you can

  1. setup the power switch so it doesn't actually turn things off (it issues the shutdown command instead)
  2. run from read-only media almost exclusivly so that power event's don't bother you much
  3. you can add extra power inside the device so that if someone does pull the plug, you have a few seconds of power to do the clean shutdown
  4. you can run out of ram and force the user to do an extra step to save any changes to non-volitile storage (and if they power off during the save the results are undefined)

I have seen all of these approaches used in different devices (that are not running linux). This is not a new problem and the people working in this space have a bunch of answers.

an improved filesystem that tolorates bad shutdowns reasonably well will be welcomed for other reasons, but should not be viewed as a fix for people pulling the plug on you.

Alan said that David L.'s items #1 and #3 were too expensive, item #2 depended on the device, and item #4 was "Frowned upon because you keep getting dead units back". He concluded, "If it doesnt fix the pulling the plug case (at least as far as 'after fsync returned this data is safe') then its not working."

2. Maximum CPUs And RAM Under 2.4 Kernels

4 Jan 2001 - 10 Jan 2001 (17 posts) Archive Link: "Confirmation request about new 2.4.x. kernel limits"

Topics: Big Memory Support, SMP

People: Anton Blanchard, Tigran Aivazian, Pavel Machek

Someone asked about various limits for the 2.4 kernels. They thought SMP systems running 2.4 had a 32-cpu limit; and Anton Blanchard replied, "Max CPUs is 64 on 64 bit architectures (well you have to change NR_CPUS). I am told larger than 32 cpu ultrasparcs have booted linux already."

The original poster also thought there was a 64 Gigabyte maximum RAM size, and asked whether there were any slowdowns when accessing RAM over 4G on 32-bit machines. Tigran Aivazian replied, "realistic benchmarks (unixbench) will show about 3%-6% performance degradation with use of PAE. Note that this is not "accessing RAM over 4G" but (what you probably meant) "accessing any RAM in a machine with over 4G of RAM" or even "accessing any RAM in a machine with less than 4G or RAM but running kernel capable of accessing >4G". If you really meant "accessing RAM over 4G" then you are probably talking about 36bit MTRR support which is present in recent 2.4.x kernels and works very nicely!" Pavel Machek added elsewhere, "I believe you can get few terabytes with ultrasparc."
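The 4G and 64 Gigabyte figures in this exchange follow directly from the width of a physical address: 32 bits without PAE, 36 bits with it. A short worked calculation in Python (the variable names are only illustrative):

```python
GIB = 2 ** 30  # bytes in one gibibyte

plain_32bit = 2 ** 32  # bytes addressable with 32-bit physical addresses
pae_36bit = 2 ** 36    # bytes addressable with PAE's 36-bit physical addresses

print(plain_32bit // GIB)          # 4
print(pae_36bit // GIB)            # 64
print(pae_36bit // plain_32bit)    # 16: four extra address bits give 2**4 times the range
```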

3. ext3fs 0.0.5d And reiserfs 3.5.2x Mutually Exclusive

4 Jan 2001 - 8 Jan 2001 (3 posts) Archive Link: "ext3fs 0.0.5d and reiserfs 3.5.2x mutually exclusive"

Topics: FS: ReiserFS, FS: ext3

People: Stephen C. Tweedie, Chris Mason, Matthias Andree

Matthias Andree noticed that trying to patch ext3fs 0.0.5d onto a 2.2.18 kernel that already had reiserfs 3.5.28 would fail, because of overlapping patches in fs/buffer.c; he added that he'd reported this incompatibility some time before. Chris Mason, one of the reiserfs developers, said he'd start work on fixing it; and elsewhere, Stephen C. Tweedie, the ext3 author, said, "removing the extra debugging stuff and buffer.c code from the ext3 patches is on the todo list but is much lower priority than finishing off the tuning and user-space code for ext3-1.0."

4. Driver Submission Policy For 2.2

4 Jan 2001 - 9 Jan 2001 (30 posts) Archive Link: "Change of policy for future 2.2 driver submissions"

People: Alan Cox, Mark Hahn, Daniel Phillips, Wayne Brown, Tim Riker, Rik van Riel, Michael D. Crawford, Linus Torvalds, Nicholas Knight

Alan Cox announced:

Linux 2.4 is now out, it is also what people should be concentrating on first when issuing production drivers and driver updates. Effective from this point 2.2 driver submissions or major driver updates will only be accepted if the same code is also available for 2.4.

Someone has to do the merging otherwise, and it isnt going to be me...

There were mixed reactions to this. Nicholas Knight felt this policy was a mistake. Until the 2.4 series had stabilized, he felt, 2.2 would continue to be the kernel of choice for many people, in which case Alan's policy might result in less work being done on that kernel, and thus fewer new features in 2.2; he suggested waiting until 2.4 had reached a state where users could upgrade safely. There were several replies. Mark Hahn said:

egads! how can there be "development" on a *stable* kernel line?

maybe this is the time to reconsider terminology/policy: does "stable" mean "bugfixes only"? or does it mean "development kernel for conservatives"?

Daniel Phillips replied:

It means development kernel for those who don't have enough time to debug the main kernel as well as their own project. The stable branch tends to be *far* better documented than the bleeding edge branch. Try to find documentation on the all-important page cache, for example. It makes a whole lot of sense to develop in the stable branch, especially for new kernel developers, providing, of course, that the stable branch has the basic capabilities you need for your project.

Alan isn't telling anybody which branch to develop in - he's telling people what they have to do if they want their code in his tree. This means that when you develop in the stable branch you've got an extra step to do at the end of your project: port to the unstable branch. This only has to be done once and your code *will* get cleaned up a lot in the process. (It's amazing how the prospect of merging 500 lines of rejected patch tends to concentrate the mind.) I'd even suggest another step after that: port your unstable version back to the stable branch, and both versions will be cleaned up.

Wayne Brown objected, "In other words, there's no longer any such thing as a "stable" branch. The whole point of having separate production and development branches was to have one in which each succeeding patch could be counted upon to be more reliable than the last. If new development is going into the "stable" kernels, then there's no way to be certain that the latest patches don't have more bugs than the earlier ones, at least not without thoroughly testing them. And if testing is necessary, then we might as well just use the development kernels for everything, because we have to test them anyway." Alan replied, "By your personal definition of stable 2.0.3x is the current stable kernel."

The subthread trailed off at that point, but elsewhere, Tim Riker also replied to Nicholas' criticism of Alan's initial announcement. He said:

here are some comments in Alan's favor:

He did not say people can not release 2.2 patches without 2.4 patches. He only said they will not be integrated into the kernel distribution without 2.4 patches.

If people continue to develop for 2.2 and have someone else, who is probably less familiar with the hardware, port to 2.4 for them, how soon would you trust the drivers over the 2.2 drivers?

In short, I agree with Alan completely. This is the correct move forward to cause 2.4 to become the stable release that everyone will be willing to adopt.

Rik van Riel also replied to Nicholas, regarding the suggestion that it was a mistake not to wait until 2.4 had stabilized before instituting Alan's new policy. Rik said:

This is *exactly* why Alan's policy change makes sense.

If somebody submits a driver bugfix or update for 2.2, but not for 2.4, it'll take FOREVER for 2.4 to become as "trustable" as 2.2...

This change, however, will make sure that 2.4 will be as reliable as 2.2 much faster. Unlike 2.2, the core kernel of 2.4 is reliable ... only the peripheral stuff like drivers may be out of date or missing.

Elsewhere, Michael D. Crawford suggested that Linus Torvalds had arbitrarily decided to release 2.4.0 just to increase the number of people testing it. He said, "I understand Linus' desire to have more widespread testing done on the kernel, and certainly he can accomplish that by labeling some random build as the new stable version. But I think a better choice would have been to advocate testing more widely - don't just announce it to the linux-kernel list, get on National Public Radio, the Linux Journal and Slashdot and stuff." Linus replied:

You don't understand people, I think.

No amount of publicity will matter all that much in the end: yes, it will result in many people who are not afraid of a compiler to try it out. And we've had that for over six months now, realistically.

But that's very different from having somebody like RedHat, SuSE or Debian make such a kernel part of their standard package. No, I don't expect that they'll switch over completely immediately: that would show a lack of good judgement. The prudent approach has always been to have both a 2.2.19 and a 2.4.0 kernel on there, and ask the user if he wants to test the new kernel first.

That way you get a completely different kind of user that tests it.

The other thing is that even if something like 2.4.0-test8 gets rave reviews, that doesn't _matter_ to people who crave stability. The fact is that 2.4.0 has been getting quite a lot of testing: people haven't even seen how the big vendors have all done testing in their labs etc.

And to the people who really want to have stability, none of that matters. They will basically "start fresh" at the 2.4.0 release, and give it a few months just to follow the kernel list etc to see what the problems will be. They'll have people starting to ramp up 2.4.0 kernels in their own internal test environment, moving it first to machines they feel more comfortable with etc etc.

None of which would happen if you just try to make the beta testing cycle much bigger.

Which is why to _me_ the most important thing is that I'm happy with the core infrastructure - because once you've tested it to a certain degree, it's not going to improve without a real public release.

5. Modutils 2.4.0 Available

4 Jan 2001 - 8 Jan 2001 (15 posts) Archive Link: "Announce: modutils 2.4.0 is available"

People: Erik Mouw, Wichert Akkerman, Anuradha Ratnaweera, Keith Owens

Keith Owens announced modutils 2.4.0 and gave a link to the sources and some RPMs. Anuradha Ratnaweera suggested also providing .deb packages, but Erik Mouw replied, "He just provides the rpms as a service, he doesn't have to do that. Install the "alien" package on your machine and you will be able to convert between rpm and deb." Wichert Akkerman replied:

Bad plan, considering packages rely on some infrastructure that is not in the rpm (update-modules). I tend to be pretty quick with making and uploading the deb anyway.

Having said that, I won't package 2.4.0 and will wait for 2.4.1 instead.

6. MM/VM Todo List

5 Jan 2001 - 8 Jan 2001 (14 posts) Archive Link: "MM/VM todo list"

Topics: Clustering, Virtual Memory

People: Rik van Riel, Ben LaHaise

Rik van Riel announced:

here is a TODO list for the memory management area of the Linux kernel, with both trivial things that could be done for later 2.4 releases and more complex things that really have to be 2.5 things.

Most of these can be found on too

Trivial stuff:

Probably 2.5 era:

Additions to this list are always welcome, I'll put it online on the Linux-MM pages ( soon.

7. Why Use Modules?

5 Jan 2001 - 8 Jan 2001 (13 posts) Archive Link: "The advantage of modules?"

Topics: Networking

People: Michael Meissner, Drew Bertola

Evan Thompson asked whether there was any real reason to prefer compiling drivers as modules instead of compiling everything into the kernel binary. Drew Bertola suggested that module developers could load and unload modules for test purposes, without having to reboot the entire system. Michael Meissner said at greater length:

A couple of thoughts:

  1. A full kernel with everything compiled in might not fit on boot media such as floppies, while modules allows you to not load stuff that isn't needed to until after the main booting is accomplished.
  2. There are several devices that have multiple drivers (such as tulip, and old_tulip for example). Which particular driver works depends on your exact particular hardware. If both of these drivers are linked into the kernel, whatever the kernel chooses to initialize first will talk to the device.
  3. Having drivers as modules means that you can remove them and reload them. When I was working in an office, I had one scsi controller that was a different brand (Adaptec) than the main scsi controller (TekRam), and I hung a disk in a removable chasis on the scsi chain in addition to a tape driver and cd-rom. When I was about to go home, I would copy all of the data to the disk, unmount it, and then unload the scsi device driver. I would take the disk out, and reload the scsi device driver to get the tape/cd-rom. I would then take the disk to my home computer. I would reverse the process when I came in the morning.
  4. If you have multiple scsi controllers of different brands, building on into the kernel and the other brand(s) as modules allows you to control which scsi controller is the first controller in terms of where the disks are.

8. Bug Report Generation Tool

5 Jan 2001 - 11 Jan 2001 (43 posts) Archive Link: "[PATCHlet]: removal of redundant line in documentation"

People: Jeremy M. Dolan, Matthias Juchem, Alan Cox, Richard Torkar, Pavel Machek, David Ford, Rafael E. Herrera

In the course of discussing patch submissions, Jeremy M. Dolan suggested:

why not include a script which takes care of ALL the leg work? All of the files it asks the reporter to include are o+r...

I can whip up a bug_report script to walk the user though all of the steps in REPORTING-BUGS, if the list isn't averse to 'dumbing down' the process to the point where maybe some people who shouldn't be submiting bugs (two words: 'user error') end up not being scared off by the process.

Is perl allowed for kernel scripts intended for users, or am I stuck with sh?

Matthias Juchem replied that he'd already started work on such a script, and Pavel Machek had some suggestions.

Elsewhere, under the Subject: [PATCH] new bug report script, Matthias posted a patch against 2.4.0 and explained, "It introduces a new bug reporting script (scripts/ that tries to simplify bug reporting for users. I have also added a small hint to this script to REPORTING-BUGS." There was some discussion of possible fixes to the script, but elsewhere, Alan Cox objected, "The kernel doesnt require perl. I don't want to add a dependancy on perl." Matthias pointed out several other perl scripts in the official sources, and suggested making it optional. But Alan replied, "None of these are needed for normal build/use/bug reporting work. In fact if you look at script_asm you'll see we go to great pains to ship prebuilt files too." Matthias argued, "Why can't I assume that perl is installed? It can be found on every standard Linux/Unix installation. And besides, the bug report script doesn't replace anything the doesn't need perl - ver_linux, REPORTING-BUGS and oops-tracing.txt are still there for the more advanced user. My script is intended for the one who likes to provide bug reports but is too lazy to look up all the information or simply is not sure about what to include."

David Ford asked why the script couldn't be done as a shell script, and Matthias replied:

It can be done in sh, surely. I only tried to promote my perl version because I've done it in perl and nobody told me earlier that perl is not liked in the kernel tree - and I've seen some perl scripts there.

I guess I'll have to convert the script to sh.

Elsewhere, under the Subject: bugreporting script - second try, Matthias announced, "I rewrote my previous in bash. I would appreciate it if you had a look on this one. Run it once and give me feedback if you like." Richard Torkar reported success with it, though he'd been unable to test the ksymoops feature. After some more feedback from Richard, Matthias posted a link to a new version. Rafael E. Herrera posted a patch to the script to enable the use of /proc/config.gz when it was available. Matthias liked this idea and adopted it into the script.
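Rafael's idea of using /proc/config.gz only when it exists amounts to an optional-input check. Matthias' script was written in bash and is not reproduced here; the following Python sketch only illustrates the same idea, and the function names read_kernel_config and parse_config are hypothetical.

```python
import gzip
import os


def read_kernel_config(path="/proc/config.gz"):
    """Return the kernel configuration text, or None when the file is not provided."""
    if not os.path.exists(path):
        return None
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return f.read()


def parse_config(text):
    """Turn lines such as 'CONFIG_SMP=y' into a dictionary, skipping comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options


config_text = read_kernel_config()
if config_text is None:
    print("no /proc/config.gz on this system; leaving the config out of the report")
else:
    print(len(parse_config(config_text)), "options set")
```

A report generator built this way keeps working on kernels that do not export their configuration, and simply includes more detail on those that do.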

9. Patch Submission Policy For 2.4

6 Jan 2001 - 10 Jan 2001 (7 posts) Archive Link: "Linux-2.4.x patch submission policy"

Topics: FS: ramfs, Virtual Memory

People: Linus Torvalds, Alan Cox, Rik van Riel, Andrew Morton

Linus Torvalds stated:

I thought I'd mention the policy for 2.4.x patches so that nobody gets confused about these things. In some cases people seem to think that "since 2.4.x is out now, we can relax, go party, and generally goof off".

Not so.

The linux kernel has had an interesting release pattern: usually the .0 release was actually fairly good (there's almost always _something_ stupid, but on the whole not really horrible). And every single time so far, .1 has been worse. It usually takes until something like .5 until it has caught up and surpassed the stability of .0 again.

Why? Because there are a lot of pent-up patches waiting for inclusion, that didn't get through the "we need to get a release out, that patch can wait" filter. So early on in the stable tree, some of those patches make it. And it turns out to be a bad idea.

In an effort to avoid this mess this time, I have two guidelines:

In short, releasing 2.4.0 does not open up the floor to just about anything. In fact, to some degree it will probably make patches _less_ likely to be accepted than before, at least for a while. I want to be absolutely convicned that the basic 2.4.x infrastructure is solid as a rock before starting to accept more involved patches.

Right now my ChangeLog looks like this:

The first two are true one-liners that have already bitten some people (not what I'd call a showstopper, but trivially fixable stuff that are just thinkos). The third one looks like a real fix for some rather common hardware that could do bad things without it.

Now, I'm sure that ChangeLog will grow. There's the apparent fbcon bug with MTRR handling that looks like a prime candidate already, and I'll have people asking me for many many more. But basically what I'm asking people for is that before you send me a patch, ask yourself whether it absolutely HAS to happen now, or whether it could wait another month.

Another way of putting it: if you have a patch, ask yourself what would happen if it got left off the next RedHat/SuSE/Debian/Turbo/whatever distribution CD. Would it really be a big problem? If not, then I'd rather spend the time _really_ beating on the patches that _would_ be a big issue. Things like security (_especially_ remote attacks), outright crashes, or just totally unusable systems because it can't see the harddisk.

We'll all be happier if my ChangeLog is short and sweet, and if a 2.4.1 (tomorrow, in a week, in two, in a month, depending on what comes up) actually ends up being _better_ than 2.4.0. That would be a good new tradition to start.

And before you even bother asking about 2.5.x: it won't be opened until I feel happy to pass on 2.4.x to somebody else (hopefully Alan Cox doesn't feel burnt out and wants to continue to carry the torch and feels ok with leaving 2.2.x behind by then).

Historically, that's been at least a few months. In the 2.2.x series, 2.3.0 was the same as 2.2.8 with just the version changed - and it came out in May, almost four months after 2.2.0. In the 2.0.x series, 2.1.x was based off 2.0.21, four and a half months after 2.0.0.

Yes, I know this is boring, and all I'm asking is for people to not make it any harder for me than they have to. Think twice before sending me a patch, and when you _do_ send me a patch, try to think like a release manager and explain to me why the patch really makes sense to apply now.

In short, I'm hoping for a fairly boring next few months. The more boring, the better.

Alan Cox added regarding his own patches, "Think of -ac as a way to get patches you need that everyone else might not need yet, and a way to filter stuff. Im happy to take sane stuff Linus doesn't (within reason) and propogate it on as (or more to the point if) it proves sane." Rik van Riel also volunteered to "gather all non-bug VM patches and combine them into a special big patch periodically. Once we are sure 2.4 is stable for just about anybody I will submit some of the really trivial enhancements for inclusion; all non-trivial patches I will maintain in a VM bigpatch, which will be submitted for inclusion around 2.5.0 and should provide one easy patch for those distribution vendors who think 2.4 VM performance isn't good enough for them ;)"

10. Bug In 2.4.0 Virtual Memory Subsystem

8 Jan 2001 - 10 Jan 2001 (19 posts) Archive Link: "VM subsystem bug in 2.4.0 ?"

Topics: Virtual Memory

People: Rik van Riel, Linus Torvalds, Stephen C. Tweedie, Tim Wright, Christoph Rohland

Sergey E. Volkov was testing an Informix IIF-2000 database server running on a dual Intel Pentium II 233MHz; when running 'make -j30 bzImage' on the kernel source tree, the system would completely hang. When he tried the same thing on the same machine without Informix running, no hang occurred. He suspected the problem was that Informix allocated about 50% of the system's RAM as locked shared memory, so the kernel would try to swap out the locked segments, fail, and wait forever for them to swap out. Rik van Riel replied:

You are right. I have seen this bug before with the kernel moving unswappable pages from the active list to the inactive_dirty list and back.

We need a check in deactivate_page() to prevent the kernel from moving pages from locked shared memory segments to the inactive_dirty list.

He asked for advice from Christoph Rohland and Linus Torvalds, and Linus suggested:

The only solution I see is something like a "active_immobile" list, and add entries to that list whenever "writepage()" returns 1 - instead of just moving them to the active list.

Seems to be a simple enough change. The main worry would be getting the pages _off_ such a list: anything that unlocks a shared memory segment (can you even do that? If the only way to unlock is to remove, we have no problems) would have to have a special function to move all pages from the immobile list back to the active list (and then they'd get moved back if they were for another segment that is still locked).

Rik suggested just having a special "do not deactivate me" data-bit for each item on the list. "When this special bit is set," he said, "we simply move the page to the back of the active list instead of deactivating." He added, "when the bit changes again, the page can be evicted from memory just fine. In the mean time, the locked pages will also have undergone normal page aging and at unlock time we know whether to swap out the page or not." He admitted that this method would have higher overhead than Linus', but it seemed simpler and more flexible to him. Stephen C. Tweedie objected that he didn't see a way to clear the bit properly, saying, "Locking is a per-vma property, not per-page. I can mmap a file twice and mlock just one of the mappings. If you get a munlock(), how are you to know how many other locked mappings still exist?" Linus replied:

Note that this would be solved very cleanly if the SHM code would use the "VM_LOCKED" flag, and actually lock the pages in the VM, instead of trying to lock them down for writepage().

That would mean that such a segment would still get swapped out when it is not mapped anywhere, but I wonder if that semantic difference really matters.

If the vma is marked VM_LOCKED, the VM subsystem will do the right thing (the page will never get removed from the page tables, so it won't ever make it into that back-and-forth bounce between the active and the inactive lists).
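
As a rough illustration of the bookkeeping under discussion, the following toy model (plain Python, not kernel code) treats locking as a property of each mapping, as Stephen describes, and lets an aging scan rotate locked pages to the back of the active list, as Rik proposed. The names `Page`, `Mapping`, and `scan_active_list` are hypothetical and do not correspond to actual kernel structures or functions.

```python
from collections import deque


class Page:
    """Toy page that remembers which mappings include it."""

    def __init__(self, name):
        self.name = name
        self.mappings = []

    def is_locked(self):
        # A page counts as locked while ANY mapping of it is still locked.
        return any(m.locked for m in self.mappings)


class Mapping:
    """Hypothetical stand-in for a vma; `locked` plays the role of VM_LOCKED."""

    def __init__(self, pages, locked=False):
        self.pages = pages
        self.locked = locked
        for page in pages:
            page.mappings.append(self)


def scan_active_list(active, inactive_dirty):
    """One pass over the active list: rotate locked pages, deactivate the rest."""
    for _ in range(len(active)):
        page = active.popleft()
        if page.is_locked():
            active.append(page)          # back of the active list, not deactivated
        else:
            inactive_dirty.append(page)  # candidate for writeout


shared = Page("shared")
private = Page("private")
first = Mapping([shared], locked=True)
second = Mapping([shared, private], locked=True)

second.locked = False                    # "munlock" only one of the two mappings
active, inactive_dirty = deque([shared, private]), deque()
scan_active_list(active, inactive_dirty)
print([p.name for p in active], [p.name for p in inactive_dirty])
```

In this sketch the shared page stays on the active list after one mapping is unlocked, because the other mapping still holds it. A single per-page bit cannot track that case without consulting the mappings.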

Christoph posted a lightly tested patch, and Linus asked:

I'd really like an opinion on whether this is truly legal or not? After all, it does change the behaviour to mean "pages are locked only if they have been mapped into virtual memory". Which is not what it used to mean.

Arguably the new semantics are perfectly valid semantics on their own, but I'm not sure they are acceptable.

In contrast, the PG_realdirty approach would give the old behaviour of truly locked-down shm segments, with not significantly different complexity behaviour.

What do other UNIXes do for shm_lock()?

The Linux man-page explicitly states for SHM_LOCK that

The user must fault in any pages that are required to be present after locking is enabled.

which kind of implies to me that the VM_LOCKED implementation is ok. HOWEVER, looking at the HP-UX man-page, for example, certainly implies that the PG_realdirty approach is the correct one. The IRIX man-pages in contrast say

Locking occurs per address space; multiple processes or sprocs mapping the area at different addresses each need to issue the lock (this is primarily an issue with the per-process page tables).

which again implies that they've done something akin to a VM_LOCKED implementation.

Does anybody have any better pointers, ideas, or opinions?

In terms of how other UNIXes handled the situation, Tim Wright said:

It appears that the fine-detail semantics vary across the board. DYNIX/ptx supports two forms of SysV shm locking - soft and hard. Soft-locking (the default) merely makes the pages sticky, so if you fault them in, they stay in your resident set, but don't count against it. If, however, the process swaps, they're all evicted, and when the process is swapped back in, you get to fault them back in all over again. Hard locking pins the segment into physical memory until such time as it's destroyed. It stays there even if there are currently no attaches. Again, such pages are not counted against the process RSS.

SVR4 only supports one form. It faults all the pages in and locks them into memory, but doesn't treat them specially wrt rss/paging, which seems none too clever - if they're locked into memory, you might as well use them :-)

The discussion ended around there.

11. Superfluous Whitespace In The Kernel Sources

8 Jan 2001 (4 posts) Archive Link: "Extraneous whitespace removal?"

People: David Weinehall, Rusty Russell, Jeremy M. Dolan

Jeremy M. Dolan stripped all trailing whitespace from the ends of lines in the kernel sources, removing almost 200 KB and producing a patch of almost 2 MB. David Weinehall replied:

While I really like the idea with this patch, I'm 100% certain that Linus would not, under any circumstances, accept this patch.

I suggest that we instead force everyone to program with:

syntax on
let c_space_errors=1

(Or equivalent Emacs/[insert favourite editor here]-setting instead)

While at it, force people to read linux/Documentation/CodingStyle and make them adhere to it.

Of course, I guess this is a free world (yeah, right) and everyone should have the right to code in their own way, but I'd wish that people at least could be consistent when indenting/spacing/bracing/whatever, and when patching other people's code, also follow the already set standard of that file instead of introducing a new one...

Rusty Russell added, "I've done this before, but never posted it, lest they think I'm insane. I vote this for 2.5.1." He suggested listing Jeremy in the MAINTAINERS file as the official whitespace maintainer.
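
The kind of cleanup Jeremy describes can be sketched in a few lines. This is a minimal illustration, not Jeremy's actual script (which was not shown in the thread); the function name `strip_trailing_whitespace` is hypothetical, and it handles only spaces and tabs at the ends of lines while preserving each line's original line ending.

```python
def strip_trailing_whitespace(text):
    """Return (cleaned_text, characters_removed) for one file's contents."""
    cleaned_lines = []
    for line in text.splitlines(keepends=True):
        # Separate the line ending so "\n" and "\r\n" are preserved as found.
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        cleaned_lines.append(body.rstrip(" \t") + ending)
    cleaned = "".join(cleaned_lines)
    return cleaned, len(text) - len(cleaned)


sample = "int main(void)  \n{\t\n\treturn 0;\n}\n"
cleaned, removed = strip_trailing_whitespace(sample)
print(repr(cleaned), removed)
```

Summing the second return value over every file in a tree would give a figure comparable to the space savings quoted above.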

12. 2.0.39 Announced

9 Jan 2001 - 10 Jan 2001 (7 posts) Archive Link: "[Announcement] linux-kernel v2.0.39"

Topics: CREDITS File, Disks: IDE, FS: devfs, FS: ext2, FS: smbfs, MAINTAINERS File, Networking, PCI

People: David Weinehall, Matthew Grant, Jan Kara, Stephen C. Tweedie, Jari Ruusu, Andries Brouwer, Alan Cox, Ivan Passos, Andrea Arcangeli, Andre Hedrick, Jean Tourrilhes, Richard Gooch

David Weinehall announced 2.0.39:

Everyone laughs, I guess. The 2.0.39final didn't became the final release (could've told you so...) The good thing? Well, some bugs were found and removed. But this is it. Enjoy!

Changelog for v2.0.39

13. 2.4.0 On The IA64

10 Jan 2001 (4 posts) Archive Link: "2.4.0 release and ia64"

People: Bill Nottingham

Someone asked if 2.4.0 would run on the IA64 or if some special patches were required. Bill Nottingham replied, "There's a patch for it in ports/ia64 on your favorite linux kernel mirror." The original poster replied that those patches appeared to be only for test kernels, not the official 2.4.0 release. Bill replied:

There *should* be a patch for 2.4 final:


If not, your mirror isn't up to date.

14. Statistical Kernel Profiler Available

10 Jan 2001 - 11 Jan 2001 (2 posts) Archive Link: "[ANNOUNCE] oprofile profiler"

People: John Levon, Karim Yaghmour

John Levon announced:

oprofile is a low-overhead statistical profiler capable of instruction-grain profiling of the kernel (including interrupt handlers), modules, and user-space libraries and binaries.

It uses the Intel P6 performance counters as a source of interrupts to trigger the accounting handler in a manner similar to that of Digital's DCPI. All running processes, and the kernel, are profiled by default. The profiles can be extracted at any time with a simple utility. The system consists of a kernel module and a simple background daemon.

Typical overhead is around 3 or 4 percent. Worst case overhead on a Pentium II 350 UP system is around 10-15%

You can read a little more about oprofile, and download a very alpha version at :

oprofile is released under the GNU GPL.
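
The general idea behind statistical profiling can be sketched as follows. This is a toy model, not oprofile's implementation: the addresses are made up, the symbol table is hypothetical (the function names are used only as labels), and a plain list stands in for the program-counter values that a counter-overflow interrupt would capture.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start_address, name), sorted by address.
SYMBOLS = [(0x1000, "schedule"), (0x1400, "do_page_fault"), (0x1900, "sys_read")]
STARTS = [start for start, _ in SYMBOLS]


def symbol_for(address):
    """Map a sampled address to the symbol whose start precedes it."""
    index = bisect.bisect_right(STARTS, address) - 1
    return SYMBOLS[index][1] if index >= 0 else "<unknown>"


def build_profile(samples):
    """Count samples per symbol; more samples suggests more time spent there."""
    return Counter(symbol_for(address) for address in samples)


# Made-up addresses standing in for sampled program-counter values.
samples = [0x1004, 0x1010, 0x1408, 0x1904, 0x19F0, 0x1A00, 0x0800]
profile = build_profile(samples)
for name, count in profile.most_common():
    print(f"{name:15s} {count}")
```

Because only a sample is recorded per interrupt rather than every call, the overhead stays low, at the cost of the counts being estimates rather than exact figures.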

Karim Yaghmour replied:

This is really interesting. Great stuff.

As Alan had once suggested, it would be very interesting to have this information correlated with the content of the traces collected using the Linux Trace Toolkit. For instance, you could see how many cache faults the read() or write() operation of your application generated and other unique info. It would also be possible to enhance the post-mortem analysis done by LTT to take in account this data. You could also use LTT's dynamic event creation mechanism to log the profiling data as part of the trace.

There are definitely opportunities for interfacing/integrating here.

Let me know what you think.

There was no reply.

15. LVM Fixes Slow To Get Into The Official Kernel

10 Jan 2001 (5 posts) Archive Link: "Oops in 2.4.0 (@ LVM)"

Topics: Disk Arrays: LVM, Version Control

People: Andreas Dilger, Paul Jakma

Gustavo Zacarias got an oops from LVM running under 2.4.0, and Andreas Dilger replied:

There is a patch to the LVM kernel code which should help:

You should also get the LVM user tools from CVS (with TAG LVM_0-9-patches) to solve this problem. There will hopefully be a new LVM release soon.

Paul Jakma asked, "any word on when the kernel fixes are going to linus?" Andreas replied, "I've heard "soon" on the LVM list, but I'm just one of the chickens. If it were up to me, the fixes would go to Linus as soon as they are found." And Paul said, "indeed. it looks bad when code is updated irregularly, and it's a pain for users." End Of Thread.

16. Comparing Khttpd, Boa, And Tux

11 Jan 2001 - 13 Jan 2001 (12 posts) Archive Link: "khttpd beaten by boa"

Topics: Web Servers

People: Christoph Lameter, Lars Marowsky-Bree, David S. Miller, H. Peter Anvin, Arjan van de Ven, Dean Gaudet, Alan Cox

Christoph Lameter reported losing an argument over which web server was faster, khttpd or boa. He posted some numbers and said that in the first test, "boa won hands down because it supports persistent connections." They had run the same test with persistent connections turned off, but found that boa still won. He said:

This shows the following problems with khttpd:

1. Connect times are on average longer than boa. Why???

2. Transfers also take longer,

What is wrong here?

Lars Marowsky-Bree replied disgruntledly, "This just goes on to show that khttpd is unnecessary kernel bloat and can be "just as well" handled by a userspace application, minus some rather very special cases which do not justify its inclusion into the main kernel." David S. Miller added, "My take on this is that khttpd is unmaintained garbage. TUX is evidence that khttpd can be done properly and beat the pants off of anything done in userspace." H. Peter Anvin suggested, "Then why don't we unload khttpd and put in Tux?" Elsewhere, Arjan van de Ven remarked, "TuX is certainly the "next and better" generation, and I look forward to working with Ingo and others on it." But Alan Cox mentioned that, since Tux required the 'zero copy' patches, those patches would have to go in before Tux could be considered.

Elsewhere, under the Subject: khttpd beats boa with persistent patch, Christoph said with glee:

I applied the persistent khttpd patch + my vhost patch and now khttpd beats boa!!! (patch against 2.4.0 follows at the end of the message)

The connection times of boa are still better but khttpd wins in transfers.

Dean Gaudet pointed out that running the test locally ignored network latencies, and was thus a meaningless benchmark. He explained, "latency is as important, or even more important than raw throughput. anything beyond a second or two is the point where humans start giving up on the server. if you study a real benchmark such as specweb99 you'll find that if you don't have good response latency then your score is not valid. they actually have a minimum throughput that each connection must meet or else it's considered an error -- it's similar to having a latency budget, with some slight differences."
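
Dean's point about a per-connection minimum can be sketched as a simple scoring pass. This is an illustration only: the function name `score_connections` is hypothetical, and the rate floor used below is a made-up number, not a value taken from the SPECweb99 rules.

```python
def score_connections(connections, min_bits_per_second):
    """Split connections into valid and failed by a per-connection rate floor.

    `connections` is a list of (bytes_transferred, seconds_elapsed) pairs.
    """
    valid, failed = [], []
    for size_bytes, seconds in connections:
        rate = (size_bytes * 8) / seconds if seconds > 0 else float("inf")
        (valid if rate >= min_bits_per_second else failed).append(rate)
    return valid, failed


# Illustrative numbers only; the floor is NOT taken from the SPECweb99 rules.
connections = [(100_000, 1.0), (100_000, 4.0), (50_000, 0.5)]
valid, failed = score_connections(connections, min_bits_per_second=300_000)
print(len(valid), len(failed))
```

Under this kind of scoring, a server with high aggregate throughput can still fail the benchmark if individual connections are served too slowly, which is why a local test that ignores network latency says little.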

17. Unexplained 2.4.0 Filesystem Corruption

12 Jan 2001 - 14 Jan 2001 (15 posts) Archive Link: "2.4 ate my filesystem on rw-mount"

Topics: Disks: IDE

People: Tobias Ringstrom, Alan Cox, Vojtech Pavlik

Tobias Ringstrom gave a hair-raising account of his 2.4.0 experiences:

I've never seen anything like it before, which I'm happy for. The system had been running a standard RedHat 7 kernel for days without any problems, but who wants to run a 2.2 kernel? I compiled 2.4.0 for it, rebooted, and blam! The RedHat init scripts got to the "remounting root read-write" point, and just froze solid.

Rebooting into RH7 failed, because inittab could not be found. In fact the filesystem was completely messed up, with /dev empty, lots of device nodes in /etc, and files missing all over the place. I had to reinstall RH7 from scratch.

I do not understand how this could happen during a remounting root rw. Is the filesystem really that unstable?

Am I right in suspecting DMA, which was enabled at the time? Any other ideas? Is it a known problem?

This is on a 450 MHz AMD-K6 with the following IDE controller:

00:07.1 IDE interface: VIA Technologies, Inc. VT82C586 IDE [Apollo] (rev 06)

Alan Cox replied, "There are several people who have reported that the 2.4.0 VIA IDE driver trashes hard disks like that. The 2.2 one also did this sometimes but only with specific chipset versions and if you have dma autotune on (thats why currently 2.2 refuses to do tuning on VP3)"

Vojtech Pavlik also replied to Tobias, saying, "Wow. Ok, I'm maintaining the 2.4.0 VIA driver, so I'd like to know more about this." He asked for specific hardware details, which Tobias provided, and they went back and forth for a bit, though no solution appeared on the list.

18. PowerPC In The Official Tree

13 Jan 2001 (2 posts) Archive Link: "PPC 2.4 ?"

People: Cort Dougan, Giuliano Pochini

Giuliano Pochini asked when the PowerPC tree would be merged into the official sources, since none of the official versions would even compile. Cort Dougan replied:

Grab a tree from Those always compile and are up-to-date.

I send patches, but they don't always make it into the main tree. In the mean time, you have a consistent source of kernels with the above web site.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.