Kernel Traffic
Latest | Archives | People | Topics

Kernel Traffic #33 For 7 Sep 1999

By Zack Brown


After some back and forth trying to find the right day, I think the schedule of KT has settled on Monday mornings, around 10AM Pacific time. I know, today's Tuesday, but that's a holiday thing.

As announced on Linux Today, the new URL for KT is, so you should update your bookmarks, even though the old site should continue to work for a little while. The KT/KC layout will also be changing soon to be an integral part of the Linuxcare pages. If you have special browser needs, check out the Linuxcare site and make sure you can view it comfortably. If not, let me know.

Mailing List Stats For This Week

We looked at 1158 posts in 4802K.

There were 462 different contributors. 205 posted more than once. 184 posted last week too.

The top posters of the week were:

1. Linux On Pentium III

30 Aug 1999 (2 posts) Archive Link: "need PIII advices"

People: Lee Hetherington, Kurt Garloff, Doug Ledford

FAVRE Gregoire got a Pentium III and was curious how he could optimize a kernel for it. Lee Hetherington replied that Doug Ledford had some new PIII patches, but that Lee hadn't tried them yet. He added, "I am currently running patch fx-2.2.5-A4 that Kurt Garloff sent me, and that is running fine. You will need to specify PIII at configure time. To actually assemble the new MMX2/KNI/SSE/XMM (insert your favorite name here) instructions you will need binutils- I am using"

2. ext2fs Patches For Speed And Recovery

14 Aug 1999 - 27 Aug 1999 (12 posts) Archive Link: "[PATCH] roubust ext2fs against failure"

Topics: FS: ext2

People: Hirokazu Takahashi, Hubert Tonneau, Theodore Y. Ts'o

Hirokazu Takahashi gave a pointer to his Linux: Robust Ext2fs Against Faults page, which had a patch (against 2.2.x and 2.3.x) to make ext2fs recover more easily from crashes. It also significantly improved performance times for file operations. The web page has a lot of information about the patch, but in his post, he explained:

My first aproach to implement them is 'fail-safe and fail-soft':

Second, minimize loss of performance:

Sang-yong Suh thought the ideas were good, and asked if the performance improvements applied to "async" mounted partitions, but Hirokazu replied that no, only partitions mounted "sync" would see those benefits. "async" mounted partitions would have the same performance as normal Linux.

Hubert Tonneau tried the patch on 2.2.12-pre4 and reported good success, but added, "this is the kind of patch that I'd really like to see comments on from Linux big names. Either it's conceptualy broken and I'd (and probably many others) like to know why, or it's the kind of thing that we definetly want in all standard trees, even if it's not 100% granting agains meta data corruption."

Theodore Y. Ts'o came into the conversation at one point, regarding the peripheral issue of dealing with file corruption after crashes. He didn't weigh in strongly either for or against the patches, but one of his comments was, "the funny thing is that if you have to run e2fsck after each system crash anyway, it's not worth it to do some of the things detailed in the tech note. For example, it talks about wanting to clear the indirect block to disk before linking it in to the inode, lest garbage in the indirect block look like valid blocks, and so you have blocks claimed by multiple inodes. True --- but e2fsck detects this case and fixes it. So it's not at all clear it's worth the performance hit to clear the indirect block first."

3. Many SMP Races In 2.3.13

16 Aug 1999 - 24 Aug 1999 (12 posts) Archive Link: "[patch] possible SMP races all over the place in wait_event interface"

Topics: SMP

People: Andrea Arcangeli, Linus Torvalds

Andrea Arcangeli couldn't believe his eyes, but it seemed like there were SMP race conditions in the wait_event code, requiring read and write ordering enforcement to be added to many spots in the source. He posted a sample of such a change against 2.3.13; Linus Torvalds replied that part of the patch was not necessary, but that the rest was indeed a problem. Andrea posted a large patch against 2.3.13 and various folks made some corrections.
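The summary does not show Andrea's patch, so the following is only a generic user-space illustration of the kind of read/write ordering at stake, written with C11 atomics rather than kernel primitives. The names `payload`, `ready`, `publish` and `try_consume` are hypothetical and do not come from the thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: a payload guarded by a "ready" flag. */
static int payload;
static atomic_bool ready;

/* Waker side: write the data first, then set the flag with release
 * ordering so the payload store cannot be reordered after it. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Waiter side: test the flag with acquire ordering; only when it is
 * set is it safe to read the payload. Returns false if not ready. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

Without the release/acquire pairing, a second CPU could in principle observe the flag before the data it guards, which is the general shape of bug that ordering enforcement is meant to rule out.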

4. Magic Sysrq For Serial Consoles

18 Aug 1999 - 25 Aug 1999 (7 posts) Archive Link: "PATCH: magic sysrq for serial console"

People: Miquel van Smoorenburg, Scott Laird, Linus Torvalds

Miquel van Smoorenburg posted a patch that would make sysrq work on a serial console. He explained, "You just send a BREAK and then within 5 seconds a command key. You need to enable both serial console support and magic sysrq support in the kernel to use this. You do need a getty or anything else running on the serial device, since the serial driver only sees interrupts and BREAKs if the port is actually opened!"

A number of folks burst into applause, and Scott Laird posted an additional patch for redisplaying the last 1k of the message ring buffer and for faking a Ctrl-Alt-Delete. Miquel liked the addition and said he'd add it to his submission to Linus Torvalds later in the week.

5. Ramdisks Blocksize And 2.3.x Problems

20 Aug 1999 - 24 Aug 1999 (19 posts) Archive Link: "[patch] ramdisk blocksize"

Topics: Big Memory Support, FS: ext2, Security, Virtual Memory

People: Andrea Arcangeli, Mike Klar, Linus Torvalds

Andrea Arcangeli posted a patch to fix a bug that was causing the ramdisk driver to crash when given blocksizes larger than 512 bytes. He also offered some interesting advice:

People who wants to use extensively ramdisks should use a blocksize of PAGE_SIZE to avoid the fragmentation of not-freeable memory. NOTE: you can't generate a 4k filesystem using a blocksize of 1k simply because the buffer-size used from mke2fs must be the same used by the filesystem code otherwise you won't read on the memory you written before (on a ramdisk you can't emulate larger softblocksize). So to use a blocksize of 4k you must:

        insmod rd rd_blocksize=4096
        mke2fs -b 4096 /dev/ramdisk
        mount /dev/ramdisk /mnt

then you'll avoid the fragmentation of protected buffers (really I don't ever know if the ramdisk driver will work correctly on 2.3.14-pre2 but that's a different issue and it has only to deal with the VM not with the ramdisk internals that have only to set the protected bit upon lowlevel writes).

Insmodding rd without parameters you'll use a hardblocksize of 1k and everything will work as usual. btw currently rd.c works as usual even with 512 byte of hardblocksize because both the blockdevice layer, and ext2fs prefer to start reading/writeing with a softblocksize of 1k (and mke2fs generate filesystem with 1k blocksize as default). With the fixes now the hardblocksize and the softblocksize will be both set to 1k and this looks cleaner to me.

The last issue is with the ramdisk images: the helper function (identify_ramdisk_image) that reads the superblock only check how many blocks it contains but it doesn't check the blocksize of the filesystem. So basically you must use a blocksize of 1k if you want rd_load_image to work properly. At least now it's a know feature and the check at the end of the patch will be done right.
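To make the missing check concrete, here is a minimal sketch of how a loader could read the block size out of an ext2 superblock. It is not the code from `identify_ramdisk_image` or from Andrea's patch; the function names `ext2_block_size` and `image_matches_ramdisk` are hypothetical. It relies only on the standard ext2 on-disk layout: the magic number 0xEF53 at byte 56 of the superblock and `s_log_block_size` at byte 24, with block size equal to 1024 shifted left by that value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets within the ext2 superblock (fields are little-endian). */
#define EXT2_SB_LOG_BLOCK_SIZE 24   /* s_log_block_size, 32 bits */
#define EXT2_SB_MAGIC          56   /* s_magic, 16 bits          */
#define EXT2_MAGIC             0xEF53u

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the filesystem block size in bytes, or 0 if `sb` does not
 * look like an ext2 superblock. `sb` must point at the superblock
 * itself, which starts 1024 bytes into the image. */
uint32_t ext2_block_size(const unsigned char *sb)
{
    assert(sb != NULL);
    unsigned magic = (unsigned)sb[EXT2_SB_MAGIC] |
                     ((unsigned)sb[EXT2_SB_MAGIC + 1] << 8);
    if (magic != EXT2_MAGIC)
        return 0;
    uint32_t log = le32(sb + EXT2_SB_LOG_BLOCK_SIZE);
    if (log > 6)            /* sanity limit chosen for this sketch */
        return 0;
    return 1024u << log;
}

/* Hypothetical loader check: accept only images whose block size
 * matches the block size the ramdisk was configured with. */
int image_matches_ramdisk(const unsigned char *sb, uint32_t rd_blocksize)
{
    uint32_t bs = ext2_block_size(sb);
    return bs != 0 && bs == rd_blocksize;
}
```

With a check like this, an image built with `mke2fs -b 4096` could be rejected up front when the ramdisk uses 1k blocks, instead of being loaded and misread.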

Regarding ramdisks under 2.3.x, Bradley D. LaRonde said that according to the tests he had seen, the ramdisk driver had been broken in every kernel after 2.3.6. He posted a hack from Mike Klar that worked but was not a full fix, along with a fascinating explanation (by Mike) of what was going wrong:

OK, here's a summary of what I see to be the problem with ramdisk as of 2.3.7. Note that it's partly predicated on a not-so-good understanding of the page cache, so some of the analysis is very subject to error:

First the problem: With kernel version 2.3.7 , and continuing through the current 2.3.13 version, the ramdisk driver does not function properly when mounted as root, whether loaded from initrd or from a floppy disk. Non-root ramdisk can also function improperly if direct device IO and file IO are both used on a ramdisk (for example, by loading a filesystem image by dd, then mounting it and attempting to read files). I believe this is indicative of a broader, but more subtle, problem that could affect any device.

The specific symptoms of the ramdisk problem: The filesystem is recognized, mounts properly, directory reads are OK, but file reads return bogus data. When using ramdisk as root, the kernel fails with "Kernel panic: No init found. Try passing init= option to kernel." regardless of whether there is a valid init= option present. Debugging has revealed that the kernel does find the directory entry for the init executable, but when it reads the executable file, it gets back all 00s, and so it rejects the file as non-executable. If a (non-root) ramdisk is prepared with mke2fs, mounted, files are written, then read, that appears to work OK.

The short version of why it's happening: Ramdisk keeps its data in the buffer cache, file IO checks the page cache. The file reads are missing in the page cache, and creating new buffer entries (filled from the ramdisk device, which will return 00s), even though the data already exists in the buffer cache.

The longer version of why it's happening: The rd device driver itself doesn't know anything about the actual data, it lets the buffer cache worry about storing the data and servicing reads. Whenever a read request actually gets to the rd device, it assumes that since it must have missed in the buffer cache, it's a new block that was never accessed before, so it just returns 00s.

When a ramdisk is mounted as root, the image is loaded into the ramdisk at boot-time via direct device IO, which creates buffer cache entries, but not page cache entries.

File IO reads then check the page cache, which keeps its data in the buffer cache, but is searched in a separate hash. If the data wasn't written via file IO, it will miss in the page cache search and issue a read request directly to the device. The ramdisk thinks it missed in the buffer cache, so it returns 00s. This results in 2 separate buffers for the same block on the ramdisk, only one of which has a page cache entry. Because of the way the ramdisk works, the second copy is wrong (filled with 00s).

This problem could affect any other device that mixes direct device IO and file IO as well, but the consequences are more subtle. If a buffer cache (but not a page cache) entry exists for a particular block of a device, then a file IO read (or even worse, a write) takes place to that same block, the read will miss in the page cache and create new buffers. For any device other than ramdisk, the new buffers will be filled from the physical device itself, so the data at least has a good chance of being valid. You still wind up with 2 separate buffers claiming to represent the same block on the device, not a good thing. Aside from the inefficiency (both space and speed), if those 2 buffers ever get to hold different versions of the data (which I assume would happen if a write takes place to that block), you've got big big trouble.

Some possible solutions: The obvious answer to the ramdisk problem would seem to be having it store its data in the page cache, but that would probably result in throwing out many of the advantages of the current ramdisk implementation (which is simple, efficient, and clean), and also wouldn't do anything about the broader problem. I don't like this approach because I think the problem is with the page cache implementation, not with the ramdisk implementation.

On the broader problem, I don't see any alternative to either checking the buffer cache on page cache misses, or fundamentally changing either the buffer cache or page cache implementations. If the buffer cache is checked on page cache misses, what to do if a block hits in the buffer cache, but not the page cache, is a fuzzier issue. Should the old buffer be flushed and invalidated, then a new buffer created and filled from the device? That seems unnecessarily inefficient, and would still leave the current ramdisk broken. Should the new buffer be filled from the old buffer, then the old buffer invalidated? Or should the page cache entry be created to use the old buffer without moving the data (which is not nearly as simple as it sounds)? Or something I haven't thought of?
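The failure mode Mike describes can be mimicked with a toy model. This is a deliberately simplified sketch, not kernel code: the two "caches" are plain arrays, each block is a single byte, and all names (`device_write`, `file_read_buggy`, `file_read_checked`) are hypothetical. It shows a read that misses in the page cache going to a ramdisk that returns zeros, and one of the alternatives Mike raises, filling the new entry from the existing buffer.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 8

/* Two independent lookup structures, as in Mike's description. */
struct cache { unsigned char data[NBLOCKS]; int valid[NBLOCKS]; };
static struct cache buffer_cache, page_cache;

void reset_caches(void)
{
    memset(&buffer_cache, 0, sizeof buffer_cache);
    memset(&page_cache, 0, sizeof page_cache);
}

/* The ramdisk "device" stores nothing itself: any read that reaches
 * it is assumed to be for a never-written block, so it returns 0. */
static unsigned char rd_device_read(int block) { (void)block; return 0; }

/* Direct device IO (e.g. loading an image): buffer cache only. */
void device_write(int block, unsigned char value)
{
    assert(block >= 0 && block < NBLOCKS);
    buffer_cache.data[block] = value;
    buffer_cache.valid[block] = 1;
}

/* File read as described for 2.3.7 and later: a page cache miss goes
 * straight to the device without consulting the buffer cache. */
unsigned char file_read_buggy(int block)
{
    assert(block >= 0 && block < NBLOCKS);
    if (!page_cache.valid[block]) {
        page_cache.data[block] = rd_device_read(block);
        page_cache.valid[block] = 1;
    }
    return page_cache.data[block];
}

/* Alternative: on a page cache miss, check the buffer cache and fill
 * the new entry from the existing buffer if there is one. */
unsigned char file_read_checked(int block)
{
    assert(block >= 0 && block < NBLOCKS);
    if (!page_cache.valid[block]) {
        page_cache.data[block] = buffer_cache.valid[block]
            ? buffer_cache.data[block]
            : rd_device_read(block);
        page_cache.valid[block] = 1;
    }
    return page_cache.data[block];
}
```

After `device_write(3, 0xAB)`, the buggy read of block 3 yields 0 while the checked read yields 0xAB, matching the "all 00s" symptom in the report.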

After some further discussion, Andrea posted a patch, and said, "This patch will fix the ramdisk but I found some major bug (that can lead to data corruption) in the page/buffer cache (not ramdisk related) and I think that once I'll have fixed them the ramdisk won't need any change. So take the below patch as a temp (wrong) fix."

He replied to himself with a patch an hour and a half later, explaining:

Ok, I just have a preliminary patch that try to fix the potential data corruption that can happens in 2.3.15-pre1 and previous 2.3.x kernels (and that will automagically fix the ramdisk driver without changing its internals).

The corruption bug (that has nothing to do with the ramdisk driver) is the use of truncate_inode_page() to shrink the icache. If an inode is not in-use and it's hashed in the icache, it can have dirty or protected pages allocated in its page cache.

So when we shrink the icache so we need to release also all the page-cache pages that belongs to such inode, we can't simply mark all the page-cache-overlapped-buffers as clean in flushpage. Otherwise we'll lose data-writes and this will lead to data corruption on disk.

Previously (in 2.2.x) it was possible to use truncate_inode_pages without differences (both for shrink the icache and for truncate(2)/unlink), since the page cache was only there for reads, and both writes and protected buffers was placed in the buffer cache. This is not possible anymore since now the pagecache has dirty or protected data in it.

NOTE: with my patch applyed the blockdevice writes are still not synchronized with filesystem writes (this avoids us having to hash in the buffer-hashtable the page-cache-overlapped-buffers). So if you read from the blockdevice layer you shouldn't expect to read the last uptodate data and if you write to the blockdevice layer your writes can be lost. So just choose if to use a blockdevice in raw mode or with a filesystem on the top of it, before start using it ;).

But with the patch applyed it should be guaranteeed that if you unmount a ramdisk, and _then_ you read the ramdisk from the blockdevice layer, you'll read the right data (the page-cache will be correctly converted to regular buffers and not to orphaned-lost buffers).

Please test the ramdisk driver heavily with only this patch applyed against 2.3.15-pre1. Try to generate ramdisk images using the ramdisk itself. Make sure to always unmount before accessing the ramdisk via /dev/ram*. If you get in troubles give me a way to reproduce please ;).

After a bit of a bug-hunt, he posted an incremental patch to be applied on top of the previous one, and then another. At this point there was some confusion that was not resolved in the thread.

Linus Torvalds pointed out that Andrea's third patch would cause major filesystem corruption that fsck would not catch. Andrea replied with the explanation that there had been a hack applied to the kernel between 2.2.x and 2.3.x that would solve the problem he was addressing, but would allow an exploitable DoS attack. He felt his solution was the "proper" one and posted the full patch (no incremental pieces) to avoid confusion. Linus replied that there was no DoS exploit, adding, "You seem to have complicated the code just because you didn't notice that we already handled it with a very simple test". Andrea posted some code to leak inodes in 2.3.15pre2, explaining, "run the above proggy on a bigmem machine and you'll leak lots of memory in unfreeable inodes. With the default setting if inode-max == 16376 I am been able to enlarge the icache to 30000 inodes (but only because I have only 128mbyte of ram, if shrink_mmap triggers leather then you'll have far more fun). 30000 inodes are at least 8mbyte of unfreeable memory. You'll have to reboot to shrink the icache."

Linus said he'd accept patches to make inodes freeable, adding, "It has actually been on my list of things to do for a while, but I never got around to it. If you make inodes freeable, ALL of the inode complexity just goes away (forget about the inode-max stuff and the nasty logic to make sure that we try to free dentries etc even if we have tons of memory left)," and said that Andrea was working around the problem.

Andrea pointed out that he had been trying to fix the ramdisk driver; and the thread pretty much died.

6. ISDN Still Unsettled

22 Aug 1999 - 25 Aug 1999 (12 posts) Archive Link: "Huge patches such as ISDN"

Topics: Networking

People: Linus Torvalds

There was a bit of discussion about ISDN this week, regarding its inclusion in the Linus Torvalds tree. Linus had nothing to say about it this time, which would seem to indicate that the question remains: will the ISDN people start presenting him with timely patches? Apparently the answer is: not yet.

7. General-Purpose PCI Driver

22 Aug 1999 - 25 Aug 1999 (4 posts) Archive Link: "General purpose serial PCI driver available for testing"

Topics: PCI

Otto Moerbeek gave a link to Otto's LinuxPPC Goodies page and announced a patch for the serial driver to provide easy support for any "dumb" serial PCI card. Apparently there were some behind-the-scenes replies, because he posted a few updates shortly thereafter.

8. Sony MiniDisc Filesystems

23 Aug 1999 - 24 Aug 1999 (21 posts) Archive Link: "Streaming disk I/O kills file buffering and makes Linux unusable"

People: Chuck Lever

In the course of discussion, Chuck Lever gave an interesting critique:

i've been very impressed with the Sony MiniDisc file system -- it's a simplified file system that transparently manages data on MO MiniDiscs.

the name space is flat -- you can have 1 to 255 separate "tracks", and i've seen this system used for monaural or up to 8 concurrent channels of recording, on 128M MO disks. i'm no expert, but i'd bet a faster, larger disk could be used to boost the number of concurrent channels. it's probably just a matter of how many channels can be multiplexed into a track's block stream.

a simple TOC-based block-allocation system is used. the TOC is stored in the disk writer's volatile memory, and written out when the disk is ejected; this eliminates the extra seeks involved with metadata updates. new blocks are allocated sequentially from free space at the end of the disk, or eventually from blocks freed by erasing tracks of previously recorded material. a text area of about 2K per disk is reserved for tagging each track with "title" data.

for managing a large disk, one might extend such a file system to handle more tracks, more blocks, and generate a periodic TOC update that writes the TOC into special areas allocated across the entire disk to minimize seek latency.

this kind of system would eliminate the need for indirect blocks or even extent-based allocation, and keep metadata updates very cheap. it would also make it easy to combine, divide, and otherwise edit the data in the tracks.
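As a rough illustration of the allocation scheme Chuck describes, here is a heavily simplified in-memory sketch. It is not Sony's format: the structure layout, the disk size, and the names (`struct toc`, `toc_new_track`, `toc_append_block`, `toc_eject`) are all assumptions made for the example, only the most recent track can grow, and reuse of freed blocks is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TRACKS  255
#define DISK_BLOCKS 1024    /* arbitrary disk size for the sketch */

struct toc {
    int ntracks;
    int first_block[MAX_TRACKS];   /* where each track starts      */
    int nblocks[MAX_TRACKS];       /* how many blocks it occupies  */
    int next_free;                 /* next unused block at the end */
    int toc_writes;                /* times the TOC hit the disk   */
};

void toc_init(struct toc *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);
}

/* Start a new track at the current end of the used area.
 * Returns the track index, or -1 if the TOC is full. */
int toc_new_track(struct toc *t)
{
    assert(t != NULL);
    if (t->ntracks == MAX_TRACKS)
        return -1;
    t->first_block[t->ntracks] = t->next_free;
    t->nblocks[t->ntracks] = 0;
    return t->ntracks++;
}

/* Append one block to the most recent track. Only the in-memory TOC
 * changes, so no metadata seek is needed. Returns the block number,
 * or -1 if there is no track or the disk is full. */
int toc_append_block(struct toc *t)
{
    assert(t != NULL);
    if (t->ntracks == 0 || t->next_free == DISK_BLOCKS)
        return -1;
    t->nblocks[t->ntracks - 1]++;
    return t->next_free++;
}

/* The TOC is written out once, when the disk is ejected. */
void toc_eject(struct toc *t)
{
    assert(t != NULL);
    t->toc_writes++;
}
```

The point of the sketch is the cost model: any number of block allocations cost zero metadata writes until eject, at the price of losing the TOC if the writer loses power first.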

9. VAIO Compatibility Questions With 2.3.11 And Higher

23 Aug 1999 - 24 Aug 1999 (7 posts) Archive Link: "Problems with Linux 2.3.1[1-4] on a Sony VAIO laptop"

People: Theodore Y. Ts'o

Theodore Y. Ts'o reported that his VAIO 505TX was refusing to boot any kernel after 2.3.10. He added, "I've isolated it to approximately 150k worth of diffs, but unfortunately I can't narrow it down any further since the changes involve moving away from a single TSS per process to a single TSS per CPU, and so the changes touch a huge number of files and are interrelated, so I can't pull one out without pulling them all out."

Thomas Davis also had a 505TX but had no trouble getting 2.3.12 running on it. Ted replied that he was using LILO v. 21, with 128M RAM and 6G HD space. Thomas replied that he had 64M RAM, 6.4G HD, APM turned on, Linux 2.3.13 (moving right along), pcmcia-cs.09-Aug-99, and BIOS Version R0113R5. End Of Thread.

10. APM And SMP

23 Aug 1999 - 25 Aug 1999 (8 posts) Archive Link: "Odd APM oops"

Topics: SMP

People: Alan Cox, Stephen Rothwell

Mike Ricketts found that "power off on shutdown" was causing an oops on his SMP system running 2.3.13 or 2.3.14. The oops didn't happen prior to 2.3.13. In the course of discussion, Alan Cox said, "power off on shutdown is not SMP safe. It kind of happens to work on a lot of boards. If making that APM call reformats your disk and plays tetris on an SMP box the bios vendor is within spec (if a little peculiar). No APM call of any kind is SMP safe."

Elsewhere it came out that the kernel was supposed to disable APM if it detects an SMP system, and Stephen Rothwell (the APM maintainer) said, "BUT ... The APM code got "hacked" in 2.3.13 (not by me) and probably doesn't work at all at the moment. I am trying to find time to send Linus a patch that gets it back to where it was. In particular, the check that protected SMP systems from running APM (except for power off) was moved and I haven't had time to figure out how it should be fixed."

11. Minimal DOS-type Partition Table Specification

24 Aug 1999 (1 post) Archive Link: "Minimal partition table specification"

People: Andries Brouwer

Andries Brouwer gave a pointer to his Minimal partition table specification, i.e., "the minimal information required to construct firmware that can interpret current DOS-type partition tables."
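For readers who have not looked at one, the sketch below decodes a primary entry from a conventional DOS-type partition table. It reflects the commonly documented layout (four 16-byte entries starting at byte 446 of sector 0, a 0x55 0xAA signature at bytes 510 and 511, little-endian start and length fields) and is not taken from Andries' document; the names `struct dos_partition` and `parse_dos_partition` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MBR_TABLE_OFFSET 446   /* first of four primary entries */
#define MBR_ENTRY_SIZE   16
#define MBR_SIG_OFFSET   510   /* bytes 0x55, 0xAA */

struct dos_partition {
    uint8_t  bootable;      /* 0x80 = active, 0x00 = inactive  */
    uint8_t  type;          /* partition type byte, 0 = unused */
    uint32_t start_lba;     /* first sector of the partition   */
    uint32_t num_sectors;   /* length in sectors               */
};

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode primary entry `index` (0..3) from a 512-byte sector 0.
 * Returns 0 on success, -1 if the signature is missing or the
 * entry is unused. CHS fields (bytes 1-3 and 5-7) are ignored. */
int parse_dos_partition(const uint8_t *sector, int index,
                        struct dos_partition *out)
{
    assert(sector != NULL && out != NULL && index >= 0 && index < 4);
    if (sector[MBR_SIG_OFFSET] != 0x55 || sector[MBR_SIG_OFFSET + 1] != 0xAA)
        return -1;
    const uint8_t *e = sector + MBR_TABLE_OFFSET + index * MBR_ENTRY_SIZE;
    out->bootable    = e[0];
    out->type        = e[4];
    out->start_lba   = read_le32(e + 8);
    out->num_sectors = read_le32(e + 12);
    return out->type == 0 ? -1 : 0;
}
```

Extended partitions, which chain further tables through entries of a dedicated type, are outside the scope of this sketch.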

12. Explanation Of Some Complex Assembly

23 Aug 1999 - 25 Aug 1999 (6 posts) Archive Link: "complicated inline assembly"

Topics: Assembly

People: Richard B. Johnson, Jeff Epler, Oliver Xymoron

In include/asm-i386/unistd.h, Hiren Mehta found the following assembly code and asked for an explanation:

#define _syscall0(type,name) \
type name(void) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
        : "=a" (__res) \
        : "0" (__NR_##name)); \
if (__res >= 0) \
        return (type) __res; \
errno = -__res; \
return -1; \
}

Richard B. Johnson explained:

This is the user-to-kernel interface for the kernel system calls. It takes the system-call number, puts it into register eax as a function code and executes the software interrupt 80 hex. Upon return, it checks for a negative value, also in eax. If it is negative, it puts the negative of the return value (now positive) into global errno, and returns -1.

Note that this is a MACRO so substitution rules apply for 'type' and 'name' Type would be typically 'int' and 'name' would be system-call number.

Jeff Epler compiled the code and disassembled the result, saying:

Look at the code generated by this (by syscall0(int sync)):

        movl $36,%eax
        int $0x80
        testl %eax,%eax
        jge .L3
        negl %eax
        movl %eax,errno
        movl $-1,%eax

Perform a syscall with __NR_sync in eax. If the result is >=0 (success), then return that value. Else, set errno to -eax and return -1.

Other syscalls are similar, except some other values are loaded into registers (up to 5 or 6 values -- not eax or esp, probably not ebp).

And Oliver Xymoron gave the equivalent C code (with some help from Mike Ricketts):

_syscall0(foo_t, foo) expands into approximately:

foo_t foo(void)
{
        long result;

        /* make the call */

        if(result>=0) return (foo_t) result;

        ITYM errno=result;

        return -1;
}

However, Peter Van Eynde wrote to me after KT's publication, referring to Richard's comments above:

"About the issue of this week, I might add to point 12 that the explanations given are not all true. The kernel does NOT return a "negative number" in case of error. I'm writing a direct-system-call system for CMUCL and had to search long before I found this comment in the glibc2.1 sources:"

| /* Linux uses a negative return value to indicate syscall errors,
|    unlike most Unices, which use the condition codes' carry flag.
|    Since version 2.1 the return value of a system call might be
|    negative even if the call succeeded.  E.g., the seek' system call
|    might return a large offset.  Therefore we must not anymore test
|    for < 0, but test for a real error by making sure the value in %eax
|    is a real error number.  Linus said he will make sure the no syscall
|    returns a value in -1 .. -4095 as a valid result so we can savely
|    test with -4095.  */
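The test that the glibc comment describes can be sketched as follows. This is an illustration written for this article, not glibc's code: the helper names `raw_result_is_error` and `decode_raw_result` and the constant `MAX_SYSCALL_ERRNO` are hypothetical. It treats a raw return value as an error only when it lies in the range -4095 to -1, and otherwise passes the value through even if it is negative.

```c
#include <assert.h>
#include <errno.h>

#define MAX_SYSCALL_ERRNO 4095L   /* range quoted in the glibc comment */

/* Nonzero if a raw syscall return value denotes an error, i.e. it
 * lies in -4095 .. -1. Other negative values are valid results. */
int raw_result_is_error(long res)
{
    return res < 0 && res >= -MAX_SYSCALL_ERRNO;
}

/* Convert a raw return value to the usual C convention: on error,
 * store the positive error number in errno and return -1; otherwise
 * return the value unchanged. */
long decode_raw_result(long res)
{
    if (raw_result_is_error(res)) {
        errno = (int)-res;
        assert(errno > 0 && errno <= MAX_SYSCALL_ERRNO);
        return -1;
    }
    return res;
}
```

Compared with the `_syscall0` macro quoted earlier, the only change is the comparison: `__res >= 0` becomes a range check, so a large successful result that happens to look negative is no longer mistaken for a failure.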

13. 2.2.11 Broken; Development Process Criticized

23 Aug 1999 - 25 Aug 1999 (14 posts) Archive Link: "2.2.11 unstable?"

People: Barry K. Nathan, M Carling, Alan Cox, Jon Masters

Jon Masters upgraded from 2.2.10 to 2.2.11 and his system started locking up, programs started dying, and his network would often die or slow down. Rui Sousa thought it was a hardware problem, but Barry K. Nathan gave a pointer to the Linux 2.2.11 Release Notes and said, "It's a known bug in 2.2.11, actually. It's been fixed in 2.2.12 (which will be out RSN)."

M Carling had a more general critique of why this problem occurred:

If you take a look at the changes, it's not too difficult to see why the "stable" kernels are not as stable as one might like. Lots of changes get in that are not strictly bug fixes. The most direct problem is the one you point out: that the "stable" kernels are unstable. However, there are other problems with a policy of back-porting new features to "stable" kernels. It reduces the incentive to get the current development kernel closed, thereby slowing the development cycle. I think this is a big part of the reason why 2.2 arrived more than two years later than 2.0. In other words, if new features were not added to "stable" kernels, then Linus would not be under so much pressure to continue accepting patches to the developmental kernels.

Enforcing a policy of accepting only bug fixes into the "stable" kernels would have three effects:

  1. the "stable" kernels would become more stable, probably much more stable,
  2. the pressure on Linus (and others) would shift from keeping the development kernel going longer to getting it closed sooner, which would shorten the development cycle (I think this faster development cycle would result in many features getting into the stable kernels sooner rather than later), and
  3. would make Linux much more palatable to enterprise IT departments.

People needing new features right away could either patch the stable kernel themselves or, in the case of popular features, use Alan's ac patches.

Alan Cox replied that the bugs in 2.2.11 were unrelated to back-ported features, adding, "they were caused by other bug fixes to nasty bugs that didnt die before 2.2.0"

14. Rebuilding Partition Tables

24 Aug 1999 - 26 Aug 1999 (3 posts) Archive Link: "Question: finding boundaries of ext2fs-partitions."

Topics: BSD: FreeBSD, FS: FAT, FS: NTFS, FS: ext2

People: Theodore Y. Ts'o, Andries Brouwer

In the course of discussion, Theodore Y. Ts'o posted a list, compiled by Andries Brouwer, of tools to help rebuild partition tables. Although Andries wrote the list a few months ago, Ted felt it was still very much up to date:

  1. findsuper is a small utility that finds blocks with the ext2 superblock signature, and prints out location and some info. It is in the non-installed part of the e2progs distribution.
  2. rescuept is a utility that recognizes ext2 superblocks, FAT partitions, swap partitions, and extended partition tables; it prints out information that can be used with fdisk or sfdisk to reconstruct the partition table. It is in the non-installed part of the util-linux distribution.
  3. fixdisktable is a utility that handles ext2, FAT, NTFS, ufs, BSD disklabels (but not yet v1 Linux swap partitions); it actually will rewrite the partition table, if you give it permission.
  4. gpart is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, minix, reiser fs; it prints a proposed contents for the primary partition table, and is well-documented.

15. Bug Introduced Into 2.3.x Message Queue

24 Aug 1999 (4 posts) Archive Link: "2.3.14 continues to break perl with libc5 (or does 2.3.14 refuse to support libc5?"

People: Alan Cox

David Dyck reported that he had recently been unable to compile the newer 'perl' sources under 2.3.x. Under 2.2.10ac12 there was no problem. Alan Cox explained, "The message queue code was changed by someone without realising the damage done I suspect. It is actually far worse than not compiling. Existing programs using message queues do not work on 2.3.x either because a structure copied into the end users program has totally changed size/layout." Alan replied to himself half an hour later with a patch. EOT.

16. devfs Version 118 Is Available

24 Aug 1999 (1 post) Archive Link: "[PATCH] devfs v118 available"

Topics: FS: devfs

People: Richard Gooch

Richard Gooch gave a pointer to his kernel patches page and announced the latest devfs patch, against 2.3.15-pre3.

17. PPP Over Ethernet

24 Aug 1999 - 25 Aug 1999 (2 posts) Archive Link: "PPPoE?"

Topics: Networking, Version Control

Rene Chaddock asked if Linux would implement PPP over Ethernet, and Bernhard Kaindl gave a pointer to his PPP Over Ethernet page.

18. CMI 3Com Internal Docsis Cable Modem

24 Aug 1999 (2 posts) Archive Link: "CMI 3Com Internal Docsis Cable Modem"

Topics: Modems

People: Joseph W. Breu, Alan Cox

Joseph W. Breu asked if anyone was working on a driver for the 3Com Internal Docsis cable modem. He added, "I have some 3Com people comming tomorrow and was wondering what questions I should ask them to get moving in the right direction for Linux support." Alan Cox replied, "Ask them about documentation availability. 3Com have been very good with documentation for the Linux community for years."

19. New Modutils Maintainer

25 Aug 1999 - 29 Aug 1999 (5 posts) Archive Link: "Re: [RFC] New modutils maintainer"

People: Keith Owens, Bjorn Ekwall

Keith Owens gave a pointer to the modutils FTP site and announced, "Bjorn Ekwall asked for my current patches on August 10 but there are still no updates from him, even after 15 days and 2 reminders. So I reluctantly assume that Bjorn is too busy and declare myself the new modutils maintainer." Later on after some discussion, he added, "I had already privately mailed *EVERBODY* who had contributed to modutils in the past, including the current maintainer. I have been trying to contact them since July 19."

20. Work Started On madvise() System Call

24 Aug 1999 - 26 Aug 1999 (10 posts) Archive Link: "madvise() first draft"

People: Chuck Lever

Chuck Lever posted a fairly large patch against 2.3.15pre3 and announced his first attempt at an implementation of the madvise(2) system call. He wanted some feedback on his algorithm and approach, and added, "i'd like to see this eventually go into the kernel along with the mmap read-ahead stuff i'm working on (and in fact madvise will be more meaningful if the read-ahead stuff is there too)."
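For context, this is roughly how an application uses such a call. The sketch uses the `madvise()` interface as it exists on current Linux systems, which is an assumption here since the summary does not describe the interface in Chuck's draft. The function name `advise_sequential_demo` is hypothetical, and an anonymous mapping is used only to keep the example self-contained; the hint is most meaningful for file-backed mappings, where it can guide read-ahead.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS and madvise() */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes of anonymous memory, tell the kernel the region
 * will be read sequentially, read it once, and unmap it.
 * Returns 0 on success, -1 on any failure. */
int advise_sequential_demo(size_t len)
{
    assert(len > 0);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Advice only: the kernel is free to ignore it. */
    int rc = madvise(p, len, MADV_SEQUENTIAL);

    /* Sequential pass over the region; fresh anonymous pages read
     * back as zero, so the sum doubles as a sanity check. */
    unsigned long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];

    if (munmap(p, len) != 0)
        return -1;
    return (rc == 0 && sum == 0) ? 0 : -1;
}
```

A call such as `advise_sequential_demo(1 << 16)` exercises the whole path on a 64 KB region.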

21. 2.3.15 Announced; Semaphore Code Rewritten

25 Aug 1999 - 30 Aug 1999 (10 posts) Archive Link: "Linux-2.3.15.."

Topics: Kernel Release Announcement, Networking, SMP

People: Linus Torvalds, Andrea Arcangeli

Linus Torvalds announced 2.3.15, saying:

There's a rather huge patch-set out there now, taking the 2.3.x series to 2.3.15.

This has a lot of the merge code I've been sent over the last two weeks, but I will invariably have missed some, if for no other reason than simply that I got absolutely _flooded_ by people sending me patches.

One of the more interesting things was the SMP pipe cleanup sent by Richard, but try as I might it was never really stable under load on x86 - not with the plain semaphores in 2.3.14, and not with the patches Andrea had either. I assume Richard tested it on an alpha with the much more well-thought-out atomic operation that the alpha provides.

I ended up rewriting the x86 semaphore code (and some of Richards pipe code too, for that matter, to get rid of some races in waking things up), and it doesn't show the problems I saw before, but hey, maybe I just exchanged one set of problems for another set that I can't trigger any more. Give me feedback, please.

Other features that don't impact everybody, but are rather major:

Have fun

Regarding Linus' rewriting of the semaphore code, Andrea Arcangeli posted an exploit and replied, "I guess the problem is the pipe code since I understood the old semaphores completly and there weren't SMP races there. Your new semaphores seems completly buggy to me and I am surprised your kernel works without crash or corruption with them."

Linus explained a flaw in Andrea's exploit, and replied:

Well, I certainly saw strange behaviour. The trylock code seemed to be the prime culprit - it tried to decrement the "waking" count, but it could end up doing it too late so that people had already seen a increment from a concurrent "up()".

I'm not saying the new code is bug-free, but it works for me where the old one did not - and your claim that it is obviously broken is also obviously wrong, see later..

They went back and forth for a few posts, with Andrea sending in patches and making suggestions, and at one point Linus explained his reasoning:

You have one choice: fix things up. It already failed, there's no point in doing anything else.

We tried to be clever before. There was absolutely no data that it was ever a win, and there were lots of indications that it was buggy. Let's not make that mistake again. Don't optimize code that doesn't need optimization.

Btw, the case you optimize for is the case that is supposed to be _extremely_ rare even in the presense of contention. You optimize not just for the contention-case, you optimize for the specific case where the values are racing and changing on different CPU's at the same time. Do you _really_ think that it is worth it, considering that you make the semaphore behaviour more complex?

I really don't.

22. Debugging Threaded Applications

25 Aug 1999 - 27 Aug 1999 (16 posts) Archive Link: "Async user space notification from kernel?"

Topics: Ioctls

People: David S. Miller, Alan Cox, Erik Andersen

Erik Andersen asked how the kernel could notify a user-space daemon of an event, without polling an ioctl(). Alan Cox suggested adding select support to a /proc file, and David S. Miller shifted gears with:

it could be quite useful for /proc/${pid}

It might even lead to a nice solution to the problem of debugging threaded applications. Here a lot of the problems with gdb getting things right have to do with reparenting and how thread libraries implement things internally, signals, what have you.

When you can do selects on /proc/${pid}/debug or whatever, a lot of the "what should we do if xxx" questions then have answers.
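
From the user-space side, the idea might look something like the sketch below. It assumes a descriptor that becomes readable when the kernel has something to report. A /proc file with select support is exactly the part that does not exist yet, so the descriptor here is left to the caller, and both function names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Sleep until fd is readable or timeout_ms elapses, instead of polling.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_for_event(int fd, int timeout_ms)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    assert(timeout_ms >= 0);

    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(fd, &readable);

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int n = select(fd + 1, &readable, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &readable)) ? 1 : 0;
}

/* Wait for an event, then consume one byte describing it.
 * Returns 1 on success, 0 on timeout, -1 on error or end of file. */
int read_event_byte(int fd, int timeout_ms, unsigned char *out)
{
    assert(out != NULL);

    int ready = wait_for_event(fd, timeout_ms);
    if (ready <= 0)
        return ready;

    ssize_t n = read(fd, out, 1);
    return n == 1 ? 1 : -1;
}
```

The daemon sleeps inside select() until something happens, which is the whole attraction over calling an ioctl() in a loop.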

23. Module Init Code Handling

26 Aug 1999 (3 posts) Archive Link: "[q] int __init init_module()?"

People: Tigran Aivazian, Alan Cox, David S. Miller

Tigran Aivazian noticed that some drivers had code similar to the following snippet:

#ifdef MODULE
int __init init_module(void)
#else
int __init init_cmpci(void)
#endif

He added:

it makes sense to have __init in front of init_cmpci() but it seems rather suspicious to have it for a module since the code for throwing away .init* stuff is only called from free_initmem() on boot and does not seem to be used on loading modules?

On the other hand, if it *is* needed for init_module() then plenty of other places must be modified to have __init. So, in all cases, some changes are required.

Alan Cox replied, "It isnt needed, but hopefully one day modules will load, init and throw their init code away too after insmod returns" and David S. Miller added, "We had full support for this at one point. Jakub wrote the code, but Linus didn't take the patch set I'd sent him at the time since it was real close to 2.2.x"
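
The mechanism behind a marker like __init is a section attribute: the compiler places the marked function in its own section, so that everything in that section can be discarded in one go once initialization is over. Here is a user-space sketch of just the placement part, assuming GCC and GNU ld on an ELF system. The TOY_INIT macro and the section name toy_init are made up; because the section name is a valid C identifier, GNU ld provides the __start_toy_init and __stop_toy_init symbols around it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's __init marker. */
#define TOY_INIT __attribute__((section("toy_init"), noinline, used))

/* Bounds of the toy_init section, supplied by the linker (GNU ld). */
extern const char __start_toy_init[];
extern const char __stop_toy_init[];

static int toy_ready;

/* One-shot setup code: needed once, so a loader could throw it away. */
TOY_INIT int toy_setup(void)
{
    toy_ready = 1;
    return 0;
}

/* Ordinary code stays in the default text section. */
int toy_is_ready(void)
{
    return toy_ready;
}

/* Returns 1 if addr lies inside the region holding the init-only code. */
int toy_in_init_region(const void *addr)
{
    assert(addr != NULL);
    uintptr_t a = (uintptr_t)addr;
    return a >= (uintptr_t)__start_toy_init &&
           a < (uintptr_t)__stop_toy_init;
}
```

Freeing that region after insmod returns is the step Alan hopes modules will gain one day; nothing in this sketch attempts it.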

24. Improved PLIP Driver

27 Aug 1999 (1 post) Archive Link: "[PATCH] Improved PLIP driver, take 2"

Topics: Networking

People: Nimrod Zimerman

Nimrod Zimerman posted a patch, and announced:

This is a second attempt at an improved PLIP driver. It does the following:

The patch as follows below compiles and works with 2.2.10. It should work with any other 2.2.x. It also compiles (and work) with 2.3.12, and probably with other development kernels, with a slight change.

Please let me know if this needs any more changes. I can see no reason not to include this in the kernel.

The patch can (also) be found at

25. User-Mode Kernel 2.3.15-1um

26 Aug 1999 (1 post) Archive Link: "User-mode kernel 2.3.15-1um"

People: Jeff Dike

Jeff Dike announced a new version of his user-mode kernel at

26. NetBEUI Spec Summarization

26 Aug 1999 (1 post) Archive Link: "I have NetBEUI docs"

People: Aaron Burt

Aaron Burt gave a pointer to the IBM Web Library and added:

Well, a very nice fellow at IBM Web Library sent me an electronic copy of the IBM Local Area Network Technical Reference. It's kind of old (Dec. 1990) but it describes wire-level NetBEUI, info I haven't found anywhere else. Features have been added since then, but I think this'll do to get a basic implementation going.

They are copyrighted, so I seriously doubt I could simply make the docs available, but the first thing I intend to do is summarize it into a simple spec. I don't have a clue how to implement a LAN protocol in Linux, but I'll see what I can do.

I remember NetBEUI coming up occasionally here. If you have any interest, skills or info, feel free to contact me. I'll announce when I have useful info up.

NetBEUI is waning in popularity, but it is a fast protocol and, like IPX, useful in legacy environments. Like DLC, it's commonly used for network printers.

27. Root Filesystem Unrecognized

28 Aug 1999 (3 posts) Archive Link: "2.3.15 hang on boot [PIIX4 IDE problem?]"

Topics: Disks: IDE, PCI, Power Management: ACPI

People: David Woodhouse, Gerard Saraber

Gerard Saraber found that 2.3.15 would hang during boot, while trying to remount the root device read/write. He suspected the IDE controller, and added that he had a Soyo SY-6BA+III with a PIIX4 IDE controller. He also included his /proc/pci file and his kernel .config file. David Woodhouse suggested turning off CONFIG_ACPI, which had solved a similar problem for him with the root filesystem not being recognized. Gerard tried this and it worked. EOT.

28. Remnants Of The Recent Attack

28 Aug 1999 (1 post) Archive Link: "From The Investment FAQ: Types of Mutual Funds"

People: David S. Miller, Matti Aarnio

Continuing the saga of the recent attack on linux-kernel covered in Issue #32, Section #18 (17 Aug 1999: linux-kernel Under Attack), it seems there are still some very low-traffic mailing lists posting from time to time on linux-kernel. It may be months or longer before David S. Miller and Matti Aarnio unsubscribe from them all.

29. IPVS Problems

28 Aug 1999 - 29 Aug 1999 (7 posts) Archive Link: "[PROBLEM] No ARP answer when two ifaces with same IP exist (2.2.x)"

Topics: Networking

People: Julian Anastasov, David S. Miller, Alan Cox, Alexey Kuznetsov

Julian Anastasov reported, "When there are 2 interfaces dummy0 and eth0:1 (in this order) with same IP and dummy is DOWN, there is no ARP response from eth0:1"

He traced the problem to net/ipv4/arp.c:arp_rcv(), specifically the line

if ((tdev = ip_dev_find(tip)) && (tdev->flags & IFF_NOARP))

Alexey Kuznetsov couldn't find that line in his sources and suggested deleting it, but David S. Miller said, "It's a small change from the IP virtual server changes that went into 2.2.12 :-( Alan we have to deal with this somehow, it's causing havoc for many people."
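
One possible reading of Julian's report can be modelled in a few lines. The assumptions are all mine: that the lookup returns the first device carrying the address whether or not it is up, and that the dummy device is flagged no-ARP. The types, flag values and toy_ names are illustrative and are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model flags; the values are illustrative, not the kernel's. */
#define TOY_IFF_UP    0x1u
#define TOY_IFF_NOARP 0x2u

struct toy_dev {
    const char *name;
    uint32_t    ip;
    unsigned    flags;
};

/* Stand-in for ip_dev_find(): the first device with a matching address
 * wins, regardless of whether it is up (an assumption of this model). */
const struct toy_dev *toy_dev_find(const struct toy_dev *devs, size_t n,
                                   uint32_t ip)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (devs[i].ip == ip)
            return &devs[i];
    return NULL;
}

/* Same shape as the quoted test: returns 1 if an ARP request for tip
 * would be ignored. */
int toy_arp_ignored(const struct toy_dev *devs, size_t n, uint32_t tip)
{
    const struct toy_dev *tdev = toy_dev_find(devs, n, tip);
    return tdev != NULL && (tdev->flags & TOY_IFF_NOARP) != 0;
}
```

With dummy0 listed first, the request is dropped even though eth0:1 holds the same address and could have answered. With the order swapped, the model answers, which would fit the "(in this order)" in the report.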

Alan Cox replied, "The IPVS changes didnt go into 2.2.12. The ARP problem was one reason why not;" and Alexey said he didn't understand, and asked for an explanation. Alan posted at more length:

The ipvs folks are doing IP level load balancing with inverse masquerade--ie

       www1   www2    www3    www4     www5   www6

incoming SYN frames create masquerade sessions mapping user->www to user->www1 www2, etc according to load and rules.

In one mode they set things up differently as follows

    |     |     |     |           |          |         |
 MASQ    www   www   www         www        www       www
         www1  www2  www3        www4       www5      www6

Each host is www and a unique name. The masq host arps for www, then tunnels the packet to a www[1-n] but without rewrite. The reply doesnt touch the masq so is a lot faster.

However they need to stop www1->wwwn arping for their www tunnel address

30. Tulip 91g Success On 2.3.15

28 Aug 1999 (1 post) Archive Link: "tulip 91g success on 2.3.15"

Topics: Networking

People: Gerard Saraber

Gerard Saraber announced:

I would like to report success in modifying the tulip.c version 91g to work with linux kernel 2.3.14 and above, I have tested it on my home network by transferring linux-2.3.12.tar.bz2 from one system on my network to my development system which is using the tulip driver. The first time I tried to transfer the kernel it stalled (the tulip dropped off the network) at 7.5Mb transferred (at over 700kbps) I did ifconfig eth0 down and ifconfig eth0 up right after .. telnetted some, sent some email ftpd some more and tried to ftp the linux-2.3.12.tar.bz2 again .. this time success, again the speed is over 700kbps.

So it works for me :-)

Since the driver is over 100kb (39kb gzipped) I'm not attaching it here, If you want to try it, it can be optained through ftp:

I have the tulip.c, a gzipped version and a patch against the tulip v91g from

up there. I have made it clear in the header of the file and the startup banner that I'm the one who modified the file, so please don't send any nasty bugreports to Donald or the linux-tulip list without verifying that the bug exsists in the original tulip driver as well.

31. Low-latency Patches Benchmarked; Linus On BeOS

28 Aug 1999 - 30 Aug 1999 (27 posts) Archive Link: "Low-latency patches working GREAT (<2.9ms audio latency), see testresults ,but ISDN troubles"

People: Benno Senoner, Linus Torvalds, Ingo Molnar

Benno Senoner gave a pointer to his benchmarks of Ingo Molnar's latest low-latency patches. He reported excellent results, with some peaks at 2.9 ms. He pointed out, "With Mingo's patches the Linux low-latency performance comes very close to BEOS, and is much much better (3-4 times) Windows on the same hardware. It's now time to stress audio-software vendors to port their cool apps to Linux," and added, "I think most of us want to have these "low-latency" features in the upcoming 2.4 kernel since it will make Linux a very good _MULTIMEDIA_OS_."

On the negative side, he added, "The disk performance decreases by 10-25% when I increase the CPU load in the "latencytest" bench."

Someone replied that a 25% disk I/O decrease was very serious, and they wanted to get feedback from folks running internet and database servers before alienating server users in order to compete with BeOS.

To this, Linus Torvalds said, "Guys, if anybody thinks we're competing with BeOS, then wake up. BeOS is a niche OS that isn't worth competing against, and at most we can try to find out what it's good at and see if we can emulate some of it. But 25% disk IO decrease is definitely not something we want to even consider."

There was some discussion about the benchmarks, the patches, and Linus' comment about BeOS.

32. Booting From CD

28 Aug 1999 - 30 Aug 1999 (5 posts) Archive Link: "Little-known features of El Torito Spec"

Topics: FS: ext2

People: Dan Shearer, Theodore Y. Ts'o, H. Peter Anvin, Peter Horton

Dan Shearer said:

Anyone who has sweated to build a 2.88 CD boot image from syslinux and the skinniest fat kernel possible will understand why I'm posting this. There is only a minor impact on the kernel, but overall it could make quite a difference to the way Linux installations are done.

I was reading through the El Torito specification a few months ago in order to debug something and noticed that the excellent mkisofs doesn't support some very interesting features of it. Neither does any other CD image mastering program I know of. As a result every x86 OS installation facility is somewhat crippled. Linux isn't as crippled as NT from this perspective, but still, booting from CD is nothing like home.

I am referring to the PDF of the hacked-up spec that we all have to use for CD booting, This could turn into a major marketing feature for Linux as well as relieving those poor people who have to write 2-stage installation programs.

First some observations on the annoying habits of everyone's install procedure:

So what does El Torito let us do?

  1. You can have multiple boot images

    A quick and simple alternative solution to what most people currently use that would help a bit is to have an initial CD image on a 2.88 floppy with little more than the kernel and some kind of mixed human and/or automatic logic to choose which of an arbitary number of other floppy boot images to boot as a secondary bootstrap. This might be the quick and dirty option for getting around current space problems. As far as I can tell you can chain around between images all day within the spec. I think the SuSE installer does this to some extent when they offer either an installation or an emergency boot floppy. But I haven't dived into it to see for sure.

  2. You can have no-emulation booting.

    This is the really interesting bit. Quote from the PDF in section "5.3 No Emulation Booting":

      --- start quote -------------
      When the Media Type is set to zero the BIOS does not use the CD to emulate a disk. The boot operation loads the requested number of sectors directly to the specified segment. When loading is complete the BIOS will jump to segment:0. The associated piece of software can be a "loader" (which provides its own CD interface), or it can be a stand alone program. The El Torito specification allows for the loading of FFFF sectors (This would allow the BIOS to fill the entire low 640k memory area with data). Once the system jumps to segment:0, the program can retrieve its boot information by issuing INT 13, Function 4B, AL=01. After the boot process has been initiated the INT 13 Extensions (functions 41-48) will access the CD using 800 byte sectors and the LBA address provided to INT 13 is an absolute sector number. This gives any program running in no emulation mode the ability to locate the boot catalog, and any other information on the CD, without providing a device driver.
      ---- end quote ----------

    In other words you can get a CD image to boot from a kernel jump-point. This isn't like a hard drive bootstrap, this actually loads up to 640k of data into memory and then sets the IP to an entry point. The kernel can then work out how to mount /. There would be no commandline parameters so I would think there would have to be a compile-time option to say "root fs on CD ROM".

    Provided we can get clever with the CD ROM drivers this is exactly what we need; the net effect would be that the installer developers could treat the CD exactly like a hard drive, with access to as much data as they liked. Booting Unix from read-only filesystems was solved years ago.

    OK, What Next?

    Someone has to sit down with an ISO image and do some sector editing and build one of these. Once we know what works and what doesn't then we can hack mkisofs. I'd probably be tempted to hack mkisofs first, it's fairly obvious what needs to change.

    I don't have any reason to do this apart from hack value. If someone else does I'm sure they'll get to it long before I've even looked at it. It might make sense for one of the distributions to fund some experiements.

    To head off some obvious suggestions:

    • I sent this to [email protected], the author of mkisofs. Bounced.
    • There was a thread called "Merging EXT2 and El Torito" on linux-kernel over a year ago, and it was nothing to do with this.

Theodore Y. Ts'o pointed out that the mkisofs author's email had changed to [email protected] when he changed ISPs. Ted also said:

This sounds really interesting! The big question is how many systems actually faithfully implement this part of the El Torito spec. Sorry for being paranoid, but the reality is that if it's not used commonly, there is an all-too-unfortunately high probability that some cheasy Taiwan-special (or even made in America) hardware won't support it correctly. <Insert standard grumbling about the cheap-sh*t Wintel hardware industry.>

Unfortunately, the only way I know for sure to check would be to make some ISO images available and ask people to test it and see whether it works.

Dan reported that a lot of other folks had emailed him privately with their interest, and that Peter Horton had given him a pointer to Colonel Panic, which had a patch for 'mkisofs' to allow non-emulation/hard disk booting. Dan added, "So if you or someone else gets to publishing test images before I do then good! The upshot appears to be no kernel mods required, all work done except for mass testing of BIOSes."

H. Peter Anvin replied:

Definitely. Note that you can't just bootstrap a bzImage this way; you still need a boot loader to go before the kernel. As I've already mentioned I intend for lbcon to support this booting mode (with full access to the ISO 9660 filesystem), but lbcon isn't ready for prime time just yet.

*Furthermore*, no-emulation does have a limit on the size of the image (640K), which means it may not be suitable to simply attach a stub loader to the image anyhow (after all, you might as well just make a disk image if you're going to do that.)

Hard disk emulation is known to be broken on many BIOSes, and as such isn't really an option.
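
The numbers in this thread are easy to check with a little arithmetic. The sketch below assumes that the "800 byte sectors" in the quoted spec text is hexadecimal (0x800 = 2048 bytes, the usual CD-ROM data sector size) and takes the 640K figure at face value; the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CD_SECTOR_BYTES  = 0x800,       /* "800 byte sectors", read as hex: 2048 */
    LOW_MEMORY_BYTES = 640 * 1024   /* the 640K area both posts mention */
};

/* Number of whole CD sectors needed to hold nbytes (rounds up). */
uint32_t cd_sectors_for(uint32_t nbytes)
{
    assert(nbytes <= UINT32_MAX - CD_SECTOR_BYTES);   /* avoid wrap-around */
    return (nbytes + CD_SECTOR_BYTES - 1) / CD_SECTOR_BYTES;
}

/* Absolute sector number (LBA) containing a byte offset on the disc. */
uint32_t cd_lba_of(uint64_t byte_offset)
{
    return (uint32_t)(byte_offset / CD_SECTOR_BYTES);
}

/* Returns 1 if an image of nbytes fits within the 640K limit. */
int fits_no_emulation(uint32_t nbytes)
{
    return nbytes <= LOW_MEMORY_BYTES;
}
```

On those assumptions the whole low-memory area is only 320 CD sectors, which gives a feel for why H. Peter Anvin wants a real boot loader in that space rather than a kernel image with a stub attached.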

33. Linux 2.2.13pre1

28 Aug 1999 - 30 Aug 1999 (10 posts) Archive Link: "Linux 2.2.13pre1"

Topics: Executable File Format, FS: NFS, FS: ext2, I2C, PCI

People: Trond Myklebust, David Weinehall, Alan Cox, Andrea Arcangeli, Martin Mares, Matthias Riese, David Woodhouse, Pauline Middelink, Stephen Tweedie, Mikael Pettersson, Riley Williams

Alan Cox reported the changes in the latest prepatch for the stable series:

o       execve() fix - based on one by          (Tymm Twillman)
o       ext2fs flag fixes                       (Matthias Riese)
o       i2c tuner update                        (from Pauline Middelink)
o       bttv schedule on irq fix
o       Console race fixes/klogd                (Andrea Arcangeli)
o       Ensure version is up to date            (David Woodhouse)
o       QlogicFC fixes                          (Chris Loveland)
o       Fix memory leaks in the serial layer    (Armin Groesslinger)
o       ARM sound fixes                         (Phil Blundell)
o       Assorted warning cleanups               (Riley Williams)
o       Fix arcnet bug in 2.2.12                (Riley Williams)
o       Small NFS fixes                         (Trond Myklebust)
o       Updated sb1000 docs                     (Clemmitt Sigler)
o       Fix IPX packet handling                 (Kelly French)
o       PCI multifunction fixes                 (Martin Mares)
o       Back out mmap resource change           (Dick Streefland)
o       Minor cleanups                          (Mikael Pettersson)
o       Fix vt console print                    (Andrea Arcangeli)
o       Rate limit a.out binfmt errors          (Me)
o       Generate different ksyms for 1G/2G      (Me)
o       Small cleanups                          (David Weinehall)
o       Munmap, vm cache fix                    (Stephen Tweedie)

34. PCI Serial Driver Ready For Testing

29 Aug 1999 (1 post) Archive Link: "PCI serial driver ready for testing"

Topics: PCI

People: Theodore Y. Ts'o

Theodore Y. Ts'o announced:

After a very long delay, caused by my being terminally busy at MIT, and then changing jobs, I've finally gotten an update to the serial driver which supports PCI patches. It can be found at:

This driver supports for the Oxford Semiconductor 16C950 UART, and good collection of PCI boards (basically everything for which people have sent me patches, or for which I've been able to get my hands on the PCI serial cards).

I've tested this driver with ConnectTech, Sealevel, and GTek serial boards, and it has support for the SPCom 200 and Keyspan boards as well. I haven't tested the latter since I don't have the boards, though.

The sources are designed to work on either 2.2 or 2.3 kernels, and as shipped it comes with a Makefile which allows you to compile the serial driver as a stand-alone module outside the kernel tree. Alternatively, it should be pretty obvious how to copy the relevant source files (serial.c, serialP.h, serial_reg.h, serial.h, etc.) into the right places into the kernel, which will build the new serial driver as part of 2.2 or 2.3 kernel.

I'd like to send an update of this driver to Linus fairly soon for inclusion into the 2.3 mainline, so please send me any comments you might have. In particular, if you have some other PCI serial boards other than the ones supported by this driver, please send me the vendor, device, subvendor, and subdevice id numbers, how the PCI board interfaces to the system (mapped I/O memory, I/O ports) and what clock is used to drive the UART (i.e., what base_baud setting is needed; some drivers use a faster clock crystal which allows the port to run at speeds greater than 115200 bps).
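
The base_baud remark is worth a tiny worked example. On a 16550-style UART the line rate is base_baud divided by an integer divisor, so a board with a faster crystal has a larger base_baud and can reach rates above 115200 bps. The helper below is a hypothetical sketch, not code from Ted's driver, and it simply rejects rates that do not divide evenly, which is a simplification.

```c
#include <assert.h>

/* base_baud is the fastest rate the port can run at (divisor of 1).
 * Returns the divisor to program for the requested rate, or 0 if the
 * rate cannot be produced exactly from this clock. */
unsigned uart_divisor(unsigned base_baud, unsigned baud)
{
    assert(base_baud > 0);
    if (baud == 0 || baud > base_baud || base_baud % baud != 0)
        return 0;
    return base_baud / baud;
}
```

With the traditional base_baud of 115200, a 9600 bps line needs a divisor of 12 and 230400 bps is out of reach. With a base_baud of 921600, 115200 bps becomes a divisor of 8.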

35. Hitachi SuperH (SH3/SH4) Port

30 Aug 1999 (1 post) Archive Link: "Hitachi SuperH (SH3/SH4) port"

Niibe Yutaka announced his successful (it boots, mounts a filesystem, and runs "hello world") port of Linux to Hitachi's SuperH (SH-3). He gave pointers to the full code or a patch against 2.2.10.

36. ROCK Linux Distribution V. 1.3.0

30 Aug 1999 (1 post) Archive Link: "ROCK Linux 1.3.0"

People: Clifford Wolf

Clifford Wolf gave a pointer to the ROCK Linux Distribution and announced version 1.3.0; he also gave a pointer to the changelog. The distribution is based on Linux 2.3.15 and XFree86 3.9.15; he added, "*WARNING*: There is a good reason for calling it a "development tree". If you are not interested in development you should take the stable ROCK Linux 1.2.0. You should install the development releases only on a seperate disk or a seperate computer where no importand data can be lost!"

37. Assembly Warnings Remain Unfixed

30 Aug 1999 (3 posts) Archive Link: "Assembler warnings 2.2.12"

Topics: Assembly

People: Horst von Brand, Alan Cox

Someone noticed the following warnings during compilation:

make[1]: Entering directory `/usr/src/linux-2.2.12/fs'
gcc -D__KERNEL__ -I/usr/src/linux-2.2.12/include -Wall -Wstrict-prototypes
-O2 -fomit-frame-pointer
+-fno-strict-aliasing -pipe  -m486 -malign-loops=2 -malign-jumps=2
-malign-functions=2 -DCPU=686 -DMODULE   -c -o
+binfmt_aout.o binfmt_aout.c
{standard input}: Assembler messages:
{standard input}:1019: Warning: using `%eax' instead of `%ax' due to `l' suffix
{standard input}:1019: Warning: using `%eax' instead of `%ax' due to `l' suffix

Horst von Brand replied, "I've sent a fix for this (and assorted other warnings) to Alan Cox, which for the case under discussion (and similar ones) was rejected by Linus in the end. The trouble is that the "right" way to handle this is to change the %ax et al to %eax and family, or use the "w" forms of the affected instructions. New binutils (current betas) handle these changes right, older ones generate idiotic code for them (unneeded prefixes, AFAIR). With what is in the kernel right now, new binutils generate the right code and complain, older ones generate the right code and keep quiet."
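
The two "right" options Horst mentions come down to making the instruction suffix agree with the operand width. The kernel instructions that trigger the warning are not quoted in the thread, so the following is only a generic illustration, assuming GCC-style inline assembly on x86 (other architectures fall back to plain C); copy16() and copy32() are invented names.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit move: the "w" suffix matches the 16-bit register operands,
 * so the assembler has nothing to reinterpret. */
uint16_t copy16(uint16_t v)
{
    uint16_t out;
    assert(sizeof(out) == 2);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("movw %1, %0" : "=r"(out) : "r"(v));
#else
    out = v;    /* non-x86 fallback, no assembly involved */
#endif
    return out;
}

/* 32-bit move: the "l" suffix with 32-bit register operands. */
uint32_t copy32(uint32_t v)
{
    uint32_t out;
    assert(sizeof(out) == 4);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("movl %1, %0" : "=r"(out) : "r"(v));
#else
    out = v;    /* non-x86 fallback, no assembly involved */
#endif
    return out;
}
```

The warning in the log appears when the two are mixed, an "l" suffix on a 16-bit register name, at which point the assembler quietly substitutes the 32-bit register and says so.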







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.