Kernel Traffic

Kernel Traffic #45 For 6 Dec 1999

By Zack Brown

Since there was no KT last week, this issue tries to cover the past two weeks.

Mailing List Stats For This Week

We looked at 1827 posts in 8248K.

There were 593 different contributors. 259 posted more than once. 167 posted last week too.

1. vfork() Discussion And Flame Fest

1 Nov 1999 - 19 Nov 1999 (57 posts) Archive Link: "Re: vfork"

Topics: BSD, Ioctls, POSIX, SMP, Virtual Memory

People: Andries Brouwer, Theodore Y. Ts'o, Linus Torvalds, Kai Henningsen, Nate Eldredge, Alan Cox, Wichert Akkerman, Ralf Baechle

Andries Brouwer started things off with a bang by creating a manpage for the vfork() system call. He posted the entire manpage, prefacing it with the following critical remarks:

People tell me that vfork() no longer is equivalent to fork() as the manpage states. Unfortunately, they are right, so I wrote a new page - see below.

I consider the introduction of vfork into Linux a very bad move (as will be clear from the text I wrote), but since there were people writing code and submitting patches there must be some positive side to this horrible kludge.

If nobody corrects me, this will be the vfork man page in man-pages-1.27.

Since it's fairly short, I include the manpage here as he posted it:

VFORK(2)             Linux Programmer's Manual             VFORK(2)

       vfork - create a child process in a broken way

       #include <unistd.h>

       pid_t vfork(void);

       The  vfork()  function  has the same effect as fork(), except that
       the behaviour is undefined  if  the  process  created  by  vfork()
       either  modifies any data other than a variable of type pid_t used
       to store the return value from vfork(), or returns from the  func-
       tion  in  which  vfork()  was  called, or calls any other function
       before successfully calling _exit() or one of the exec  family  of

       EAGAIN Too many processes - try again.

       ENOMEM There is insufficient swap space for the new process.

       vfork,  just  like fork(2), creates a child process of the calling
       process.  For details and return value and errors, see fork(2).

       Under Linux, fork() is implemented using copy-on-write  pages,  so
       the  only  penalty  incurred  by  fork()  is  the  time and memory
       required to duplicate the parent's page tables, and  to  create  a
       unique task structure for the child.  However, in the bad old days
       a fork() would require making a complete copy of the caller's data
       space,  often  needlessly, since usually immediately afterwards an
       exec() is done. Thus, for greater efficiency, BSD  introduced  the
       vfork  system  call,  that did not fully copy the address space of
       the parent process, but borrowed the parent's memory and thread of
       control  until  a call to execve() or an exit occurred. The parent
       process was suspended while the child  was  using  its  resources.
       The  use  of vfork was tricky - for example, not modifying data in
       the parent process depended on knowing which variables are held in
       a register.

       It  is rather unfortunate that Linux revived this spectre from the
       past.  The BSD manpage states: "This system call  will  be  elimi-
       nated when proper system sharing mechanisms are implemented. Users
       should not depend on the memory sharing semantics of vfork  as  it
       will, in that case, be made synonymous to fork."

       Formally  speaking,  the  POSIX  description  given above does not
       allow one to use vfork() since a following exec  might  fail,  and
       then what happens is undefined.

       Details of the signal handling are obscure and differ between sys-
       tems.  The BSD manpage states: "To avoid a possible deadlock situ-
       ation,  processes  that  are children in the middle of a vfork are
       never sent SIGTTOU or SIGTTIN signals; rather,  output  or  ioctls
       are  allowed  and  input attempts result in an end-of-file indica-

       The vfork() system call occurs in the BSD 2.9.1 (but  not  in  the
       2.8)  manual  pages.   In  Linux, it has been equivalent to fork()
       until 2.2.0-pre6 or so. Since 2.2.0-pre9 (on i386, somewhat  later
       on  other architectures) it is an independent system call. Support
       was added in glibc 2.0.112.

       The vfork call may be a bit similar to calls with the same name in
       other  operating  systems.  They all resemble fork(), have obscure
       semantics, and are not really faster today than fork().

       clone(2), execve(2), fork(2), wait(2)

Linux 2.2.0                 1 Nov 1999                          1
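
As an illustration of the usage pattern the manpage describes (this sketch is not from the thread), the only thing a vfork() child can safely do is call one of the exec functions or _exit(). The helper name run_with_vfork and the exit code 127 for a failed exec are assumptions made for this example.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `path` with `argv` in a child created by
 * vfork() and return the child's exit status, or -1 on error.
 */
int run_with_vfork(const char *path, char *const argv[])
{
    int status;
    pid_t pid;

    assert(path != NULL && argv != NULL);

    pid = vfork();
    if (pid < 0)
        return -1;          /* vfork() itself failed */

    if (pid == 0) {
        /* Child: it borrows the parent's memory, so it only execs or exits. */
        execv(path, argv);
        _exit(127);         /* exec failed: use _exit(), never exit() or return */
    }

    /* Parent: it resumes only after the child has exec'd or exited. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_with_vfork("/bin/sh", argv) with argv set to { "sh", "-c", "exit 7", NULL } should return 7 on a system that has /bin/sh. Everything the child needs is prepared before vfork() is called, which keeps the child within the rules quoted above.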

Theodore Y. Ts'o replied:

While in general, I agree with you that vfork is not the nicest thing in the world, it's not necessarily much worse than some of the other things you can do with sys_clone(), which also allows two processes to share virtual memory space.

Despite BSD's man page warning off folks from depending on vfork's semantics, there were programs --- including BSD's csh, which did use vfork() so that the path hashing statistics could be accurately updated. So, there is some excuse for actually implementing vfork() in the traditional bad old way.

I think you were being a little bit too harsh, myself, but it is true that warning people that using it compromises the portability of their programs is perfectly fair.

Someone replied that because vfork() had been merely an alias for fork() for so long, many programmers used vfork() reflexively, on the assumption that it would one day become a "better" fork(). The poster suggested loud warnings about the new situation.

Linus Torvalds replied to the "POSIX DESCRIPTION" in Andries' original announcement, saying:

Just describe it the way it works.

vfork() under Linux is actually just another case of clone(), and the old reasons why it was considered horrible are basically all gone. The Linux mm layer evolved to the point where it was trivial to implement, WITHOUT any of the special hacks that the original BSD implementation had, and that made people hate it in the BSD community.

So the ugly part about vfork() doesn't exist any more, yet the good attributes still do.

But Kai Henningsen pointed out, "Well, in 2.2.12 it seems strace can't follow vfork()s without some kernel patch. Or so I gather from the docs and the failure to perform."

A surprised Linus replied:

I always thought that was just because strace didn't understand about vfork().

On the other hand, maybe it's a generic problem with strace - if it gets the child PID from the return value of "fork()", and attaches to it that way, that has two problems:

Who's maintaining strace these days?

Someone gave a pointer to the strace homepage, adding that Wichert Akkerman, the Debian Project Leader, was the maintainer.

Kai quoted from the 'strace' manpage:

-f          Trace child processes as they are  created  by
            currently  traced processes as a result of the
            fork(2)  system  call.   The  new  process  is
            attached  to  as  soon  as  its  pid  is known
            (through the return value of  fork(2)  in  the
            parent process). This means that such children
            may run uncontrolled for a  while  (especially
            in  the  case of a vfork(2)), until the parent
            is scheduled again to complete its  (v)fork(2)
            call.    If  the  parent  process  decides  to
            wait(2) for a child that  is  currently  being
            traced,  it  is suspended until an appropriate
            child process either terminates  or  incurs  a
            signal  that  would  cause it to terminate (as
            determined from  the  child's  current  signal

-F          Attempt to follow vforks.  (On SunOS 4.x, this
            is  accomplished  with  some  dynamic  linking
            trickery.  On Linux, it requires  some  kernel
            functionality not yet in the standard kernel.)
            Otherwise, vforks will not be followed even if
            -f has been given.

Linus replied:

That pretty much sucks, as it makes it pretty much impossible to strace a normal fork() on SMP machines.

Sounds like the correct fix is not vfork()-related at all, but rather a flag to clone() to set "trace child process", so that the new child starts out stopped and traced. And some way for strace to get at the child pid.

Nate Eldredge came in, saying:

The man page is inaccurate with respect to the fork tracing mechanism. What happens is that strace uses ptrace(PTRACE_POKETEXT...) to insert an infinite loop following the trap instruction (so it becomes `int $0x80; jmp .'). So no matter when the child is spawned, it will get stuck in this loop. When the fork syscall returns in the parents, strace inspects the return value to find the PID of the child. It then attaches to this process, patches the original code back in, and sends it merrily on its way. (It also patches the original code back in to the parent.) So fork tracing should work regardless of the number of processors.

For vfork this won't work, as the parent won't return until the child has done something (exec or exit), and the child is looping. As a hack to work around this, when strace intercepts a vfork, it uses the ptrace(PTRACE_POKEUSR, ORIG_EAX...) mechanism (this is the "functionality not yet in the standard kernel", as that mechanism was disallowed until recent 2.3.x, and required a patch to enable. I will update the man page to reflect this and to fix the `fork' error.) to change the system call back into a plain old fork. I wrote that hack, and decided that it was reasonable on the grounds that vfork used to just call fork at the libc level, and anyone who depended on getting special vfork semantics was asking for trouble anyway.

In short, fork tracing works, and vfork tracing works with a new kernel or a patch.

To Linus' statement that what was needed was a flag to clone(), Alan Cox replied:

Its such a brilliant idea that we already had it...

        if (!(clone_flags & CLONE_PTRACE))
                new_flags &= ~(PF_PTRACED|PF_TRACESYS);

That was added for the threaded-gdb folks a while back 8)

Wichert replied to Alan, "Which means strace would need to turn fork into a clone call iff we're running on Linux and iff we have a recent enough kernel, right? That would mean adding more conditions to the code where it's already messy enough.. although it would make it easier to port (right now it goes wrong for mipsel for example, I think it still uses i386-instruction on a mips processor.. oops!)." Ralf Baechle replied to Wichert, "I wonder what you're talking about. I'm sure I didn't leave i386 code in the MIPS stuff - it wouldn't even assemble :-)" No reply came back on the list.

Linus replied to Alan as well, saying, "It doesn't notify the right parent, though. It also requires the tracer to turn a fork()/vfork() into a clone(), but I guess that's ok. The lack of re-parenting looks like a killer, though."

But he replied to himself 10 minutes later, with:

Hmm.. A sufficient fix for that might be kernel/fork.c:

        if ((clone_flags & CLONE_VFORK) || !(clone_flags & CLONE_PARENT))
                p->p_pptr = p->p_opptr = current;

would instead become

        p->p_opptr = current;
        if (!(clone_flags & CLONE_PARENT))
                p->p_pptr = current;

ie the original parent would _always_ be the "real" parent - which is what CLONE_VFORK has to use anyway, but CLONE_PARENT would make the "logical parent" be the same as the cloner. This should make debugging happy (the debugger stays as the logical parent), yet should be ok for pthreads like behaviour too, ie the reason for CLONE_PARENT in the first place.
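
The difference between the two rules can be tried out in a small user-space model. This is only a sketch: the struct, the MODEL_ flag names and the helper functions are invented for illustration, and it assumes the child starts out as a copy of the cloner and so inherits both parent pointers before either rule runs.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names for this model only; they are not the kernel's definitions. */
#define MODEL_CLONE_VFORK  0x4000u
#define MODEL_CLONE_PARENT 0x8000u

struct model_task {
    struct model_task *p_opptr;  /* original ("real") parent */
    struct model_task *p_pptr;   /* logical parent, e.g. a debugger */
};

/* Assumption: the child begins as a copy of the cloner. */
struct model_task model_copy(const struct model_task *current)
{
    assert(current != NULL);
    return *current;
}

/* The rule quoted as the existing code in kernel/fork.c. */
void set_parents_old(struct model_task *p, struct model_task *current,
                     unsigned flags)
{
    if ((flags & MODEL_CLONE_VFORK) || !(flags & MODEL_CLONE_PARENT))
        p->p_pptr = p->p_opptr = current;
}

/* The rule Linus suggests instead. */
void set_parents_new(struct model_task *p, struct model_task *current,
                     unsigned flags)
{
    p->p_opptr = current;            /* original parent is always the cloner */
    if (!(flags & MODEL_CLONE_PARENT))
        p->p_pptr = current;         /* otherwise keep the inherited one */
}
```

In this model, a cloner whose own parent pointers both refer to a debugger and which clones with MODEL_CLONE_PARENT ends up, under the old rule, with a child whose two pointers both refer to the debugger. Under the new rule the child's p_opptr refers to the cloner while p_pptr still refers to the debugger.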

The subthread ended here, but the inflammatory man page was not forgotten. In another branch of the discussion, a flame war erupted. Andries maintained that there was nothing wrong with his factual description of vfork(), and that anyone who thought otherwise should submit patches to him. Opponents argued that manpages should be straightforward descriptions of functionality, without any judgements attached, and that the debate on the relative merits of vfork() should not be conducted in documentation. Some folks wanted to see the vfork() manpage dropped altogether, but Andries pointed out that the syscall existed and therefore should be documented.

2. Read/Write Semaphores

8 Nov 1999 - 26 Nov 1999 (81 posts) Archive Link: "shm bug introduced with pagecache in 2.3.11"

Topics: FS: FAT, Real-Time

People: Manfred Spraul, Linus Torvalds, Alan Cox, Benjamin C.R. LaHaise, Arjan van de Ven, Alexander Viro, Richard Guenther

In the course of discussion, Manfred Spraul said, "I'm sure that for multi-threaded applications, the mmap performance of Linux will be poor because everything is single-threaded." He added that he'd do a benchmark comparing Linux and WinNT/95, and Linus Torvalds said, "I will bet you 5 bucks we'll kick ass."

Manfred did the benchmark, and said:

You've lost:

Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(], 620,000,000 bytes test file, fat filesystem, the same disk is used for NT and Linux.

command: "./pagein fill 150000 #" where fill is the filename, 150000 means 150000 pages are trashed, and # is the number of threads.

Linux:
threads   pages/sec
1         13
4         14
64        14
256       ? [computer unresponsive]

NT:
threads   pages/sec
1         18
4         20
64        28
256       31
512       33

Linux is slower, and it cannot use multiple threads to reorder the sector reads; NT gets faster if I add further threads.

source code is at

Linus had some minor objections to Manfred's use of a FAT filesystem for his benchmark, but added, "I don't think this can/will be fixed for a 2.4 timeframe, especially as I haven't heard of any real-life usage where it would be an issue.."

Manfred replied that the filesystem wouldn't affect the numbers, and Richard Guenther added that he was working on a real-time audio-processing tool that would indeed suffer from the situation. He asked if there really was no hope of a fix before 2.4; Linus replied:

I looked at the thing quickly, and there _is_ actually a reasonably quick fix, and one that I would like to have for other reasons anyway: having read-write semaphores.

We don't have them, but they should not be fundamentally any worse to implement than the current semaphores are, and they have occasionally come up as useful things to have. They would certainly fit the bill for this particular problem very well (a page fault would get a read lock, while a mmap() would get a write lock - multiple page faults can happen in parallel).

The one race with multiple page faults that we have is handled nicely by the page table spinlock, which we introduced for kswapd anyway..

Elsewhere, Alan Cox pointed out that high-performance servers like Typhoon and Zeus would also be affected, but he added, "Fortunately these guys tend to be using pretty serious I/O subsystems not M/O disks and they are fine with 2.2." Manfred asked whether Alan knew if those servers were using mmap, and Alan replied, "Yes. Typhoon uses threaded mmap so aggressively it became an unintentional test suite for the Linux mm layer, and in 2.0/2.1 it found a lot of bugs."

At around this point in the discussion, Linus said:

Well, the more I look at a read-write semaphore, the more I like it: it looks like something that once the semaphore implementation itself was done, the MM side would be absolutely trivial. It does introduce a new issue (multiple threads updating the page tables at the same time), but that one doesn't look that horrible..

We don't ever export the page table handling to the low-level filesystems any more (we used to a long time ago: the nopage() function got to touch the page tables itself rather than just return the right page), so fixing up the new issue is actually a very local fix in mm/memory.c.

Is anybody willing to take a stab at creating a read-write semaphore?

Arjan van de Ven mentioned that "UNIX Systems for Modern Architectures" by Curt Schimmel discussed read-write semaphores on page 234. Arjan asked if Schimmel's ideas were an acceptable starting point, but Linus replied:

Nope, not acceptable. The mm semaphore is one of the most timing-critical in the whole kernel. It usually has absolutely zero contention, but it needs to be FAST. Basically, a read-lock() must look something very very similar to the read-spinlock implementation, ie something like

lock ; incl (%ecx)
js fixup

for the successful fall-through case. Two instructions, no more. That's what the spinlocks do, and that's also what the semaphores do (although in the case of a semaphore, it's a "decl" in that case).

The "fixup" case is going to be more complex than for spinlocks: for spinlocks it's just a simple loop, while for semaphores you get all the complexity that you see in arch/i386/kernel/semaphore.c to handle the thing cleanly..

The read-write semaphore should be doable with the same skeleton as the normal semaphores, although it needs two counters (regular semaphores have just "sleepers", rw-semaphores need to have "read_sleeper" and "write_sleeper" counts etc).

Alexander Viro asked if Linus wanted readers' and writers' waitqueues to share the spinlock, and Linus said:

I would go for something very similar to the current semaphore implementation - one global spinlock for all rw-semaphores, and only if that actually becomes a real contention point do we try to be more clever (starting with moving it to a per-semaphore thing, and only as a last thing doing separate wait-queues with separate spinlocks).

I doubt you'll get much contention. The current semaphores get very little contention - the test-case that triggered this discussion in the first place is probably the worst one by far, and that test-case will have no contention at all with the read-write version because 99% of everything is just readers.

The holy grail is "Make it as simple as possible. And no simpler"

Elsewhere he described in more detail what he was after:

I'll see if I can get a free afternoon some day and try to port the current x86 semaphore code over to a rw version too. The plan was something like this:

where all the three contention cases grab a "contention spinlock" before they then start sorting things out. The only interesting part is making sure that the contention case gets the wakeups, and the above counts on:

All other races should be trivially handled by just having the spinlock, so the only really hard cases are the fast-path stuff where we cannot get the semaphore because it is too expensive.

Does anybody see any holes in the above pseudo-implementation? Please take a look at the way the current x86 semaphores are implemented: they use exactly the above kinds of single-atomic-instruction-plus-condition-codes trickery to get the non-contention case without _any_ extra instructions.

There was a bit of an implementation discussion involving Linus and others, but elsewhere, Benjamin C.R. LaHaise took a different approach and even wrote some code. He posted his patch, saying, "Here's my implementation of rw semaphores for x86. It's 2 instructions for the non-contended case of down_(read|write) and up_read as I outlined yesterday. I've still got to test the contention case a bit more to be satisfied before I remove the readers/writers assertions, but I'd like it if people could give it an eyeballing and comment."

Linus replied:

This looks extremely good. I'll have to read it through a few times (and then a few more times just for luck), but on the face of it it looks solid and really clever. Me likee.

And here I thought _I_ was being subtle.

I wonder if you might be convinced to use the same approach for the rw-spinlocks too? I used to like my rw-lock implementation, but hey, I can recognize genius when I see it, and this sure looks like it.

(Having the same logic for both the spinlock and the sleeping version will make for less confusion, and I feel better about having _one_ clever trick used twice rather than having two different clever ways to do essentially the same thing).

3. Microsoft Historical Digression

10 Nov 1999 - 16 Nov 1999 (31 posts) Archive Link: "Re: Getting IOCTL's into VFS File System Drivers"

Topics: FS: ext2, FS: procfs, Microsoft, Patents

People: Jeff V. Merkey, Alan Cox, H. Peter Anvin, Alexander Viro, Richard B. Johnson, Victor Khimenko, Mike A. Harris, Peter Samuelson, David Parsons, Kai Henningsen, Brandon S. Allbery

Jeff V. Merkey suggested that the functionality of 'fsck' be incorporated into the Virtual Filesystem layer, the way Windows 2000 did. This way there could be a single 'fsck' utility that would work on all types of filesystems, instead of requiring each filesystem to provide its own. He explained, "In the Microsoft Windows 2000 Implementation, it's implemented as a single function call that gets invoked automatically (or manually) when a volume mount fails. They did it this way so that all their Windows tools would work with the different file system drivers without having to need a specially written program for each file system. They also have hooks for file system defrag, and backup/restore integrated into their IFS. If we could extend the VFS to do the same types of things, it would mean Linux would only need a single set of "generic" file system repair, defrag, and backup/restore utilities that would work with all the FS drivers in Linux."

The idea was universally condemned. Alan Cox's reply was, "What you do on a disk problem is policy. Official unix religion #1 is that policy goes in user space," while H. Peter Anvin replied to Jeff, "You're kidding, right?! This is definitely *not* kernel stuff. fsck is the same complexity no matter where it lives, and it is complex enough that it has nothing to do in the kernel." And Alexander Viro also said to Jeff:

What will it buy? You still have to write all this code. You still can't (and shouldn't) do it on a filesystem that is mounted r/w. This code can be equally easy placed into userspace (where it _is_ now). It doesn't give you any win in fsck(8), since the current fsck _already_ doesn't care for filesystem layout - it just calls an appropriate fsck.<something>.

All the difference you'll get will be that e2fsck will be moved into fs/ext2, fsck.minix - into fs/minix, etc. I.e. complexity is the same, except that the code will (a) be permanently in core and (b) any bug will become more dangerous, since you are in ring 0.

So the question being: What For? We already have generic fsck. It sits in /sbin/fsck. We already have generic mkfs (/sbin/mkfs). What's the reason to move large chucks of code that feels perfectly OK in userland into the kernel? Automagical fsck upon mount? It's a bug, not a feature. That decision belongs to admin. Even if you want it (always want it, that is) you can trivially do it in mount(8) or in the script that calls mount.

If NT really does what you describe... Well, small wonder that it's so bloated.

There followed a humorous historical digression, after some lead-in: Michael Nelson replied to Alexander, saying that as far as he knew 'chkdsk' and the relevant DLLs were all user-mode code. Richard B. Johnson added, "both 95 and NT just do a chkdsk upon startup. Windows doesn't have the notion of "mount". Maybe Win-2000 will have, but nothing I've seen yet does -- and, if the file-system can't be repaired, you just lose everything and re-install windows, sometimes even if it was repaired. It depends upon the "service-pack" number and the alignment of planets."

Victor Khimenko replied, "What you are saying ? OF COURSE 9X & NT HAS mount syscall. There are no mount utility, it's right, but mount syscall is there. And it will return to you is chkdsk/scandisk is needed (that is driver was not unmounted cleanly). Just like in Linux. The only difference is that you can mount something only on drive letter and not in directory..."

Mike A. Harris replied with a smile:

Actually, from what I was just reading about Windows 2000, Microsoft is adding a new innovative feature that they single handedly came up with that allows you to "splice" a filesystem onto a directory point.

I read that the reason was to do away with drive letters. So, hats off to Microsoft for inventing this new concept of "splicing" filesystems onto directories.

Peter Samuelson smirked, "Patent pending, of course," and Victor added, "Even more interesting thing about this "innovation" is that MS DOS HAD ability to join filesystem onto a directory point (there was "join" command for that :-) This ability was removed from Windows95 and now we have new, innovative technology in Windows2000 ... Cool." David Parsons went on to say that 'join' was "at least 11 years old, and it's existed since MS-DOS 4.something. It's almost exactly mount, though MS-DOS didn't have a procfs to nicely export the mount table."

Alexander corrected him, saying that it had been around since 3.x, not 4.x; Alexander went on, "it required the mountpoint to be immediate subdirectory of root. Due to the way they've stored the namespace state the whole thing fscked up magnificently if you tried to work with the root of mounted fs via the old drive name. SUBST was less b0rken, though. And more useful, BTW - it gave weak equivalent of tilde-expansion. Mixing them was Bad Idea(tm). I've looked at 3.30 kernel - scary and fascinating. Obviously modeled after v7, but _what_ a mess had they slapped onto the upper half for CP/M emulation... Scary. It almost looked like a small subset of UNIX placed on a box with rather shitty IO and buried under the heaploads of CP/M compatibility crap. They might start with CP/M clone, but 3.x internals looked rather like a castrated and mutilated Xenix."

Kai Henningsen offered a correction to Alexander, pointing out that 'join' and 'subst' used the exact same mechanism, so that "mixing them up in certain combinations was impossible exactly because they used the same mechanism, which could only store one path for each drive - so either the drive was a subst, or it was joined somewhere (or neither), but you couldn't have both."

Regarding Alexander's statement that, "3.x internals looked rather like a castrated and mutilated Xenix," Kai replied, "That was true since 2.x, actually, and what they mutilated was Xenix. It was officially called "Xenix compatibility"."

Kai added, "There is a persistent rumour that the "\" thing was because one of the developers simply got things wrong by accident," to which H. Peter Anvin responded, "And it is also obviously bull. DOS 1.x (which was a pure CP/M clone) used / as the option character, so for compatibility they couldn't use it for paths... *especially* since DOS made it legal to type the option immediately adjacent to a pathname (COPY FOO BAR/V). DOS 2.x actually had an option to use - as the option character, which made it possible to use / as a pathname separator. DOS 2.x also had a kernel option to only recognize devices if the path was prepended with \DEV\ (or /DEV/), instead of polluting the namespace of every single directory. I believe OS/2 actually used this. To this day, every version of DOS 2.0 and later allows you to use / as the pathname separator in system calls -- but most utilities will see it as an option marker."

Brandon S. Allbery also replied to the "\" rumor, saying:

The ITT XTRA MS-DOS 2.11 Reference had a chapter on Xenix compatibility which explained this stuff in detail; I've never seen it in other DOS manuals.

CP/M, and therefore DOS 1.x, used / to signal parameters --- and the CLI knew about this, such that "FOO/X" would be parsed as a command "FOO" with an argument "/X". To maintain compatibility with DOS 1.x while providing an upgrade path to Xenix compatibility, the SWITCHAR could be set via a DOS call or CONFIG.SYS; if set to anything but "/", "Xenix-like" command parsing was used and DOS commands could be invoked with switches preceded by the specified SWITCHAR; 3rd party DOS programs were supposed to use another DOS call to get the SWITCHAR and use it appropriately. ("Xenix-like command parsing" meaning that "FOO-X" was not treated as command "FOO" with switch "-X" if the SWITCHAR was "-".) There was also AVAILDEV which, if set to NO, disabled "bare" device references such as "CON", and you had to use "\DEV\CON" or "/DEV/CON" instead; all the standard DOS 2.11 programs used the \DEV prefix internally so they would work with AVAILDEV=NO, and again 3rd party programs were supposed to query AVAILDEV and behave appropriately.

That chapter also stated that DOS 3.x would default to a SWITCHAR of "-" and would add limited multitasking features, and that DOS would gradually be migrated to full Xenix compatibility, followed by its being fully replaced by Xenix (!). I sometimes wonder what the computing world would be like if Microsoft had actually done this....

4. Some Explanation Of /proc

15 Nov 1999 - 18 Nov 1999 (12 posts) Archive Link: "Getting system info from the kernel"

Topics: BSD, FS: procfs, Ioctls, Networking

People: Alexander Viro

Jeff Buckey asked if there were any system calls to return the information available in '/proc/meminfo'. He was writing a program to display that information, and was currently using '/proc/meminfo', but had heard that relying on /proc files for program functionality was discouraged. Doug Alcorn replied that his impression was that /proc was provided specifically to allow user programs to access the data it contained, without relying on system calls. But Alexander Viro corrected him, saying:

The point of /proc is to avoid direct poking into the kernel memory. Moreover, in its current form it is highly non-portable. And that includes portability between 2.0/2.2/2.4. Please, _stop_ abusing it. Basically, _nobody_ promises that any given file in /proc will remain there in 2.4. Most likely 2.4 will keep compatibility symlinks, but even that is not guaranteed. If k3wl krapplications will break - too bad for them.

The thing is in flux right now. Some things will become sysctls, some will move into more reasonable places and some will simply die. The only documented parts are /proc/<pid>/*, /proc/self and /proc/sys. FWIC the rest is fair game.

In a reply to a reply, he added:

Let me put it that way: there is an area where nobody has decent (let alone portable) interfaces. Doing it _right_ would be nice and if we will get a clean filesystem-based solution it will not take much to take nullfs and roll the patches to *BSD. If some *BSD folks will fill the skeleton - fine, if not - their business. But *BSD lacks the thing just as we do. libkvm sucks too.

Decent namespace will fix most of our problems - we already have fs-based solution, but it's rather messy right now. Internal interfaces are getting more or less straight (we still have a crapload of interesting races, but they will be the next stage; most of intimate knowledge of procfs guts is gone from the rest of the kernel and that makes procfs fixes possible). But the namespace _is_ a mess and implementations of procfs methods are, should we say it, sometimes unorthodoxal (check what drivers/nubus/proc.c does. Or ISDN stuff. Or wanrouter). Ideally I'ld see more or less common format and a tree organised along the lines of buses hierarchy (kernel being the replacement for nexus ;-). But namespace stuff will go when we'll have the worst interface problems fixed and will have a list of animals that want to be there at all.

Elsewhere, he continued:

we had _no_ namespace policy in /proc. For many years. How would you like little gems like /proc/ip2mem or /proc/h8? No, the former has nothing to IP. And it has nothing to do anywhere near the root.

We have more or less regular parts of /proc - per-process stuff and /proc/sys (aka kernfs for BSD folks). The rest falls into several classes. There are "subsystem information" animals. They don't fit into sysctl() interface (albeit some of them might go there) and they are usually read-only. Not a big deal, except that we have 1001 format and _really_ messy namespace. Finding relevant thing may be very nontrivial. meminfo, loadavg, etc. would happily go there. There are oddballs that allow write() - usually it's a sysctl wanting to happen. And there are real horrors - come on, ioctl() on /proc file... (yes, we have such beasts; check /proc/mtrr for one). And there is kcore/ksyms/kmsg group. So far cleanup had been kernel-only, but now we are almost at the point where changes will become visible for userland. For the most of those files nobody will really care, but for some of them we will need compatibility links and they will stay for a while. As for the rest... Just about anything will be better than current ad-hockery.

Nobody talks about dropping the fs-based interface to this stuff and reverting to /dev/kmem games. But there is a serious need to reshuffle this bag of #@$! into coherent state.

5. Bug Identified From 2.1.0

15 Nov 1999 - 17 Nov 1999 (7 posts) Archive Link: "year old problem"

People: Andi Kleen

Eric Pouech pointed out that as of 2.1.0 (!), the debug registers were no longer saved in the TSS. Without this feature, it was impossible to do decent hardware-assisted debugging. Andi Kleen replied that the debug registers were saved and restored on a task switch. Eric pointed out that the task switch restored the debug registers, but no longer saved them. Andi replied that Eric had indeed found a bug, and there followed some implementation discussion, with no resolution on the list.

6. zoned 2.3.28-J5 Announced

16 Nov 1999 (4 posts) Archive Link: "[patch] zoned-2.3.28-J5"

Topics: Big Memory Support

People: Ingo Molnar, Chris Evans

Ingo Molnar announced zoned 2.3.28-J5, adding that it should fix nearly all known problems. Chris Evans asked what benefits the patch gave, and Ingo explained:

zones are separate physical memory (RAM) areas. eg. zones in a 6GB box look this way:

each zone is a 'pool of pages', with separate freelists and separate buddy bitmaps. The 2.2 allocator had everything in one big zone.

Higher order requests 'steal' DMA pages only as a last resort - previously GFP_DMA had to search for DMA-able pages by looking through all pages in a given page-list, plus normal GFP_ requests took DMA pages. So the zone allocator alone already gives much better GFP_DMA behavior, even on smaller boxes. In the future there will be GFP_DMA32 too.

The top-level structure is the 'zonelists' array, which contains a NULL-delimited list of 'target zones', in priority order. Eg. for GFP_HIGHMEM (which now covers the majority of allocations done in a Linux system) is { zone2, zone1, zone0, NULL }. For GFP_BUFFER it's { zone1, zone0, NULL }. The 'gfp_mask' parameter of the allocation functions is now an index into this 'zonelists' array, this gets resolved at compile-time in 99% of the cases.

Some other checks have been moved into the inlined part as well and get eliminated at compile-time. The page allocation entry points have been reduced to the minimum of 2 (formerly we had separate free_pages() and __free_page(), now it's all interfacing into __free_pages_ok()). The result is a more streamlined and lightweight page allocator. (despite the additional code it has). __free_page() is partly inlined now as well, the 'put_page_testzero()' thing is inlined, which is triggered in 60-70% of the cases.

the 'zonelists' array is generated runtime (once at boot), so systems which do not have highmem do not have to go through the empty zone every time.
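
As a rough illustration of the zonelist fallback Ingo describes, here is a minimal user-space sketch in C. The names struct zone, build_zonelist() and alloc_one_page() are hypothetical and not taken from the patch. A zone is reduced to two counters, and the real allocator's freelists and buddy bitmaps are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified zone: a pool reduced to two counters. */
struct zone {
    unsigned long total_pages;  /* 0 means the zone does not exist on this box */
    unsigned long free_pages;   /* pages still available for allocation */
};

/*
 * Build a NULL-terminated zonelist from candidates given in priority order,
 * skipping zones that have no pages at all (e.g. highmem on a small box).
 * Returns the number of zones kept. `out` must have room for n + 1 pointers.
 */
static size_t build_zonelist(struct zone **out,
                             struct zone *const *candidates, size_t n)
{
    size_t kept = 0;

    assert(out != NULL && candidates != NULL);
    for (size_t i = 0; i < n; i++) {
        if (candidates[i]->total_pages > 0)
            out[kept++] = candidates[i];
    }
    out[kept] = NULL;
    return kept;
}

/*
 * Take one page from the first zone in the list that still has a free page.
 * Returns the zone the page came from, or NULL if every zone is exhausted.
 */
static struct zone *alloc_one_page(struct zone *const *zonelist)
{
    assert(zonelist != NULL);
    for (; *zonelist != NULL; zonelist++) {
        if ((*zonelist)->free_pages > 0) {
            (*zonelist)->free_pages--;
            return *zonelist;
        }
    }
    return NULL;
}
```

Because the list is built once and empty zones are dropped at that point, the allocation loop never has to test for a missing zone. Lower-priority zones such as the DMA pool are touched only after the preferred ones run dry.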

7. Automatically Updating the RTC

16 Nov 1999 - 23 Nov 1999 (59 posts) Archive Link: "updating the RTC automagically"

Topics: Real-Time

People: Ulrich Windl, Riley Williams

Ulrich Windl did some research on what would be required to update the realtime clock (RTC) whenever system time changed. After a few heated words with Riley Williams, Ulrich added, "In case if you follow comp.protocols.time.ntp occasionally, you will find out that a lot of problems are related to problems where Linux does not update the RTC properly (e.g. when running localtime, not UTC). Let's say HP-UX 11.0 is a real UNIX if that helps you. I'm also aware that the RTC update code is basically unchanged since Linux- 0.99."

Riley replied:

I can understand the point you're making, but would have to point out that Linux is NOT the main culprit here - that 'honour' belongs to a certain MacroHard range of products.

To be blunt, where a system needs to dual boot between LoseSleep and *ANY* decent operating system, and LoseSleep is set to honour DST, then the RTC *MUST* be run in localtime rather than UTC/GMT as if it isn't, LoseSleep will trash it for you. If you want to get this fixed, persuade MacroHard to fix it as they're the only ones who can...

8. Historical Digression

16 Nov 1999 - 18 Nov 1999 (7 posts) Archive Link: "LK in a BK repository screen shots"

Topics: BSD, FS: UMSDOS, FS: ext2, PCI, Version Control

People: Riley Williams, Linus Torvalds, Larry McVoy, Paul Gortmaker

Larry McVoy gave some screenshots of BitKeeper hosting the entire history of Linux. Later in the thread, Riley Williams gave a pointer to his Linux history page, and said:

I'm just revamped most of those pages, but here's a quick analysis of the kernels and patches that currently appear to exist:

  1. There are lots of holes prior to 0.99.14 where released kernels appear to no longer exist. This includes most of the 0.99.13 subseries, of which only 0.99.13 and 0.99.13k now appear to exist.
  2. For most of the kernels prior to 0.99.13, patches were apparently never released, so one would have to use the individual kernel tarballs as the necessary sources.
  3. Although the actual kernel tarballs for 1.2.10 and 1.3.0 both still exist, the original patch between them does not. However, it would not be difficult to recreate it.
  4. With the sole exception pointed out above, all kernel tarballs and patch files from 0.99.14 to date still exist.

Based on the above, it should be possible to build up a repository showing the complete history of all kernels from 0.99.14 to date, but it would probably be meaningless to include any earlier kernels.

Note that the current set of links refer exclusively to kernels and patches available on and do *NOT* include any of the various author series (such as the -ac kernels). I will be adding other prime release kernels to that list as and when I find them on the Internet, but the pages include details of kernels and patches in my collection that are not on (identifyable by there being no link for the relevant "Version" or "Patch From" number respectively).

To item 3 in Riley's list, Paul Gortmaker pointed out that there had never been a patch between versions 1.2.10 and 1.3.0; he went on to quote Linus Torvalds' announcement from June 12, 1995:

Ok, I finally made a public release of 1.3.0, and it's available in the normal places ( and

Only full-source versions of the kernel exist right now: any patches are likely to be huge, as lots of things have been moved around.

NOTE! 1.3.0 is a development kernel, and if things don't work perfectly don't be surprised. Lots of patches that didn't go into 1.2.x because they were too risky are in 1.3.0.

That said, I naturally run 1.3.0 at my own machine, and it seems to be pretty stable. Knock wood.

Anyway, a _very_ rough list of "what's new" for 1.3.0 (just a general overview):

[Digression mode on: I'd personally like to see the new libraries be 1.3.0-specific and just leave the old libraries as a "stable" version together with 1.2.x, but I don't know if that is really practical, and it's up to hjl anyway.]

The axp patches change a lot of things all over the place, notably by splitting up the PCI handling into architecture-independent parts and moving include-files areound a lot to better fit a multi-architecture setup. The axp patches have also resulted in various cleanups, and doing things "right" to be able to handle it cleanly on different setups. Thus the directory reading code is now much cleaner, and the mmap() system call will follow normal unix semantics more closely by actually honouring the "where" argument for non-fixed mappings and trying to find an address that is close to the requested area.

One downside: the UMSDOS filesystem is disabled in 1.3.0, as the directory cleanups broke that temporarily. Expect this to be fixed in 1.3.1 or soon afterwards. If you rely on umsdos, don't get 1.3.0.

Alpha-people: as mentioned, don't expect this to compile cleanly on alpha's yet. The ext2fs 32/64-bit problem isn't cleanly resolved yet, and some other unclean axp-patches haven't been integrated. Others haven't been tested in their new incarnation (the PCI stuff, for example: David M-T, how does the new setup look to you?). And the latest alpha-diffs by others haven't been integrated at all.

So, there it is: comments and patches are welcome. I have been slightly overworked with all the mails/patches/updates after DECUS, so I may well have ignored your particular favourite patch. If so, keep on re-sending it if you think it really is needed. But remember: not all patches need to go into the very earliest 1.3.x releases, and we have about 95 patchlevels to go on this thing yet..

9. Dangerous Website Advocated In Spam

17 Nov 1999 (6 posts) Archive Link: "hey wassup KErnEL ;)"

Topics: Spam

People: Michael H. Warfield, Bernhard Rosenkraenzer, Gerhard Mack, Derek Martin

An unknown assailant posted a URL to a site, and Michael H. Warfield replied:

Warning... This is NOT spam. It is a cybermined hostile web site! Do not go there. It is designed to infect and damage windows systems. We've analyzed some of the pages on that site. They are designed to bypass security checks (pages build URL's to hostile pages by assembling Java Script variables) and run hostile scripts. The silly ape has one script which attempts to format all of your hard drives (but it starts out attempting to format c: - well duh!).

It should have little effect on our Linux systems, but I still would not go there with any active content (Java or JavaScript) enabled!

Bernhard Rosenkraenzer added, "Another warning: The sender of this SPAM is using its recipients in the From: field, as well. At least two of the messages so far were sent from my addresses. The people in the From: field are valid addresses, but not at all responsible for whatever is happening. (I know - two people already complained to my sysadmin about having received the message from me)."

Rick Franchuk analyzed some of the site's html and found that one page forced visitors to send spam in their own name. Rick pointed out that the site was apparently being served from; Gerhard Mack replied:

Uhh no offence, but instead of complaining to the list why doesn't somone document all of this and complain to the uplink?

14 ( 289.606 ms 218.194 ms 219.797 ms

15 ( 359.599 ms 228.010 ms 270.004 ms

16 ( 229.419 ms 238.185 ms 219.788 ms

Don't rant, protect the week and destroy the evil :) (if worst come to worst that's only a t1)

And Derek Martin said:

Well, the guy wasn't very bright. He sent it to an address that didn't exist, and the list got the bounce message from his relay.

From the headers:

|------------------------- Failed addresses follow: ---------------------|
<[email protected]> ... transport smtp: 550
<[email protected]>... User unknown
|------------------------- Message text follows: ------------------------|
Received: from ---(really []) by
via smtpd with smtp
id <[email protected]>
for <unknown>; Wed, 17 Nov 1999 03:03:37 +1100
(/\##/\ Smail3.1.30.13.Y2K #30.35 built 1-mar-01)

The address the relay received it from was, which nslookup says is:


It's one of their own customers. Call them and complain. There's probably some charge you could bring against them, like stealing computer resources or some such thing, but they're in Australia, so good luck.

10. Adaptec Quartet64 Driver For Linux

17 Nov 1999 (3 posts) Archive Link: "Adaptec Quartet64 (ana-62044) support for linux ?"

Topics: BSD, Networking

People: Anton Ivanov, Jes Sorensen, Donald Becker

Harald Evensen asked if the Adaptec Quartet64 was supported under Linux. Anton Ivanov replied:

I asked adaptec about a month anda half ago and got the usual blah, blah and Win + Novell praise and they were unable to answer anything about support under linux, solaris and BSD.

After that I asked this question on linux-network on Oct 13 and the answer from Donald Becker was yes. Check the linux-network archive for the full thread... (I am not subscribed to linux-net).

The driver for the new ones is not tulip (as per the old web page instructions on what google returns for you) it is the starfire:

Jes Sorensen added, "it's included in recent 2.3.x and works just fine (I have one of them here)."

11. When LVM And Others Will Go Into The Main Tree

19 Nov 1999 - 21 Nov 1999 (17 posts) Archive Link: "Re: Announce: LVM Patch against kernel 2.3.28"

Topics: Disk Arrays: LVM, FS: JFS, FS: ReiserFS, FS: ext2, FS: ext3

People: David Weinehall, Christopher Horn, Hans Reiser, Alan Cox, Heinz Mauelshagen, Linus Torvalds

David Weinehall asked when LVM (Logical Volume Manager) would be folded into the Linus Torvalds tree. He opined, "this is imho one of the most important things to go into the kernel, from an enterprise point of view." Ulrik De Bie replied that patches had been sent to Linus, but had not been acknowledged. Ulrik guessed Linus was pretty busy lately. Heinz Mauelshagen also said Linus had received it several times but had not replied. He suggested that maybe some noise from David would make a difference. Christopher Horn asked if anyone knew any reason why LVM should not be folded into the kernel, and said wistfully, "It would be a blessing, especially if the journaling Ext2 or Reiserfs stuff was also folded into 2.4 as well. The lack of a LVM and a JFS have unfortunately kept any serious Linux use out of our shop for a while now."

Hans Reiser replied, "I think that if you use the SuSE kernel you'll get a nicely patched well supported LVM for which we are developing a reiserfs resizer which SuSE will also support. (SuSE is a sponsor for ReiserFS.) I expect that LVM will eventually make it into the kernel, all of the FS developers that I know of for Linux have recommended that Linus add it. If you use a SuSE patched kernel you'll just get it somewhat earlier is all."

Alan Cox also replied to Christopher, saying, "I can see LVM getting into a standard kernel but not really ext3 (journalling ext2) or reiserfs. Ext3 adds stuff to the buffer cache behaviour that needs further figuring for 2.3.x to make Linus happy. Reiserfs exports half of the buffer cache into itself and includes extra C files in fs/buffer.c and has similar questions to solve. There is also a problem that right now neither ext3 or reiserfs can journal over software raid."

Hans and Alan had a bit of a respectful dispute over the problems of reiserfs, in which Alan said things like, "I'm not blaming anyone for it. Someone asked for a state of play," and Hans said things like, "It's your decision to make and I respect it."

12. Raw Vs. Medium Raw Keyboard Mode

19 Nov 1999 - 24 Nov 1999 (12 posts) Archive Link: "Mark keyboard RAW mode deprecated"

People: Linus Torvalds, Pavel Machek, David S. Miller

Pavel Machek said that raw keyboard mode no longer made any sense, given the diversity of hardware in the world. He posted a patch to warn users that raw mode was deprecated. David S. Miller objected that X used raw mode exclusively; but Pavel replied that it would be trivial to convert X to medium raw mode, and would be necessary to get Macs working properly. But to this, Linus Torvalds replied, "Why? Let people do whatever they want. I don't see the whole point of medium-raw being so incredibly superior." Pavel and others discussed the various problems for a bit, but at some point Pavel said, "Anyway, vojtech told me medium raw has big problem: limitation to 128 keys. Bad. So I'm stopping this." EOT.
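
The 128-key limit follows from the encoding, assuming the one-byte medium raw format in which the low seven bits carry the keycode and the high bit marks a key release. A small C sketch of that assumption (struct key_event, decode_mediumraw() and encode_mediumraw() are hypothetical helper names, not kernel functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded form of one medium raw byte. */
struct key_event {
    uint8_t keycode;   /* 0..127: only seven bits are available */
    bool    released;  /* true if the high bit (0x80) was set */
};

/* Split one medium raw byte into keycode and press/release state. */
static struct key_event decode_mediumraw(uint8_t byte)
{
    struct key_event ev;

    ev.keycode  = byte & 0x7f;
    ev.released = (byte & 0x80) != 0;
    return ev;
}

/* Pack a keycode and press/release state into one byte. */
static uint8_t encode_mediumraw(uint8_t keycode, bool released)
{
    /* A keycode of 128 or more would collide with the release bit. */
    assert(keycode < 128);
    return (uint8_t)(keycode | (released ? 0x80 : 0x00));
}
```

With one bit spent on press versus release, a single byte simply has no room for a 129th key, which is the limitation Pavel ran into.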







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.