
Kernel Traffic #5 For 11 Feb 1999

By Zack Brown

Introduction

There are many new interviews in our interviews section, including the IRC discussion with Linus Torvalds. We now include interviews with other free software folks. Help out by sending us some URLs.

If you'd like to be on our mailing list and receive announcements of new issues of Kernel Traffic, that's now possible thanks to our gracious host Mark Constable at opensrc.org. Thanks for everything, Mark!

Mailing List Stats For This Week

We looked at 1089 posts in 4159K.

There were 460 different contributors. 177 posted more than once. 193 posted last week too.

The top posters of the week were:

1. Capabilities And ACLs

3 Feb 1999 - 9 Feb 1999 (19 posts) Archive Link: "Re: linux capabilities and ACLs"

Topics: Access Control Lists, Big File Support, Documentation, FS: NTFS, FS: ext2, FS: ext3, FS: smbfs, Microsoft, POSIX, Samba

People: Chris Wedgwood, Ralf Corsepius, Albert D. Cahalan, Jakub Jelinek, Oliver Xymoron, Matthew Wilcox, David Weinehall, Rik van Riel, Thomas Pornin

This discussion began last week when David Watson volunteered to work on the kernel end of ACL functionality. ACL stands for Access Control List, and has to do with who gets access to what on the system. David posted primarily to indicate his interest and ask for pointers to documentation.

There was only a short discussion last week, including a staircase between Rik van Riel and Matthew Wilcox in which Matthew announced some very interesting introductory ext2 documentation, and Rik started the linux-doc mailing list (to subscribe, send a message to [email protected], with a body of "subscribe kernel-doc". The list itself is [email protected]).

Oliver Xymoron had a lot of suggestions for Matthew's document, resulting in a new draft by Matthew.

Meanwhile, Thomas Pornin doubted that ACLs would ever be implemented. He pointed out that currently ext2 does not implement them, and that the ACL field that might have been used has been co-opted to allow files bigger than 4 GB. Thomas felt it would be impossible to implement ACLs and still allow large files in ext2, and suggested it was time for ext3.

Jakub Jelinek piped up with the fact that the file size limitation only refers to directories, and that it would be possible to have ACLs with files greater than 4 GB, just not with directories that were that big (he also pointed out that such big directories really aren't needed).
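To make the arithmetic behind the 4 GB limit concrete, here is a minimal sketch. It assumes, purely for illustration, that an inode keeps its size in one 32-bit field and that a second 32-bit field (the one that might otherwise have pointed at an ACL) is reused as the upper half of the size. The names `file_size`, `split_size`, `size_low` and `size_high` are hypothetical and are not ext2 identifiers.

```python
# Illustrative model only: field and function names are hypothetical,
# not taken from the ext2 sources.
from typing import Tuple

MAX_32 = 2**32  # a 32-bit field can hold values 0 .. 2**32 - 1


def file_size(size_low: int, size_high: int = 0) -> int:
    """Combine two 32-bit inode fields into one 64-bit byte count."""
    if not (0 <= size_low < MAX_32 and 0 <= size_high < MAX_32):
        raise ValueError("each field must fit in 32 bits")
    return (size_high << 32) | size_low


def split_size(size: int) -> Tuple[int, int]:
    """Split a 64-bit byte count into (size_low, size_high)."""
    if not 0 <= size < 2**64:
        raise ValueError("size must fit in 64 bits")
    return size & (MAX_32 - 1), size >> 32
```

With only `size_low` available, the largest representable size is one byte short of 4 GB; a 5 GB file needs `size_high == 1`.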

That was last week. This week the thread continued with an interesting discussion of different ACL implementations. Ralf Corsepius pointed out that ACLs were in POSIX versions 1003.1e and 1003.2c, but those versions were dropped; Chris Wedgwood added, "I have it on good authority that the proposed POSIX ACLs are broken in such a way as they are not as secure [as] might otherwise be expected," and Ralf replied "Hm... Solaris and IRIX are afaik supposed to implement their ACLs quite close to that standard..."

Albert D. Cahalan came out with this shocker, also in response to Ralf's first post, "Considering the power (yes!) and popularity of NT ACLs, they would be a better choice than Netware or Digital Unix ACLs," and added, "BTW: the admin tools on NT do not expose the awesome power of the underlying architecture. There is more than meets the eye." NT being defended on linux-kernel!

David Weinehall suggested that Netware was an even more powerful filesystem, but Albert retorted, "I think you just don't like Microsoft. It is very bad to let your hatred of Microsoft cloud your judgement." He went on, "NT has an ACL model that can emulate the Netware and POSIX ones. It is good even before you consider smbfs, Samba, NTFS, and Wine," and added, "Roughly: the read-write-execute permissions are broken down into six basix permissions and some object-specific (file, dir, etc.) permissions. Access may be explicitly granted or denied. As far as I remember (don't make me boot NT), denial is applied last. You can mark files for auditing success or failure of various permission bits. Inheritance can apply to files, directories, both, or neither. Inheritance can be one-time-only."
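As a rough illustration of the "explicitly granted or denied, denial applied last" rule that Albert recalls, here is a minimal sketch. It is not the NT, POSIX, or Netware algorithm; the `Entry` type, the permission names, and `effective_perms` are all assumptions made up for this example.

```python
# Illustrative model only: it shows "grants first, denials last" and
# nothing more. All names are hypothetical.
from typing import Iterable, NamedTuple, Set


class Entry(NamedTuple):
    who: str          # user or group the entry applies to
    allow: bool       # True = grant, False = deny
    perms: frozenset  # permission names, e.g. {"read", "write"}


def effective_perms(acl: Iterable[Entry], identities: Set[str]) -> Set[str]:
    """Return the permissions left after applying grants, then denials."""
    entries = [e for e in acl if e.who in identities]
    granted: Set[str] = set()
    for e in entries:
        if e.allow:
            granted |= e.perms
    for e in entries:          # denials are applied last, so they win
        if not e.allow:
            granted -= e.perms
    return granted
```

Under this rule a user who belongs to a group with write access, but who is individually denied write, ends up without it regardless of the order of the entries.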

2. Linux On The Cyrix Cx486SRx2

4 Feb 1999 (6 posts) Archive Link: "Cx486SRx2 (Re: cx5x86mod)"

Topics: Modules

People: Rafael Reilova, Anthony Barbachan, Tom Sightler

Anthony Barbachan got a new toy in the form of a Cyrix Cx486SRx2 386 to 486 upgrade chip, and there was a little discussion about it on the list. He started off looking for the patch for a driver that enables the chip's 1K cache, and offered to write it if the old one was too old.

Tom Sightler said he had patched the kernel up to 2.2.1 and could make his patch available, but suggested doing it in user space instead, and pointed out where to find the tool.

At this point, Rafael Reilova interjected, "I have bad news. After going over the code I remembered why this was never added to the std. kernel a loong time ago when proposed. Many of the options made some older boards unstable. The issues are the same as when it was suggested to make the suspend_on_halt of the Cx686 default to on. IMHO, this will have to stay as a module or (preferably) a user space utility. Only workaround would be to generate a white-list of boards where the optimizations are safe, but these are old boards and hard to identify."

3. Hunt For A Crash Exploit

4 Feb 1999 - 7 Feb 1999 (24 posts) Archive Link: "[patch] fixed 2.2.1 inode-leakage due bogus design of the free_inodes algorithm [was [Re: [showstopper] Memory leak in 2.2.1]]"

Topics: Debugging

People: Linus Torvalds, Pavel Machek, Andrea Arcangeli, Oleg Drokin

Filesystems Microkernels

Andrea Arcangeli and Oleg Drokin had a long discussion on the list, with other folks coming in eventually. Oleg found he could crash the machine. Andrea couldn't reproduce it, so they had a back-and-forth, narrowing down the relevant differences in their systems.

Finally Andrea reproduced Oleg's lockup. He posted a patch, then right away responded to himself a few times with corrections. Oleg tried the patch. It wasn't perfect, but he could not reproduce the lock-up. Andrea followed up with a new patch, and Oleg said he had no more problems.

At one point Linus Torvalds berated Andrea for recommending, during the bug hunt, that Oleg try killing the update daemon to see how many inodes could be created. Linus said, "If you kill the update deamon, your screwed. Don't do it. If you do it, whatever happens is your own fault, and you only have yourself to blame," and added, "So your whole point is meaningless."

But Pavel Machek had this rejoinder: "Linus, please, don't do this. It is sometimes usefull to have update suspended. If you have a notebook, and your drives are spinned down, it is bad idea to run update."

4. 2.2.1 Slowdown And Bug Hunt

4 Feb 1999 - 5 Feb 1999 (10 posts) Archive Link: "Re: sluggish 2.2.1"

Topics: Debugging

People: Richard B. Johnson, Sang Kang, Rik van Riel, Steve Dodd

An interesting problem: how can 2.2.1 be sluggish compared to 2.2.0, if 2.2.1 was such a small fix? There was a bit of puzzlement this week, as some of the top developers hunted an elusive bug.

Sang Kang felt a slowdown with 2.2.1; Jim Woodward and Steve Dodd posted confirmations of that sense. Richard B. Johnson said, "I have reported some things that might be related. See if the interrupt count for things (like serial) is increasing and increasing, etc., without any reason. There may be a race somewhere. These continuous interrupts do slow things down."

Sang replied, "Whatever it is, it doesn't occur after I went back to 2.2.0." Richard responded with, "2.2.0 has an awful bug that was the main reason for the quick release of 2.2.1. There should not be any difference that could affect performance. The bug-fix was an off-by-one calculation problem in freeing pages." He added, "you will probably find that the mere re-booting of your system (not the kernel version change), temporarily fixed your problem."

Rik van Riel, responding to Steve Dodd, said, "I believe it's in memory management, and in particular buffer handling." He went on, "once there are enough dirty disk buffers, the I/O queue is clogged up and pages from the disk cache are stolen, including executable pages and important program data." Then, "something seems to have changed between 2.2.0 and 2.2.1 because the normal/buffer/cache ratio is different on these two kernels. I haven't had time to investigate this more, though :("

5. Debugging Session

5 Feb 1999 - 8 Feb 1999 (54 posts) Archive Link: "Linux-2.2.2-pre2.."

Topics: Debugging, Development Strategy, FS: FAT, SMP, Scheduler

People: Linus Torvalds, Philippe Troin, Theodore Y. Ts'o, Andrea Arcangeli, Oleg Drokin, Alexander Viro

Let us know if you think this article is too long. The thread itself was an interesting, good example of the tone of a debugging session on linux-kernel. Those with something to say, said it clearly; mistakes were admitted; connections made; the kernel, advanced.

Linus Torvalds announced a new prepatch, and explained some of his thinking behind releasing prepatches: "None of the fixes are critical to most people, but all of them _can_ be critical to people who have seen vulnerabilities in the area. As such, if you're happy with 2.2.1 there is no pressing reason to test this patch out, but I hope to have the pre-patches so that the final 2.2.2 can be left around for a while (CD-ROM manufacturers etc would certainly prefer to not see lots of releases)."

As usually happens with release announcements that come at the start of a thread, this turned into one of the biggest threads of the week. Less usually, Linus wrote almost half of its posts.

There were two main prongs of discussion. Philippe Troin started a debugging session with, "The enclosed program will kill any multi-processor SMP machine... Dorry for reposting this again, but I've been tracking down this for too long without getting any attention so far..."

Some folks couldn't immediately reproduce Philippe's effect. Linus and Philippe started a staircase in which Linus offered a patch to Philippe's program, then tried to explain part of the situation: "the hangup is done asynchronously at the next scheduling point, so the hangup even can actually happen in a random process context. So it doesn't matter who closes the file, the actual hangup might even happen on another CPU.." He added, "this is why we've had various problems with hangup in the past too: it's just so asynchronous."
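The point about asynchrony can be pictured with a toy model. The sketch below assumes a simple queue of deferred callbacks that is drained by whichever process reaches a scheduling point next; the names `pending`, `queue_hangup` and `schedule` are hypothetical and do not mirror the kernel's task queue interface.

```python
# Illustrative model only: a toy "run at the next scheduling point" queue.
# Names are hypothetical and do not mirror the kernel's task queue API.
from collections import deque

pending = deque()   # deferred callbacks waiting for the next schedule()
log = []            # (what ran, in whose context), kept for inspection


def queue_hangup(tty_name):
    """Request a hangup; nothing runs yet."""
    pending.append(lambda ctx: log.append(("hangup " + tty_name, ctx)))


def schedule(current):
    """Whichever process calls schedule() next runs all deferred work."""
    while pending:
        pending.popleft()(current)
```

If process A closes the file and calls `queue_hangup("ttyS0")`, but process B is the first to call `schedule("process B")`, the hangup is recorded as having run in B's context, which is the sense in which it "doesn't matter who closes the file".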

In the same post, Linus appealed to Theodore Y. Ts'o to look into the problem. Ted came in a couple days later with a bit of explanation, and the suggestion, "the other option is to create a new task queue which is dedicated for handling tty hangups, and then change tty_hangup to use that new task queue. That may be cleaner long-term solution." In the two days between Linus' and Ted's post, however, there were more than 20 posts.

At first, that discussion was a bug hunt consisting of much speculation on where the problem might be; Andrea Arcangeli stole the conversation at a certain point, and from then until Ted's post (about 10 posts), it was just Andrea and Linus.

Andrea wrote, "I just discovered the bug yesterday evening (but I had not the time to fix it)." The bug, as he saw it, was "that do_tty_hanghup() does a lock_kernel() in a tq_scheduler with a current->lock_depth > -1. So lock_kernel() does nothing there and so do_tty_hangup() was racing with the process that run the schedule() with the kernel lock just held."
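One way to read Andrea's explanation is through a small model of a recursive lock tracked by a per-task depth counter. The sketch below is an assumption-laden illustration, not the kernel code: the `Task` class and the bodies of `lock_kernel` and `unlock_kernel` are invented here, and only the idea that a depth greater than -1 means "this task already holds the lock" comes from the quote.

```python
# Illustrative model only: a recursive "big lock" tracked per task by a
# depth counter, in the spirit of the lock_depth mentioned in the quote.
# Class and function bodies are hypothetical.


class Task:
    def __init__(self, name):
        self.name = name
        self.lock_depth = -1   # -1 means this task does not hold the lock


owner = None  # task currently holding the global lock, if any


def lock_kernel(current):
    """Take the lock; return True only if it was newly acquired."""
    global owner
    if current.lock_depth > -1:      # already held by this task
        current.lock_depth += 1      # just nest one level deeper
        return False
    assert owner is None, "a real lock would spin here"
    owner = current
    current.lock_depth = 0
    return True


def unlock_kernel(current):
    """Drop one nesting level; release the lock at the outermost level."""
    global owner
    current.lock_depth -= 1
    if current.lock_depth == -1:
        owner = None
```

In this model, deferred work that runs in the context of a task already holding the lock gets `False` back from `lock_kernel()`: the call nests instead of excluding anyone, so it offers no protection against that same task's half-finished work.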

At first, Linus didn't understand Andrea's explanation, and there was a bit of back and forth, until Linus said, "Ahh, now I see your argument," and added, "Good point, sorry for not catching on to what you meant."

They went on to further isolate the bug and find the proper approach to fixing it. Finally, Ted added his explanation of what he'd intended in the code, and that apparently ended that prong of the thread.

The second prong of the thread started with Oleg Drokin. He expected unused inodes to be freed, and found that this was not happening. He posted an exploit demonstrating the behavior, which seemed wrong to him.

Linus explained, "The inode numbers can easily grow past the "maximum", but once it reaches the maximum the growth should be stunted and controlled. [Oleg's exploit] is exactly the kind of behaviour you should expect: inodes freely grow until they hit the max number, and then the growth should slow down quite noticeably. Think of "max" as a soft limit rather than a hard one."
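The "soft limit" idea can be sketched as follows. This is a guess at the general shape of such a policy, not the kernel's inode allocator: the `InodePool` class and its rule (grow freely below the limit, recycle unused inodes first once at or above it, and still grow if nothing can be recycled) are assumptions for illustration.

```python
# Illustrative model only; the policy and names are assumptions, not the
# kernel's actual inode allocator.


class InodePool:
    def __init__(self, soft_max):
        self.soft_max = soft_max
        self.in_use = 0
        self.unused = 0      # allocated earlier, currently free for reuse

    @property
    def total(self):
        return self.in_use + self.unused

    def get(self):
        """Hand out an inode, recycling first once the soft limit is hit."""
        if self.total >= self.soft_max and self.unused > 0:
            self.unused -= 1     # at the limit: recycle instead of growing
        # otherwise the pool simply grows by one
        self.in_use += 1

    def put(self):
        """Release an inode back to the unused list."""
        self.in_use -= 1
        self.unused += 1
```

The total can exceed `soft_max` when every inode is in use, which is what distinguishes a soft limit from a hard one.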

Oleg asked, "Yes. I can understand that. But why number of inodes only grows up?! Ain't when we need some memory we must shrink inode pool and give freed memory to those who need it?" to which Alexander Viro answered, "Because It's Done That Way (tm). The simplest way to fix it would probably be switching to slabs instead of using raw pages."

Alexander then had a long staircase with Linus. In that reply to Oleg, Alexander asked, "Linus, do you have any objections against it? If it's OK for you I'll roll a patch and submit it."

Linus replied, "We had it already for a short moment in the 2.1.x series, and it simply didn't work very well. It had all the slab problems with multi-page allocations, and inodes also have very hard-to-predict allocation lifetimes, so what happened was that when you allocated a lot of inodes, the pages were almost never freed back to the page pool, because there was often a few inodes holding things locked down anyway." He added, "It's a long time ago, but I essentially reverted it within a few releases because _I_ had problems with it."
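The effect Linus describes, where a few long-lived inodes keep whole pages from being returned, is easy to show with numbers. The sketch below assumes eight objects per page, a figure chosen only for the example, and the function name `freeable_pages` is hypothetical.

```python
# Illustrative model only: objects are packed several to a page, and a page
# can go back to the page pool only when every object on it is free.

OBJECTS_PER_PAGE = 8   # assumed packing density, chosen for the example


def freeable_pages(live_objects, total_objects):
    """Count pages holding no live object, given the live object indices."""
    pages = (total_objects + OBJECTS_PER_PAGE - 1) // OBJECTS_PER_PAGE
    pinned = {index // OBJECTS_PER_PAGE for index in live_objects}
    return pages - len(pinned)
```

With 800 objects on 100 pages, 100 survivors packed together pin 13 pages and leave 87 freeable, while the same 100 survivors spread one per page leave none freeable even though seven objects in eight are gone.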

Alexander also said, "Another question: we still have a race in rename() on UNIX filsystems (d_invalidate() stuff). It's pretty minor, so... Would you accept it or it would better wait till 2.3? I do not dare to submit the FAT-related piece - it's (a) too serious chunk and (b) I'm still not satisfied with testing it got here."

Linus didn't remember the problem, and Alexander explained, "If rename() is going to overwrite an existing directory it should check that it's empty and (obvious race prevention) nobody else uses it. Current code does shrink_dcache_parent() on the victim and then checks d_count. It is not enough, since we are leaving the victim hashed and anybody can grab it while we are checking emptiness and start to mess with it."
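The window Alexander describes is a check-then-act race. The toy model below uses a dictionary in place of the dentry hash and lets an "intruder" lookup run between the check and the act. Everything here is hypothetical, including the `unhash_first` option, which is shown only as one conceivable way to close such a window and is not claimed to be the fix that was adopted.

```python
# Illustrative model only: a dict stands in for the dentry hash, and the
# names below are hypothetical rather than VFS functions.

hashed = {}   # name -> directory object reachable by lookups


class Dir:
    def __init__(self):
        self.users = 1        # the rename itself holds one reference
        self.entries = []


def lookup(name):
    """Another process grabbing the directory, if it is still reachable."""
    d = hashed.get(name)
    if d is not None:
        d.users += 1
    return d


def check_survives_race(name, unhash_first, intruder=lookup):
    """Check the victim, let an intruder run, report if the check still holds."""
    victim = hashed[name]
    if unhash_first:
        del hashed[name]      # lookups can no longer find the victim
    looks_ok = victim.users == 1 and not victim.entries
    intruder(name)            # someone else runs between check and act
    return looks_ok and victim.users == 1
```

With the victim left hashed, the intruder's lookup succeeds and the earlier "nobody else uses it" check is no longer true by the time the rename acts on it.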

Linus replied, "I have to admit that it starts to smell like this should all be done in common code at the VFS layer rather than each filesystem having to know about this fairly subtle race. I designed the VFS locking exactly in such a way that the filesystems themselves wouldn't have to care about the directory tree consistency issues," and asked, "Would you mind looking into something like that? I'd be grateful."

Alexander jumped up with, "Heh. I would be more than grateful if you'll allow me to do that. rename() is *the* ugliest namespace-related method. Having all generic tests in VFS would make life much easier when we'll go for making VFS SMP-safe. BTW, the less parts of VFS are scattered over all fs drivers the more inpenedent all filesystems become. I.e. less pain in ass for folks maintaining AFS, ARLA, DMSDOS, r/w HPFS, etc."

Linus came back with, "I'll give you clearance, not for 2.2.2, but you wouldn't get it done by then anyway, so 2.2.3 would be your target. HOWEVER, I'd ask you to actually go one step further: the VFS layer should separate the case of renaming a subdirectory from the case of renaming a regular file." He went on with a description of what he wanted, and ended with, "If we have to change rename (and all low-level filesystems) to fix the race, let's just fix it once and for all, and separate the two rename cases properly."

Alexander asked for clarification ("Could you elaborate? I'ld see the point if we would split the *method*, but I don't see the reason for splitting the VFS-level code") and Linus explained what he wanted, adding, "Basic rule: make it as complex as you have to, but no more."

Alexander rejoined with, "That's why I want to do this thing. It makes the situation (and code) less complex. Anyway, it's *not* a 2.2.early issue and I think that it's not a 2.2 issue at all. If we'll have rename() serialization in VFS we'll be able to change it without touching (and breaking) filesystems. I'm glad that this pain in ass will go away now. If you really want to get a description of the lookup atomicity stuff I can sit down and turn my notes and comments into the coherent text in a week or so, but I don't think that time is right. I'ld rather wait with it at least until March/April. Save tomorrow for tomorrow."

There were a couple more posts on implementation details, and that ended the thread.

6. Heated Developer Argument

6 Feb 1999 - 9 Feb 1999 (36 posts) Archive Link: "Linux Graphics Architecture (format fixed)"

People: Linus Torvalds

Linus Torvalds stayed out of this one. It consisted of almost 40 posts of contention and argument that steered just clear of a flame war. Actually a lot of it was interesting, even when tempers seemed on the brink of flaring up.

7. Framebuffer Bug Hunt

6 Feb 1999 - 7 Feb 1999 (10 posts) Archive Link: "Framebuffer bug"

Topics: Debugging, Framebuffer

People: Arvind Sankar

This thread's ten posts took place in a single day, and were almost exclusively a staircase between Arvind Sankar, who started the thread, and Navindra Umanee.

Arvind started with, "I've got kernel 2.2.1 running on x86. With the atyfb framebuffer driver enabled, if I run an X server while the console is already displaying another X, the machine locks solid."

Navindra tried to reproduce the lockup, and they went back and forth for a while in an effort to synchronize their systems. Nothing worked, and the thread died.

8. Linux Kernel Mirrors

7 Feb 1999 - 8 Feb 1999 (7 posts) Archive Link: "Why isn't 2.2.2-prex on the mirrors?"

Topics: Source Distribution

People: Roman Drahtmueller, H. Peter Anvin

David C.S. Prior couldn't find the latest pre-releases, so some folks pointed him to where they were. In the course of this, Roman Drahtmueller said, "All _official_ kernel mirror sites (LKAMS == Linux Kernel Archive Mirror System) sync' with kernel.org very soon after a new file arrives in the tree. Thanks to Peter Anvin who made it possible," to which H. Peter Anvin replied, "And *huge* thanks to all the sites that have volunteered their time, machines and bandwidth to help out. 58 sites and counting..."

(ed. Thanks go to Hartmut Niemann who told KT that the informational point of the thread was that pre-releases can be found on all mirrors, in the /pub/linux/kernel/testing directory.)

9. Process Scheduling

7 Feb 1999 - 11 Feb 1999 (20 posts) Archive Link: "Real Time scheduler?"

Topics: BSD: FreeBSD, Real-Time, SMP, Scheduler

People: David S. Miller, Rik van Riel, Peter Steiner, Andrea Arcangeli, Kurt Garloff, Victor Yodaiken, Shawn Leas, Ingo Molnar, Albert D. Cahalan

David S. Miller wrote, "Are there any patches available to incorporate a poor-mans real time function into the scheduler. FreeBSD's "rtprio" is what I'm thinking of. Basically it runs all "real time" processes in time shared fashion before running any "non real time" processes."

Shawn Leas put Rik van Riel's name forward as someone working on this, and Rik replied, "I am (or rather, will be once 2.2 stabilizes) working on the exact opposite: SCHED_IDLE processes which only run when the system's got nothing else to do," and added, "as for realtime, the Linux kernel has had poor-man's RT support for ages (3 or 4 years, IIRC)..."
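The two ideas in play here, "real time" processes that always run before ordinary ones and idle processes that run only when nothing else wants the CPU, can be pictured as strict scheduling classes. The sketch below is a toy model under that assumption; it is neither FreeBSD's rtprio nor any Linux patch, and the names `RT`, `NORMAL`, `IDLE` and `pick_next` are made up.

```python
# Illustrative model only: three strict classes, not the kernel scheduler.
from collections import deque

RT, NORMAL, IDLE = 0, 1, 2   # lower number = served first


def pick_next(run_queues):
    """Pick from the highest non-empty class, round-robin within it."""
    for cls in (RT, NORMAL, IDLE):
        queue = run_queues[cls]
        if queue:
            task = queue.popleft()
            queue.append(task)      # rotate: time-share inside the class
            return task
    return None
```

As long as any task sits in the `RT` queue, nothing in `NORMAL` or `IDLE` is picked; tasks in `IDLE` run only when both other queues are empty.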

Peter Steiner broke in with, "I already have something like that [Rik's SCHED_IDLE]. It's a modification of how niceness values are interpreted. It uses a range of 11 niceness values to get the processor, e.g. when there's a process running at nice=0 then no proces with nice=11 or higher will get the cpu. On the other hand, if there's a process running at nice=-11 (eg. timidity or mikmod) then no 'normal' process will get the cpu. Processes started just with nice will still get a little bit of the cpu as usual (that's why the range is 11 and not 10)."
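Peter's rule can be reconstructed from his description alone: only processes whose niceness lies within a window of 11 values starting at the most favoured runnable process may compete for the CPU. The sketch below is that reconstruction, not his patch, and the names `NICE_WINDOW` and `eligible` are hypothetical.

```python
# Illustrative model only, reconstructed from Peter's description; it is
# not his patch.

NICE_WINDOW = 11   # width of the niceness range that may compete


def eligible(nice_values):
    """Return the niceness values that may get the CPU right now."""
    if not nice_values:
        return []
    best = min(nice_values)   # most favoured runnable process
    return [n for n in nice_values if n - best < NICE_WINDOW]
```

With a process at nice=0 runnable, a process at nice=10 (the default increment of the `nice` command) still competes, while nice=11 and above are shut out, which matches the reason Peter gives for choosing 11 rather than 10.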

According to Andrea Arcangeli, Ingo Molnar also has a SCHED_IDLE patch.

Albert D. Cahalan had concerns about kernel locking with something like what Peter had written. Peter replied that he hadn't been concerned with deadlocks when writing the patch, and came up with an exploit that might take advantage of it.

There were a few replies to Peter. Rik said, "I've experienced a lockup like that ONCE in 5 months," and went on, "For me, this is serious enough to care about. I'll have a new (totally new) patch once the SMP reboot and APCI annoyances are fixed (rebooting the system now often takes 5 tries so I'm not patching at the moment :)."

Andrea also replied, saying he could reproduce the lockup in "two seconds". He also suggested that a proper fix would be (rather than giving each process a minimal amount of cpu-time) "to see where the process came from."

Finally, Kurt Garloff replied to Peter with some interesting info about the exploit, "Problems like this one are known as PRIORITY INVERSION: The process with higher priority (=lower nice values) waits for a resource (or data) hold by a low priority process. Now, effectively, the high priority process is running at the priority of the low prio process. If the low prio process may never get scheduled (SCHED_IDLE), this is particularily bad, as the process could be stalled."

He added, "The solution is PRIORITY INVERSION: If a high prio process is waiting for something a low prio porcess provides, the low prio process should temporarily get the high prio. If the processes are passing data through the kernel or using kernel's resources, it should be detected by the kernel. If data is passed in memory (shared mem, e.g.), it's up to the userspace implementation to take care."

Victor Yodaiken disagreed with Kurt's solution, and wrote, "You mean "PRIORITY INHERITANCE" and this breaks the system in many many ways. First question is what "temporarily" means and there are several different answers, each is wrong in its own way." He included his own exploit: "Consider Process 0 ... 255 share a 1000 file descriptors in some Lowest priority process P0 does a select on FD {0} and is preempted by P1 which does a select on FD {0,1} blocking and passing priority to P0, which is then prempted by P2 which selects {2,3} etc. Good thing the kernel has nothing else to do but assist in computing this rolling deadlock. And then suppose that P0 has "alarm" set for 2 minutes and aborts its select halfway into this process. Time for the kernel to get really busy! We won't have to be jealous of the performance offered by NT anymore," and finished with, "And then consider how a kernel that depends on the operation of many daemon processes will continue to operate when users can introduce arbitrarily many "RT" processes that can block daemons indefinitely."
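For readers unfamiliar with the terms, the boosting scheme Kurt proposes, which Victor names priority inheritance, can be sketched in a few lines. This is a toy model under stated assumptions: lower numbers mean higher priority, as with nice values, and the `Proc` class and its `waiters` list are hypothetical. It says nothing about the costs Victor objects to.

```python
# Illustrative model only: lower number = higher priority, as with nice.


class Proc:
    def __init__(self, name, nice):
        self.name = name
        self.nice = nice
        self.waiters = []      # processes blocked on something we hold

    def effective_nice(self):
        """Own niceness, boosted to that of the most urgent waiter."""
        return min([self.nice] + [w.effective_nice() for w in self.waiters])
```

While a nice=-5 process waits on a nice=19 process, the latter is scheduled as if it were nice=-5, and the boost disappears as soon as the waiter is removed. The recursion follows chains of waiters, so a cycle of waiters (a deadlock) would never terminate in this sketch.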

10. Philosophy Of Binary-Only Modules

3 Feb 1999 - 9 Feb 1999 (62 posts) Archive Link: "Re: Kernel interface changes (was Re: cdrecord problems on"

Topics: Binary-Only Modules

People: Linus Torvalds

This was a big thread, but one post stands out. Linus Torvalds writes:

I _refuse_ to even consider tying my hands over some binary-only module.

Hannu Savolainen tried to add some layering to make the sound modules more "portable" among Linux kernel versions, and I disliked it for two reasons:

Note that the second point is mainly psychological, but it's by far the most important one.

Basically, I want people to know that when they use binary-only modules, it's THEIR problem. I want people to know that in their bones, and I want it shouted out from the rooftops. I want people to wake up in a cold sweat every once in a while if they use binary-only modules.

Why? Because I'm a prick, and I want people to suffer? No.

Because I _know_ that I will eventually make changes that break modules. And I want people to expect them, and I never EVER want to see an email in my mailbox that says "Damn you, Linus, I used this binary module for over two years, and it worked perfectly across 150 kernel releases, and Linux-5.6.71 broke it, and you had better fix your kernel".

See?

I refuse to be at the mercy of any binary-only module. And that's why I refuse to care about them - not because of any really technical reasons, not because I'm a callous bastard, but because I refuse to tie my hands behind my back and hear somebody say "Bend Over, Boy, Because You Have It Coming To You".

I allow binary-only modules, but I want people to know that they are _only_ ever expected to work on the one version of the kernel that they were compiled for. Anything else is just a very nice unexpected bonus if it happens to work.

And THAT, my friend, is why when somebody complains about AFS, I tell them to go screw themselves, and not come complaining to me but complain to the AFS guys and girls. And why I'm not very interested in changing that.


Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.