
Kernel Traffic #34 For 13 Sep 1999

By Zack Brown

Introduction

Thanks go to Peter Van Eynde, who wrote in a correction to last week's issue. See his comments in red, at the bottom of Issue #33, Section #12 (23 Aug 1999: Explanation Of Some Complex Assembly). Thanks, Peter!

The migration from http://www.kt.opensrc.org to http://kt.zork.net is nearly complete. Mark has set up redirects to the Linuxcare pages. If you haven't updated your bookmarks, you should. Also, as a result of the migration the mailing lists are no longer functional. I'll let you know when the new ones get set up.

In kernel news, as of 2.3.18 Linus Torvalds has declared a feature freeze! Next week's KT will cover that discussion (unless it's still going on). It looks like Linus is serious about shortening the release cycle. Here's his post:

Linux-2.3.18 is out there now, and it also marks the beginning of the long-promised feature freeze for 2.4.x. To make that freeze more effective, I'm taking two weeks off, just so that you simply CANNOT tempt me with features.

Thanks to David Hinds and others for the last weeks of merging of PCMCIA etc, I'm now officially happy.

Note that feature freeze is different from code freeze. We'll still do updates of drivers etc without being too anal about it, and even completely new drivers (or possibly filesystems) etc may possibly be accepted as long as they don't impact _anything_ else and don't imply a completely new approach to something. Drivers in particular tend to be updated even in the stable kernel, after all.

But expect me to be less than enthusiastic about even new drivers. New ideas for core functionality are right out.

The feature freeze should be turning into a code freeze in another two months or so, and a release by the end of the year. And as everybody knows, our targets never slip.

And as I said, don't even bother emailing me for the next two weeks, because you won't be reaching me anyway, and the mail accumulated over the two weeks will be unceremoniously dumped into a toxic waste container, to be buried in concrete somewhere at sea. Never to be opened again, in short.

Mailing List Stats For This Week

We looked at 995 posts in 4326K.

There were 415 different contributors. 174 posted more than once. 160 posted last week too.

The top posters of the week were:

1. VMWare Discombobulates The System

28 Aug 1999 - 1 Sep 1999 (6 posts) Archive Link: "2.2.12 doesn't lock cdrom drive door"

Topics: Ioctls

People: Walter Hofmann, Alan Cox, Jens Axboe

Walter Hofmann complained about 2.2.12 failing to lock his CDROM drive door when mounted, but under the Subject: Re: linux-kernel-digest V1 #4372, he reported, "This only happens after someone has used vmware on the computer. OTOH it is strange that Linux can't lock the door even after vmware has terminated." Alan Cox explained, "Vmware isnt a unix application constrained by the OS. It does hardware level stuff of its own," but Jens Axboe put in, "I think that VMWare actually disables door locking and control it manually with an ioctl to get complete control over it. That's what they told me, anyway, but not having the source does not allow me to check. It sounds like that is causing problems - I bet that they never re-enable door locking again. Take the problem up with them," and Walter replied that he had filed a bug report via their web form.

2. Kernel Crypto Issues

28 Aug 1999 - 3 Sep 1999 (29 posts) Archive Link: "idea: MAC level compression & crypto"

Topics: Compression

People: David Woodhouse, Alan Cox

In the course of discussion, Alan Cox mentioned that crypto hooks still can't go in the main kernel tree because of US laws. Elmer Joandi was surprised that even hooks were prohibited, since they weren't actually crypto.

David Woodhouse explained:

The US Government, in its wisdom, has decreed that crypto hooks aren't allowed to be exported either.

Presumably that's because without the hooks, we foreigners aren't even clever enough to link in the existing crypto we've already smuggled from the US, so that rule will stop us from using crypto just as effectively.

However, I believe you're allowed to export hooks if they're for compression. Apparently we're not clever enough to plug crypto modules into compression hooks either :)

There was a discussion of the unenforceability and naivete of the US laws, and how those laws are getting worse, not better, even though countries like France have abandoned similar crypto policies.

3. Linus Opposed To Raw IO

28 Aug 1999 - 1 Sep 1999 (3 posts) Archive Link: "Streaming disk I/O: don't use raw, limit bufs per device/partition"

Topics: Raw IO

People: Linus Torvalds

In the course of discussion, Linus Torvalds said:

I do not believe in raw IO - even for streaming audio it's just too common for the data to have been available in the cache, and by using raw IO you (for absolutely no good reason) just made the machine do more IO than it should have.

There are very specific cases where the application knows that its dataset is larger than physical memory, but those tend to be limited to quite large problems. And they're getting larger.

But having a better way to decide what to throw out - that I am a strong believer in.

4. ReiserFS Nears Readiness; Difficulties Discussed

29 Aug 1999 - 1 Sep 1999 (18 posts) Archive Link: "vm kills processes in our 2.3.12 port of reiserfs - what was the story on the changes to mark_buffer_dirty() and the too many dirty buffers issue?"

Topics: Code Freeze, FS: NFS, FS: ReiserFS, FS: ext2, SMP, Virtual Memory

People: Andrea Arcangeli, Hans Reiser, Theodore Y. Ts'o, Linus Torvalds

Hans Reiser had reiserfs running on 2.3.12, but under 'dbench', processes were being killed when their memory requests failed. Hans remembered that some 2.2.x code in mark_buffer_dirty(), intended to address that very issue, had been dropped from 2.3.x. The problem did not occur with ext2fs in place of reiserfs, and Hans also asked why the VM chose to kill rather than stall the offending processes. Andrea Arcangeli posted a patch that he felt would solve the problem, and explained:

I think it's because mark_dirty_buffer doesn't enforce a limit in the grow of dirty buffers in the system.

Baically the buffer code checks if there are too much dirty buffers only for the data writes (because data writes usually pass through block_write_partial_page or the other equivalent filesystem-helper functions in 2.3.x).

So if your filesystem writes 120Mbyte of metadata before the first data write, then you can fill the buffer cache with 120mbyte of dirty buffers without blocking the application waiting for some write-I/O completation. This unlimited grow of unfreeable memory in cache may lead to the VM subsystem to be not able to recycle the cache in time and so the tasks that needs memory will be killed (as happened in the early 2.2.x).
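
As a rough illustration of Andrea's point, here is a toy model in Python (not kernel code). The class, its method names, and the numeric limit are all hypothetical; the only idea taken from the thread is that one path dirties buffers without checking a limit, while the other makes the writer wait for write-back first.

```python
class ToyBufferCache:
    """Toy model of a buffer cache; every name here is hypothetical."""

    def __init__(self, dirty_limit):
        self.dirty_limit = dirty_limit  # assumed cap on dirty buffers
        self.dirty = 0                  # buffers currently dirty
        self.peak_dirty = 0             # highest dirty count seen so far
        self.flushes = 0                # times a writer had to wait

    def _note_dirty(self):
        self.dirty += 1
        self.peak_dirty = max(self.peak_dirty, self.dirty)

    def mark_dirty_unchecked(self):
        """Metadata-style path: dirty a buffer with no limit check."""
        self._note_dirty()

    def mark_dirty_checked(self):
        """Data-style path: wait for write-back first if at the limit."""
        if self.dirty >= self.dirty_limit:
            self.flush()
        self._note_dirty()

    def flush(self):
        """Stand-in for blocking until write I/O completes."""
        self.dirty = 0
        self.flushes += 1


def simulate(writes, dirty_limit, checked):
    """Dirty `writes` buffers through one of the two paths."""
    cache = ToyBufferCache(dirty_limit)
    for _ in range(writes):
        if checked:
            cache.mark_dirty_checked()
        else:
            cache.mark_dirty_unchecked()
    return cache
```

Comparing simulate(1000, 100, checked=False) with simulate(1000, 100, checked=True) shows the unchecked path letting the dirty count grow with the workload, while the checked path keeps it at or below the limit at the cost of stalling the writer.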

In the course of discussion, Hans added that this was the final issue preventing reiserfs from getting into 2.3.x; and that journalling was late ALPHA and almost ready for actual use. Elsewhere, he added, "We have hired an SMP specialist, and he is going to dump all our schedule tracking crud that currently causes us to have our own version of getblk, etc., but it will take him time to do that, as he is new," but he also expressed some dismay, "I think it is more than a little unfair to let the ext2 folks rewrite VFS and ext2 in parallel, and then announce an impending code freeze right after VFS has been radically changed."

To this last, Theodore Y. Ts'o replied, "For the record, it wasn't the "ext2 folks" that rewrote VFS, and most of the changes happened fairly early in the 2.3 kernel series. If you had been following the 2.3 kernel development, you should have had plenty of time to deal with the VFS changes. It wasn't like the VFS changes went in just before the code freeze."

Linus Torvalds also replied to Hans' complaint, saying:

Well, the ext2 folks actually didn't have any input on that side at all.

It's just that the people who DID have input on the new VFS layer (really just Ingo and me, although some others at least saw what happened) only used ext2 (and in my case NFS - NFS to some degree was actually the first "real" page cache client).

The page cache design was actually done with the notion of "let's not care how filesystems do this now, let's just care about how we _want_ it done", and the new code should be pretty easy to use for new filesystems. It can be a real bitch to integrate old filesystems to, though, although there was a lot of work to make it easier for the common cases.

5. Announce: Performance-Monitoring Counters Patch V. 0.5 Is Out

30 Aug 1999 (9 posts) Archive Link: "Announce: release 0.5 of x86 performance-monitoring counters patch"

Topics: SMP

People: Mikael Pettersson

Mikael Pettersson gave a pointer to http://www.csd.uu.se/~mikpe/linux/ and announced:

I've updated my x86 performance-monitoring counters patch, which provides user-space access to the performance-monitoring counters (PMCs) in Intel P5/P6, Cyrix, and WinChip processors. The current version is release 0.5 for kernel 2.3.15.

The current release provides

  1. virtual PMCs and TSCs that are saved and resumed automatically as processes are,
  2. hardware support for Intel, Cyrix, and IDT WinChip [the AMD K7 may be supported if I can find the appropriate documentation], and
  3. a "remote-control" feature which allows "monitor" processes to control and sample the PMCs of other processes. The code should be SMP-safe, although I have not been able to test it on an SMP machine.

(Item (3) is the main change since the previously announced release 0.2 from June.)

6. sysinfo Struct Incompatible Changes Between 2.2.x And 2.3.x

30 Aug 1999 - 31 Aug 1999 (3 posts) Archive Link: "2.3.16-1: 'struct sysinfo' ABI has incompatible change"

People: Peter Benie, Michael Elizabeth Chastain

Michael Elizabeth Chastain noticed that some new variables had been added to the sysinfo struct (passed as an argument in the sysinfo() system call) in 2.3.16, and pointed out that this change in the struct's size would cause 2.2.x clients to give strange results on 2.4.x servers, and would cause actual memory corruption in the case of 2.4.x clients of 2.2.x servers.

Peter Benie replied, "The sysinfo interface produces strange effects anyway - you can't tell if a particular field is valid or not without using a lookup table from kernel version numbers to fields. A version number, a set of flags or some other indicator could be added to the structure so applications tell what values the system filed in. (The caller would have to initialise the version to 0 to detect the current interface.)" david parsons pointed out that in properly implemented programs, a sysinfo() call compiled on an earlier kernel will still work when run on a later one.
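
Peter's versioning idea can be sketched in Python with the struct module. The layouts below are invented for illustration and are NOT the real struct sysinfo from any kernel; the sketch also assumes the version field sits in bytes that an older kernel never writes, so a caller that zeroes it first can tell which interface answered.

```python
import struct

# Hypothetical layouts for illustration only; not the real struct sysinfo.
V1_FORMAT = "<IQ"    # version, uptime
V2_FORMAT = "<IQQ"   # version, uptime, total_high (a field added later)


def old_kernel_fill(buf):
    """Legacy kernel: predates the version field and leaves it untouched."""
    struct.pack_into("<Q", buf, 4, 1234)          # writes uptime only


def new_kernel_fill(buf):
    """Newer kernel: fills every field and stamps the version."""
    struct.pack_into(V2_FORMAT, buf, 0, 2, 1234, 512)


def query(kernel_fill):
    """Call a simulated kernel and report only the fields it filled in."""
    # The caller allocates the largest layout it knows; a fresh bytearray
    # is all zeroes, so the version field starts out as 0.
    buf = bytearray(struct.calcsize(V2_FORMAT))
    kernel_fill(buf)
    version, uptime, total_high = struct.unpack_from(V2_FORMAT, buf, 0)
    info = {"version": version, "uptime": uptime}
    if version >= 2:
        info["total_high"] = total_high           # valid only from v2 on
    return info
```

A version of 0 after the call means the old interface answered, so the caller ignores the newer fields instead of consulting a table of kernel version numbers.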

7. BSDi Timestamp Bug

31 Aug 1999 - 1 Sep 1999 (5 posts) Archive Link: "Linux + ISDN + SLOW speed on BSDI systems"

Topics: BSD, FS: sysfs, Networking

People: David S. Miller

Bas Oude Nijeweme found that for any 2.2 or 2.3 kernel, he got a very low transfer rate on data coming from BSDI systems. Connections to non-BSDI boxes had normal transfer rates. David S. Miller suggested turning TCP timestamps off with the command 'echo "0" >/proc/sys/net/ipv4/tcp_timestamps', and explained, "There are some known bugs in BSDi's TCP timestamp implementation, and the consequence of this bug (when hit) is that packets are completely dropped and performance suffers." Bas reported ecstatic success. EOT.
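
The same toggle can be done from a script. This is a minimal sketch: the function name and its path parameter are my own, and writing to the real /proc file requires root. The path is a parameter so the function can be tried against an ordinary file first.

```python
from pathlib import Path

# Path taken from David's suggested command.
TCP_TIMESTAMPS = "/proc/sys/net/ipv4/tcp_timestamps"


def set_tcp_timestamps(enabled, path=TCP_TIMESTAMPS):
    """Write "1" or "0" to the sysctl file and return the value read back."""
    target = Path(path)
    target.write_text("1\n" if enabled else "0\n")
    return target.read_text().strip()
```

Calling set_tcp_timestamps(False) as root corresponds to the echo command above; the setting does not survive a reboot.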

8. tsx-11.mit.edu Upload Lag

31 Aug 1999 - 1 Sep 1999 (3 posts) Archive Link: "watchdog location"

People: Theodore Y. Ts'o

Michael Meskes said he had to change the primary location of his watchdog program, explaining that the TSX-11 FTP site had stopped processing its incoming queue, and email to the archive team went unanswered.

Theodore Y. Ts'o replied:

Umm, oops. <blush>

TSX-11 has been in <<media res>> most of the summer, since it's been in between a (half-finished) computer upgrade, and I (and the other archive team members) have been consumed with other activities. In my case, it was due to my changing jobs. I've updated the watchdog deamon, and I will be working on processing the rest of the incoming queue this week.

In the meantime, if you don't get an answer sent to the archive e-mail address, please try sending mail directly to me ([email protected]). I'll make sure it gets dealt with.

I should have the hardware transition finished this month, and things should be better after that. I apologize for the disruption and the inconvenience.

In the meantime, of course, if you choose to change the primary location of the watchdog deamon to another FTP server, I really can't blame you. Again, my apologies.

This explanation satisfied Michael, and he said he wasn't going to switch his primary site after all.

9. Files Appear Multiple Times Over NFS Under 2.3.x

31 Aug 1999 (3 posts) Archive Link: "NFS bug in 2.3.13?"

Topics: FS: NFS

People: Larry McVoy, David S. Miller

Larry McVoy noticed that with NFS under 2.3.13, readdir() would report a given file multiple times in the same directory, and he was also seeing "nfs_dentry_delete: src/LOD: ino=879232717, count=2, nlink=1" messages. David S. Miller replied that this was a known problem with 2.3.x, on his TODO list, and would definitely be fixed before 2.4 came out.

10. 'ping' DoS Exploit

31 Aug 1999 (7 posts) Archive Link: "Userlevel ARP request"

Topics: Security

People: Richard B. Johnson

Mike Panetta asked if the kernel would allow a user-mode program to test if a machine was on a given IP, based on an ARP on that IP. Richard B. Johnson said there was no problem, and suggested "if ping -c 1 hostname ; then do_something ; fi", explaining, "Warning. If you intend to create a program like the Micro$garbage stuff that pings every possible IP address on your network to see if some host is up, you will find a lot of unhappy network neighbors. ARP requests are broadcast. Without disabling your network, you can't filter them out. This means that they are received by every ISR on every machine and dumped on the floor. This can (read will) consume 80 to 90 percent of available CPU cycles on every connected machine, slowing all machines down to a crawl, when you have a hundred or more Micro$garbage machines pinging all possible hosts on the LAN. Micro$garbage has been contacted, threatened with Lawsuits, etc., but they don't even reply to registered mail."

11. PCI Enhancements For 2.3.16-pre1

31 Aug 1999 - 3 Sep 1999 (21 posts) Archive Link: "PATCH: PCI changes for pre-2.3.16-1"

Topics: PCI, Power Management: ACPI

People: Martin Mares, Linus Torvalds, David Hinds

Martin Mares posted a patch against 2.3.16-pre1, containing a batch of changes to the PCI subsystem. He summarized the changes:

  1. Updated Documentation/Changes.
  2. Added pcibios_assign_resource() to arch-dependent code. This function should handle address assignment when a new device (or, more frequently, a new [read: misconfigured] region on an old device) is found.
  3. Changed i386 bios32.c to use pcibios_assign_resource() for all allocations.
  4. Distinguish between pci_dev->name (only bus/slot/function, used for PCI subsystem initialization messages) and pci_dev->full_name (used for resource management stuff).
  5. Added pci_find_capability which is a library function to be used by drivers whenever they need to walk the PCI capability lists.
  6. Removed pci_dev->master flag.
  7. pci_scan_bus: If the bus already exists, don't attempt to scan it again (yes, there really exist machines with a single bus behind two different bridges [a host bridge and a fake PCI-to-PCI bridge], argh).
  8. /proc/pci works again.
  9. Modified the /proc/pci output format. I've left out several non-interesting status bits like devsel timing, but I tried to keep as close to the original format as possible not to break programs parsing this file (ugh!).
  10. Removed PCI_REGION_* macros.
  11. Added fixup for broken S3 cards reporting 32M region sizes instead of 64M.
  12. Changed allocation of PCI resources -- we really need to know about address ranges assigned to individual PCI buses to be sure which addresses are free and which are routed to which bus. Each bus now has four pointers to resources from which are all resources of devices on this bus allocated (see pci_find_parent_resource() for a matching algorithm). As usually, everything can be overriden in arch-specific code.
  13. Removed the i448BX fixup as it was superseded by the bus resource management changes.
  14. resource.c: ioport_resource should have IORESOURCE_IO flag set, iomem_resource likewise.
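
Item 5 mentions walking the PCI capability lists. Martin's patch is not reproduced here, so the following Python sketch is not his code: it works on a 256-byte image of a device's configuration space rather than real hardware, and the function name is mine. The register offsets are the standard ones as I understand them from the PCI specification (status register at 0x06, capabilities pointer at 0x34).

```python
PCI_STATUS = 0x06            # 16-bit status register
PCI_STATUS_CAP_LIST = 0x10   # status bit 4: device has a capability list
PCI_CAPABILITY_LIST = 0x34   # offset holding the first capability pointer
PCI_CAP_ID_PM = 0x01         # capability ID for power management


def find_capability(config, cap_id):
    """Return the offset of capability `cap_id` in a 256-byte config-space
    image, or 0 if the device does not advertise it."""
    status = config[PCI_STATUS] | (config[PCI_STATUS + 1] << 8)
    if not status & PCI_STATUS_CAP_LIST:
        return 0
    pos = config[PCI_CAPABILITY_LIST] & 0xFC   # low two bits are reserved
    for _ in range(48):                         # bound the walk against loops
        if pos < 0x40:                          # pointers into the header end the list
            return 0
        if config[pos] == 0xFF:                 # unreadable or absent entry
            return 0
        if config[pos] == cap_id:
            return pos
        pos = config[pos + 1] & 0xFC            # follow the "next" pointer
    return 0
```

Each capability entry starts with an ID byte followed by a pointer to the next entry; a pointer of zero ends the list. The iteration bound guards against a malformed list that points back at itself.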

To item #4, Linus Torvalds had a minor technical objection:

This is just crap.

You should NOT use bus/slot/fn in the name. For identification purposes, you can just use the numbers in the "struct pci_dev" thing directly - no need to try to continually get them into the name, because they are already available as pure numbers.

Numbers are numbers, and should be of type "int" or similar.

Text is text, and should be of type "char []".

Why do you have this need to mix the two?

Martin replied:

For "normal" purposes when we need to report information about the device itself (as in /proc/ioports), we surely should use the full name which contains only vendor and device.

For "special" purposes like error messages printed by PCI init code or by device probing in drivers, it's more important to let the user know the bus and the slot, not the real name of the device. Hence, you can think of pci_dev->name as of name of the _slot_ and pci_dev->full_name as of name of the _device_, but we really need both of them.

Linus agreed, but he added:

the thing I object to is to saving that special string away, when the information exists there as-is. I really don't see the advantage of

        printk("%s", dev->name);

over the much clearer

        printk("%d:%d:%d", BUS(dev), SLOT(dev), FN(dev))

Sure, the latter is slightly longer, but at least it's obvious what it will actually print out - it's clear that now we're printing out the _position_ of the card rather than it's "name".
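
Linus's BUS(), SLOT() and FN() are shorthand in his post, not quoted kernel macros. As a sketch of the same idea in Python, assuming the conventional encoding in which a combined "devfn" byte holds the slot in its upper five bits and the function in its lower three:

```python
def pci_slot(devfn):
    """Slot (device) number: upper five bits of devfn."""
    return (devfn >> 3) & 0x1F


def pci_func(devfn):
    """Function number: lower three bits of devfn."""
    return devfn & 0x07


def position(bus, devfn):
    """Format a device's position from its numbers, with no stored string."""
    return "%d:%d:%d" % (bus, pci_slot(devfn), pci_func(devfn))
```

For example, position(0, 0x3A) gives "0:7:2". The position is always derived from the numbers on demand, which is the point Linus was making.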

Elsewhere, under the Subject: PCI patch for 2.3.16, Martin posted another patch, explaining:

The PCI saga continues ... there is a small patch against 2.3.16:

  1. pci_dev->slot_name defined and used (I hope this variant is not confusing anymore)
  2. Use PCI BIOS IRQ routing table (if available) to find all the peer buses. This should be a way more reliable than the peer bridge magic we were using before (and still are using if there is no routing table).

The remaining things I'd like to solve before 2.4 (except for bug fixes, of course :)) :

  1. Introduce some mechanism for manual setting of the kernel view of IRQ's, so that people having motherboards with broken BIOSes can fix the things manually.
  2. Write a script creating devlist.h from the pci.ids file.
  3. Write helper functions for allocation/freeing of PCI resources. As far as I remember, we didn't agree yet whether there should be a single function allocating all the resources (and drivers with special needs not using it) or a function for allocating a single resource (called multiple times by each driver). I prefer the first solution as it can be implemented as a `pci_enable_device()' type function which could do things like waking up powered down devices as well.

To the first item of Martin's second list, Linus replied:

I expect this to blow up for a number of people - how certain are you that all BIOSes really do this right? I'd be surprised if there weren't problems with pointers to >64kB segments etc on various architectures, resulting in nonbootable systems.

As far as I know, the information should be in the ACPI tables, and those should be findable without any BIOS calls.

Quite frankly, I'm not willing to start to use more BIOS calls that are likely to be broken on some (unlikely) machines just to avoid problems on other (unlikely) machines. EVERY time we've had a BIOS interface, we've had trouble on some machines. We don't want to go down that path.

But David Hinds pointed out, "Actually, the PCI interrupt mapping tables are often available without using any BIOS calls. You can just scan 0xf0000-0xfffff for a special signature ('$PIR') and read the table directly: this is one of the defined ways of retrieving this information. I do this in the current PCMCIA code to get the interrupt assignments for CardBus bridges."
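
The scan David describes can be sketched in Python over a byte buffer standing in for the 0xf0000-0xfffff region; reading real physical memory is out of scope here. The function name is mine, and the layout details (tables aligned on 16-byte boundaries, a 32-byte header with a 16-bit table size at offset 6, and all bytes of the table summing to zero modulo 256) reflect my understanding of the $PIR format, so treat them as assumptions.

```python
import struct

PIR_SIGNATURE = b"$PIR"
PIR_HEADER_SIZE = 32     # assumed size of the fixed table header


def find_pir_table(region):
    """Return the offset of the first plausible $PIR table, or None."""
    for offset in range(0, len(region) - PIR_HEADER_SIZE + 1, 16):
        if region[offset:offset + 4] != PIR_SIGNATURE:
            continue
        # Total table size in bytes, little-endian, at offset 6.
        (size,) = struct.unpack_from("<H", region, offset + 6)
        if size < PIR_HEADER_SIZE or size % 16 or offset + size > len(region):
            continue
        # A valid table's bytes sum to zero modulo 256.
        if sum(region[offset:offset + size]) % 256 == 0:
            return offset
    return None
```

Checking the size and checksum as well as the signature keeps a stray "$PIR" string elsewhere in the region from being mistaken for a table.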

12. NFS Fixes

31 Aug 1999 - 2 Sep 1999 (10 posts) Archive Link: "NFSv3: Summary of recent bugfixes/updates to the NFSv3 patches..."

Topics: FS: NFS, Networking, SMP

People: Trond Myklebust, Steven N. Hirsch

Trond Myklebust reported:

A few longstanding, but important bugs have been fixed recently in the NFSv3 patches for linux. I'd therefore like to point out a few of these recent (past 3 weeks) fixes and cleanups:

SunRPC code
  1. NFS over TCP now works as expected. Previous versions were a tad unreliable, with transfer speeds being a factor 10 or so lower than the UDP (when the thing wasn't hanging 8-(). Now tests indicate a reduction in speed is of the order 20% on a local network, which is more in line with expectations. I'd be very interested in hearing other people's experiences here.
  2. Problem of waiting on socket buffer memory fixed. If a socket buffer runs out of write buffer memory (there's only 64k per socket), the code will now sleep (or send off requests to other NFS partitions) until there is enough free memory.

    This problem was the cause of a certain amount of network storms against Solaris machines. The latter prefer 32k wsizes (+ RPC call header), meaning that the socket could only buffer 1 request at a time. This lead to the transmission looping and sending off lots of headerless UDP fragments.

  3. Spinlocking for improved SMP safeness. In principle, the kernel lock should be held whenever we're in the RPCIOD code, or the NFS code. There was, however, some indication of corruption of wait queues on SMP-machines. Possibly due to manipulations of the queue in interrupts and/or bottom_halves?
NFS code
  1. A long-standing memory leak involving large reads has been fixed (if the read failed, the allocated pages beyond 4k were sometimes not being freed).
  2. Renaming of directories should now be fixed again.
  3. New stale inode detection. Is much more relaxed. This means that we avoid races with stale file handles. It should hopefully also work better than the old code against named sockets.

    In addition, the NFSv3 inode allocation code has been rewritten for greater clarity.

  4. The updating of 'atime' should hopefully be a bit more consistent. If we've been reading the file from the cache, then the cached value of atime won't get set back again by the next write/getattr/whatever statement that happens to return the server's idea of what atime should be.

The current version of the NFSv3 patches is 0.11.6. It should patch cleanly against stock linux-2.2.10, 2.2.11 and 2.2.12. You may pick up the latest NFSv3 patches at: http://www.fys.uio.no/~trondmy/src/linux-2.2.12-nfsv3.dif.bz2

Please note that in order to mount NFSv3 partitions, you will need a patched version of the 'mount' command. A Redhat 6.0 RPM can be found at http://www.fys.uio.no/~trondmy/src/nfsv3-mount/mount-2.9o-1.2.i386.rpm. The source RPM can be found in the same directory. In addition, there is a patch against stock 'mount-2.9o'.

Finally: please note the existence of a TODO list which should (hopefully) be updated regularly with known bugs/missing features etc. Please find the latter at: http://www.fys.uio.no/~trondmy/src/TODO

Steven N. Hirsch asked if these would apply cleanly to an otherwise-stock 2.2.12 kernel with HJ Lu's server patches previously added, then replied to himself after trying it: there had only been a few small conflicts, but even after installing the new kernel, a longstanding problem he'd been having (which he'd reported weeks before) still persisted. 'lockd' had been oopsing as soon as it tried to exercise locks against another server.

There was a bit of a bug hunt, some patches were exchanged, and finally some error output revealed the problem. Trond moaned, posted a patch, and said, "Another 'nlmclnt_' misnomer. I didn't remember that nlmclnt_async_call is used by the server code too when I put in the fix for 'setuid' processes which inherit open files.."

Steven reported success. End Of Thread.

13. Upcoming LDP Book: "Professional Linux Kernel Programming"

31 Aug 1999 - 1 Sep 1999 (3 posts) Archive Link: "Call for Authors: Open Publishing licenced Kernel Programming Book"

Topics: SMP

People: Gary Lawrence Murphy

Gary Lawrence Murphy announced:

Logic and Reason have prevailed: Macmillan has agreed to publish our Kernel Programming Book under the OPL, and I have agreed to use docbook as the authoring process to ease later migration of all sections of the book back into the LDP.

This is not a "module programming" book; Alessandro is much better at that than I could ever hope to be, and that is the way it should be. "Professional Linux Kernel Programming" is a guide for people who need to get under the hood and leverage the freedoms of the GPL in employing Linux for specialized applications. There will be material on driver programming, but this is not the primary focus of this book. The book will also focus on the 2.2/2.3 kernels and while we hope to also include whatever is known about 2.4/2.5, there is very little need to support 2.0/2.1 (Beck & al have already done this admirably)

We are looking at a Dec 31 deadline for all submissions, and we are currently seeking contributing authors for sections including

When we are done, we hope to leave behind a complete opus of papers in the LPD to cover programming issues in the entire kernel. We have a table of contents online at http://members.xoom.com/teledynamics/book and invite your comments.

Anyone interested in participating in this project is invited to contact me directly at [email protected] or by phone at 519-422-2723

14. PCI Layer Broken In 2.3.x For All But i386

31 Aug 1999 (2 posts) Archive Link: "arch/alpha/kernel/bios32.c won't compile (2.3.15,2.3.16-1)"

Topics: PCI

People: Martin Mares, Richard Henderson

Thorsten Kranzkowski reported that arch/alpha/kernel/bios32.c wouldn't compile in 2.3.15 and 2.3.16-pre1. Martin Mares explained, "2.3.15 contains new PCI layer code and only the i386 port has been updated to work with it. Richard Henderson has already fixed Alpha code, so expect it in the next kernel release."

15. NFS In The Linus Tree

31 Aug 1999 - 3 Sep 1999 (27 posts) Archive Link: "NFS under 2.2.12"

Topics: Disk Arrays: RAID, FS: NFS

People: Alan Cox, Sven Geggus, Matthew Kirkwood, Frank van Maarseveen, Miquel van Smoorenburg, David S. Miller

Robert K. Nelson was having NFS trouble, and in the course of discussion Alan Cox said, "2.2.12 itself seems to be rock solid as far as the 2.2.12 knfsd goes. I've also run the patches and tools from HJ Lu with no problems. The only reason they don't go in is the tool change." Sven Geggus pointed out that distributions like Red Hat were shipping without the new tools, so "If you get a 2.2.x Kernel from kernel.org running on a Redhat 6.0 Basesystem. NFS will no longer work :("

Alan confirmed this, saying, "Yep. Because the knfsd shipped with 2.2.x isnt good enough, but if I upgrade stuff in the kernel to need new tools people get all pissy and obnoxious as they did with raid. So you can apply the knfs patches yourself. I strongly recommend you and every vendor does that."

Matthew Kirkwood expressed his views:

At risk of turning linux-kernel into a me-too-fest I would very much like to see the knfsd patches go into 2.2.next (and also 2.3.next).

Alan has put in a lot of hard work to move 2.2 towards stability and I, for one, am more than a little disturbed to see that impeded by a few whiners who can't be bothered to upgrade one small support package.

At some stage, the distributors are going to have to bite the bullet and issue kernel updates. There were simply too many bugs fixed between 2.2.5(+bits) and 2.2.13pre for Red Hat (for example) to sit on these kernels until 6.1 (I hope).

So they're left with a choice of:

  1. 2.2.something (say 13) as-is. knfsd doesn't work.
  2. 2.2.13 + the knfsd patches that shipped in their original kernel. knfsd works, but not as well as it might.
  3. 2.2.13 + the new knfsd patches. They'll also have to issue a knfsd update. Everything works as well as is possible with the software currently available.

Option 3 is the only viable solution as far as I'm concerned.

Perhaps if 2.2.13 proves solid, we can leave the old-tool-people there, and push the knfsd patches into 2.2.14. That way, nobody needs to feel left out.

Frank van Maarseveen said to Alan, "Ahh, now I understand why these patches haven't gone into the kernel: you're concerned about the compatibility with existing linux 2.2.x distributions and installations. I've always thought the patches weren't stable enough yet."

Miquel van Smoorenburg also replied to Alan, saying:

Well Alan, with RAID the upgrade was pretty dangerous - you had to mess around with converting important, critical config files by hand (mdtab -> raidtab) and people feared they could lose their data. At least I did. And if you created a new style RAID partition there was no way to go back to a kernel < 2.2.12.

All of those facts do not apply to the kernel NFS server upgrade. It's just a tools upgrade, there is _no_ chance of data loss at all. If you want to go back to 2.2.12, downgrade the tools, be done.

David S. Miller added:

Another factor which people have to keep in mind is how many people are using RAID heavily in 2.2.x and not using the new RAID stuff.

And one cannot judge this simply by the loudness or number of the people who complain (unhappy people are loud, happy people are mostly silent). My personal judgement says there are more people slaving away at adding the RAID patches to the standard kernel than those who need to go through the RAID upgrade process.

16. kerneld

31 Aug 1999 - 1 Sep 1999 (3 posts) Archive Link: "Another Stupid Question - kerneld"

People: Robert Dinse, Riley Williams

Robert Dinse asked, "If you don't have loadable module support compiled into the kernel, is there any need to run kerneld?" and Riley Williams replied, "No - and if you're running a 2.2 kernel, there's NEVER any need to run kerneld either..." EOT.

17. Progress On Driver For Maestro Audio Chip

1 Sep 1999 (1 post) Archive Link: "more maestro pounding"

Topics: Sound: Maestro, Sound: OSS

People: Zach Brown

Zach Brown gave a pointer to the OSS Linux driver for the ESS Maestro family of audio chips and announced:

more progress on the OSS/maestro front. Output is feeling much better now. It doesn't leak memory and doesn't jitter when you don't feed it quickly enough. Happiness.

Its still quite sick on some chip/codec pairs, this is annoying me to no end. I'd appreciate if people could send me blindingly specific reports on hardware success/failure. Its expected to work on most simple maestro 1/2 single codec setups, but when you start getting into 2e multi codec docking nuttiness the forecast gets grim.

recording is still hopelessly broken and there are still weird mixer interface barfos, both of which I intend to address next.

evil mmap() hacks may happen eventually also.

18. Longstanding CDROM Bug Fixed For 2.3.x

1 Sep 1999 - 2 Sep 1999 (3 posts) Archive Link: "CDROM bug in 2.3.x"

Topics: Disks: SCSI

People: David S. Miller, Jens Axboe

David S. Miller said:

This one has been there for a while, and I finally sat down just now to track it down.

What I see is that the generic cdrom driver is passing down packet commands with a buffer length which is negative. I see that some of the cgc command building routines are setting negative buffer lengths on purpose.

This is illegal and is confusing SCSI drivers quite badly, because the scsi command will have SCpnt->request_bufflen < 0 and this will be given to the controller for the DMA request.

I don't know what the intentions were here, but this does need to be fixed somehow so I leave it to Jens to figure out the correct fix.

Jens Axboe replied, "It sets the negative buflens for the ide-cd driver, which otherwise assumes that the transfer is going in the wrong directions. It did seem odd to me at first, but I wasn't aware that the SCSI drivers got confused. I'll fix it up," and replied to himself with a patch the next day, adding, "ide-cd.c had some really odd code, where it expected transfers going to the drive to have negative buffer lengths..."
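
The confusion in this thread comes from using the sign of a length to carry the transfer direction. A minimal Python sketch of the alternative, where direction travels in its own field and the length is always non-negative, might look like the following. It is only an illustration: PacketCommand, Direction and from_signed_length are hypothetical names, not the kernel's cdrom or ide-cd structures, and this is not Jens's patch.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    FROM_DEVICE = "from device"   # data flows drive -> host
    TO_DEVICE = "to device"       # data flows host -> drive


@dataclass(frozen=True)
class PacketCommand:
    """Hypothetical packet command: the length is never negative."""
    opcode: int
    buflen: int
    direction: Direction

    def __post_init__(self):
        # A lower layer (e.g. one setting up DMA) can rely on this check.
        if self.buflen < 0:
            raise ValueError("buffer length must be non-negative")


def from_signed_length(opcode, signed_len):
    """Translate the old convention described in the thread
    (negative length means 'going to the drive') into an explicit field."""
    direction = Direction.TO_DEVICE if signed_len < 0 else Direction.FROM_DEVICE
    return PacketCommand(opcode, abs(signed_len), direction)
```

With this shape, a caller that still thinks in signed lengths goes through from_signed_length() once, and everything below it sees only a plain byte count plus a direction.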

19. Modutils Maintainership Still In Dispute

1 Sep 1999 - 2 Sep 1999 (16 posts) Archive Link: "modutils/depmod doesn't support /lib/modules/*/usb"

People: Keith Owens, Bjorn Ekwall

Last week in Issue #33, Section #19 (25 Aug 1999: New Modutils Maintainer), Keith Owens took over as maintainer of the modutils package. This week Bjorn Ekwall (the previous maintainer) resurfaced, apparently unaware of Keith's actions. Bjorn announced an upcoming release of modutils, and (after a few folks put on asbestos jackets) Keith replied peacefully, "If Bjorn Ekwall wants to continue to maintain modutils then I will drop my 2.3 tree. If Bjorn wants to hand it over to me, I will merge his final 2.2 changes into my 2.3 tree. If somebody else wants to maintain modutils they can decide which versions they will use. No way am I going to fork this code."

There was no reply, so the issue seems still up in the air.

20. devfs V. 119 Announced

1 Sep 1999 (1 post) Archive Link: "[PATCH] devfs v119 available"

Topics: FS: devfs

People: Richard Gooch

Richard Gooch gave a pointer to his kernel patches page, and announced the release of version 119 of the devfs patch.

21. Uniform Driver Interface 1.0 Gets Cold Shoulder

1 Sep 1999 - 2 Sep 1999 (15 posts) Archive Link: "Universal Driver Interface spec available"

Topics: Disks: SCSI, Networking, Real-Time: RTLinux

People: Jeff Garzik, David S. Miller, Alan Cox, Bret Indrelee

Jeff Garzik gave links to the Uniform Driver Interface homepage, the UDI specs themselves (in PDF format), and a newsalert.com story covering the release of version 1.0; and said, "Members of Project UDI today announced the release of the UDI (Uniform Driver Interface) 1.0 Specification. This Specification is the culmination of a multi-company development effort designed to provide device driver portability for existing and future system configurations. UDI supports today's key I/O technologies and is designed with an extensible architecture that can easily accommodate future I/O technologies and products."

David S. Miller was opposed to the whole idea, saying, "No thanks, IMHO OS neutral driver interfaces are a nice idea but they can only lead to mediocrity. (Yes I have read and understand how your stuff works, the problem will still be there)." And Alan Cox added, "Not sure why anyone thinks this is Linux relevant 8) - other than it will help to make our drivers even faster than the competition if they adopt it. Have a read, but keep a bucket handy"

Bret Indrelee replied more moderately:

Actually there are a couple of reasons it is relevant to Linux.

There is already code out there to run UDI drivers on a Linux system. If you look, one of the demonstration systems was a Linux system. Intel did the port.

Intel has also made statements to the effect of reference drivers for it's hardware are most likely going to be UDI. If this happens, you are going to want UDI support just so you can bring up new Intel hardware fast.

With some work on the UDI interface, you should be able to make it so all UDI drivers are RTLinux friendly. Also, you could change the mutex/semaphore/locking code as many times as your little hearts desire without having to rewrite every UDI driver. The UDI interface has all of the mutex and timed waits happening outside the driver, in the UDI support routines.

I would have thought that people would rather spend their time working on improving the operating system rather than rewrite an ethernet or SCSI device driver for the umpteenth time. That is what UDI is intended to allow you to do, focus on the OS and use standard drivers.

Alan was unsympathetic, saying:

I've read the UDI spec in detail. Its about fit for BIOS loaders but thats it. If intel provide source it will be worth porting their drivers to the OS properly. If intel don't provide source we don't care anyway.

You simply cannot express stuff like the linux parport sharing in UDI, its got no equivalence. Out goes parallel devices. You can't portably express the Linux tty layer, out go tty devices. It's too slow for serious networking and it can't properly express our scsi stuff and make good use of it.

So what are you going to do with it. Joysticks ?

Elsewhere, he added, "I read the 0.9 spec. I thought "oh dear". I read the 1.0 spec and thought "oh well""

22. Some Explanation Of Threading

1 Sep 1999 - 3 Sep 1999 (9 posts) Archive Link: "Re: Threads in Linux"

Topics: Executable File Format, SMP, Virtual Memory

People: Matthew Kirkwood, Jim Nance, Vladimir Dergachev

Vladimir Dergachev asked a number of questions about Linux threading. First, he asked if two threads from the same process could run on two different CPUs. Matthew Kirkwood replied, "Simple answer: yes. Fuller answer: there is no such thing as "two threads from the same process". Under Linux, a thread is a process is a thread. Threads commonly share VM and open files, but they may share more or less, depending upon the application. Just as processes may run concurrently on different CPUs, so may threads, them being one and the same."

Vladimir also asked what happened in the event of cache thrashing; specifically he wanted to know if any optimizations were made for it. Matthew replied that this was a user-space situation. He explained, "There is code to keep the processes on the same CPU in some circumstances, but cache-thrashing is largely an application effect. If you have your hogs battering away at the same buffer, then the bus could well get swamped."

Next, Vladimir asked if threaded programs were guaranteed to not run slower on SMP than on UP, and also asked if the situation was like that on all releases (2.0.*, 2.2.* and 2.3.*). Matthew replied that there were no guarantees and never had been, because it wasn't a real time system. Jim Nance added some clarification to the statement that the situation had always been the same, with, "Well sort of. Linus always planned to do threads this way, but its only been relatively recently that working pthread libraries have appeared to take advantage of it. For a long time linux'es pthread implementation was implemented in user space and all threads ran inside of 1 process. I think this was still the case when we switched from a.out to ELF libraries. At some point Xavier Leroy wrote a clone() based pthread implementation that is the basis of what we use today. I do not think that it was a part of libc5, but you could patch up a libc5 system to use it. Xavier's package is a standard part of glibc, so you have a reasonable pthreads implementation on all newer Linux distributions."

Finally, Vladimir also asked if there were any way for a process to request that all its threads be run on the same CPU. Matthew reiterated that processes and threads were the same thing, and added, "That aside, there are patches which allow processes to request that they run on only a certain set of CPUs. You could use them, though really, it's the kernel's job to ensure that the application doesn't have to do things like this."
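
For readers who want to see what such a request looks like from user space, here is a small Python sketch. It assumes a present-day Linux system, where the request is made through the sched_setaffinity(2) system call (wrapped by Python as os.sched_setaffinity); it says nothing about the interface of the patches Matthew mentions. run_pinned is a hypothetical helper name.

```python
import os


def run_pinned(cpu, work):
    """Run work() with the caller restricted to a single CPU, then
    restore the previous CPU set. Linux-only (assumption)."""
    previous = os.sched_getaffinity(0)      # 0 means "the caller"
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu})
    try:
        return work()
    finally:
        # Put the original mask back even if work() raised.
        os.sched_setaffinity(0, previous)
```

For example, run_pinned(min(os.sched_getaffinity(0)), some_function) runs some_function on the lowest-numbered CPU the caller is allowed to use. Threads created while the restriction is in place inherit it.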

23. DIPC Goes GPL

1 Sep 1999 (1 post) Archive Link: "DIPC 2.0-pre10"

Kamran Karimi gave a pointer to the Distributed Interprocess-Communication (DIPC) homepage and announced that DIPC 2.0-pre10 was available at ftp://orion.cs.uregina.ca/pub/dipc. He added that as of version 2.0-pre5, DIPC was being released under the GPL.

24. Storing Driver Data In /proc

2 Sep 1999 (19 posts) Archive Link: "[patch] RFC: /proc/module namespace"

People: Jeff Garzik

Jeff Garzik posted a patch against 2.3.16 to create a /proc/module directory, with functions to allow a module to create its own directory /proc/module/{MODULE_NAME} in which to put its data. Jeff wanted comments on the proposal and the code.

Fuzzy Fox asked if this meant that a driver's data would go in a different place depending on whether it was compiled as a module or as part of the kernel. There was some discussion, in which it was pointed out that drivers and modules were not necessarily interchangeable concepts. A little over 12 hours after posting his original patch, Jeff posted a slightly revised version that would use the /proc/driver directory instead of /proc/module.

25. MVP4 sound Support

2 Sep 1999 (6 posts) Archive Link: "MVP4 sound Support"

Topics: Sound: SoundBlaster

People: Anthony Barbachan, Jeff Garzik, Alan Cox

Anthony Barbachan said, "I've got a motherboad with an MVP4 chipset. This chipset has integrated video and sound. Its suppose to have backward Soundblaster compatability however sound still doesn't work. I was wondering if anybody was working on a sound driver for this integrated sound in this chipset. If not I might write one myself." Jeff Garzik replied, "Funny you should say that. :) I just e-mailed such a driver to Alan Cox for inclusion in future kernels. You can get stable version 1.0.0 at http://havoc.gtf.org/garzik/kernel/files/vt82cxxx-audio-2.3.16.patch.gz" . Anthony replied, "Alright!!! Great the sound was the only thing left to have this system fully capable."

26. ACPI In The Kernel

2 Sep 1999 - 3 Sep 1999 (3 posts) Archive Link: "ACPI still breaks PIIX4 IDE in 2.3.16"

Topics: Disks: IDE, Power Management: ACPI, USB, Virtual Memory

People: Simon Richter, Jeff Garzik

In the course of discussion, Jeff Garzik said that real ACPI was going into the kernel soon, but Simon Richter replied, "As the guy who told you that ACPI would go off soon I need to put that into the right direction: Real ACPI support is under way, but still far from being usable. :-/ The patch I expected to be able to send out yesterday did not go out, for reasons I state in my next post."

Under the Subject: The future of ACPI4Linux, Simon said:

As some of you might already know, there are now (at least :-) ) two concurrent approaches to implementing ACPI. I do not think that concurrently developing them is a good idea, and I hope you all agree on this. I will summarize the advantages and disadvantages of these solutions briefly:

The "classic" ACPI4Linux patch: This is a rather huge patchset which already does some things that do not require the AML VM to work. Its greatest strength is that it contains much useful code, its greatest weakness size and complexity. I believe that following this track will need enlarging the kernel by about 300k before we can even dare to activate any real features.

A modular approach: This is a solution that is entirely module-based as opposed to the ACPI4Linux patch which is not modularisable. It should work on 95% of all PCs without patching the kernel, and on 100% with a 2k sized patch to memory management. The current concept leaves the effective work to a userspace daemon, the module itself just registers the interrupts and reaches the events down to userspace. Its greatest strength is simplicity, its greatest weakness the fact that ACPI devices need to be initialized by an ACPI enumerator to work correctly, which would require e.g. the IDE driver to reinitialize the device while the machine is already up and running, a thing I would not like.

Both patches will require greater changes to other drivers, such as IDE and USB later on.

I, personally, would vote for the ACPI4Linux patch because testing is easier if you play with the hardware at a stage where you cannot do much damage, but other people, including Max Berger, who is the "kernel guy" of the project (I'm more into userland code) think different on this issue [Max, Andy, I think it would be good to post something more about the module here].

He asked for opinions, but there was no reply.

27. Support For Kallisto GPS Cards

2 Sep 1999 (1 post) Archive Link: "New Driver For Kallisto GPS Card"

David Skingsley gave a pointer to his kallisto page, and announced he had written a driver for the Kallisto GPS card, accurate to between 1 and 5 microseconds.

28. Satisfied User Sees Speed Improvements In 2.3.x

3 Sep 1999 - 4 Sep 1999 (4 posts) Archive Link: "Responsiveness."

People: Andrea Arcangeli, Bill Huey

Bill Huey noticed a tremendous leap of responsiveness between 2.3.10 and 2.3.16; Andrea Arcangeli pointed out that his page-LRU patch and scheduling patch both came into the kernel around 2.3.16.

29. User-Mode Kernel V. 2.3.15-2um Announced

4 Sep 1999 (1 post) Archive Link: "user-mode kernel 2.3.15-2um"

People: Jeff Dike

Jeff Dike gave a pointer to his Kernel Profiling With Gprof page, his Kernel Code Coverage Analysis With Gcov page, and the Linux User-Mode Kernel page; and announced that profiling with gprof, test coverage with gcov, and networking in general all worked on the latest version (2.3.15-2um) of his user-mode kernel.

30. Some Explanation Of Locking

4 Sep 1999 - 5 Sep 1999 (14 posts) Archive Link: "SMP linux help"

Topics: SMP

People: Sushil Agrawal, Victor Khimenko, Matthew Wilcox, Mark Cooke, Jamie Lokier, Ingo Molnar, Nate Eldredge

Sushil Agrawal noticed that in order to be SMP safe, the kernel was obliged to use various locking mechanisms; for example he pointed out that do_fork(), as well as many other system calls, used the lock_kernel() function. He asked, "What does lock_kernel() do? If it is used to synchronize the access to kernel data structures from other processes on other processors, then why, after having called this function, we again try to do some locking like calling spin_lock(&lastpid_lock) in get_pid()?"

Victor Khimenko explained, "Initially just one kernel lock protected everything in kernel. It was not SMP-friendly. So global kernel lock was replaced by bunch of smaller locks. Some things are still protected by global lock but some things need personal lock (things access for which are time-critical mostly)... It's not trivial change since you need to take care about potential deadlock situations..."

Sushil asked if this change meant that spin_lock(&lastpid_lock) and read_lock(&tasklist_lock) in the get_pid() function in fork.c were redundant, and Matthew Wilcox explained that the spinlocks were used so the kernel lock could be dropped.

Nate Eldredge had also been wondering about locking, and asked about the situation on SMP systems when one CPU had a kernel lock. He wanted to know how the other CPU was handled. Matthew explained, "If another CPU holds the kernel lock, we spin until it has released it, just as any other spinlock (<asm-i386/smplock.h>). lock_kernel prevents any other processor from simultaneously executing any other code that is also protected by the big kernel lock." He added, "spinning means constantly polling a lock until it's available."
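
Matthew's definition of spinning can be shown in a few lines. The following is a user-space Python sketch of the idea only, not kernel code: it polls a threading.Lock without ever sleeping on it. spin_lock here is a hypothetical helper that happens to share its name with the kernel primitive.

```python
import threading


def spin_lock(lock):
    """Busy-wait until `lock` is acquired.

    Returns the number of polls that failed before the lock was obtained."""
    failed_polls = 0
    # acquire(blocking=False) returns immediately: True if we got the
    # lock, False if somebody else still holds it.
    while not lock.acquire(blocking=False):
        failed_polls += 1
    return failed_polls
```

The caller burns CPU for as long as the lock is held elsewhere, which is why spinning only makes sense when the holder is expected to release the lock quickly.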

Mark Cooke also explained:

When you have the kernel lock you can guarantee that accesses to anything else protected by the kernel lock can't happen.

So, generally you:

  1. Acquire the kernel lock. You can now guarantee that no other access to code/data protected by the kernel lock can happen.
  2. Mess with shared lists / page tables / etc
  3. Let go of the lock

The big kernel lock used to protect lots of stuff. Ie, acquiring the lock used to stop a whole lot of stuff running.

These days, the big lock is being replaced with more localised locks, to improve the granularity of access. Hence, you can have 1 CPU doing one kernel task, and another CPU doing something unrelated, provided they are protected by different locks.
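
The difference in granularity Mark describes can be demonstrated with a toy user-space model in Python. Subsystem and other_can_run are hypothetical names, and Python locks block rather than spin, so this only illustrates who gets excluded by which lock, not how the kernel implements it.

```python
import threading


class Subsystem:
    """A piece of shared state guarded by whatever lock it is given."""

    def __init__(self, lock):
        self.lock = lock
        self.value = 0

    def update(self):
        with self.lock:
            self.value += 1


def other_can_run(first, second, timeout=0.5):
    """Hold first's lock and report whether second.update() manages to
    complete on another thread within `timeout` seconds."""
    with first.lock:
        worker = threading.Thread(target=second.update, daemon=True)
        worker.start()
        worker.join(timeout)
        finished = not worker.is_alive()
    # The lock is released here, so a blocked worker can now finish.
    worker.join()
    return finished
```

If both subsystems are built around the same lock object (the "big lock" case), other_can_run() reports False, because the second update has to wait. With one lock per subsystem it reports True, since the two updates never contend.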

Jamie Lokier also clarified:

Actually the kernel lock is different to other spinlocks.

When a task holding the kernel lock blocks, i.e. calls schedule(), the lock is dropped. When the task is run again, the lock is reacquired.

Ordinary spinlocks are different. You must never call schedule() while holding an ordinary spinlock. This means no wait_on_blah() or down() calls in the critical region, and no non-atomic memory allocation. These rules don't apply to semaphores used as mutexes.

Ingo Molnar said that the kernel lock could best be described as a "recursive-spin-semaphore". He added, "We might want to change it to a real semaphore in the future, now that the locked regions are getting much longer (multi-millisec). 2.5 stuff i guess."
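
The two properties that set the kernel lock apart in Jamie's and Ingo's descriptions, namely that the holder may take it again (recursion) and that it is dropped while the holder blocks and retaken afterwards, can be modelled in user space. The Python sketch below is a toy under stated assumptions: BigLock and its bookkeeping are hypothetical, schedule() takes a callable standing in for "whatever the task blocks on", and the underlying threading.Lock sleeps instead of spinning.

```python
import threading


class BigLock:
    """Toy model of a recursive lock that is dropped across a blocking call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None    # thread ident of the current holder, if any
        self._depth = 0       # how many times the holder has taken the lock

    def lock_kernel(self):
        me = threading.get_ident()
        if self._owner == me:          # recursive acquisition by the holder
            self._depth += 1
            return
        self._lock.acquire()
        self._owner = me
        self._depth = 1

    def unlock_kernel(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("unlock_kernel() called by a non-holder")
        self._depth -= 1
        if self._depth == 0:
            self._owner = None
            self._lock.release()

    def schedule(self, blocking_call):
        """Run blocking_call() with the lock dropped, then take it back
        at the same recursion depth as before."""
        me = threading.get_ident()
        saved = self._depth if self._owner == me else 0
        if saved:
            self._depth = 0
            self._owner = None
            self._lock.release()
        try:
            return blocking_call()
        finally:
            if saved:
                self._lock.acquire()
                self._owner = me
                self._depth = saved
```

A task that has called lock_kernel() twice and then enters schedule() lets other tasks take the lock in the meantime, and finds itself holding it at depth two again when schedule() returns. An ordinary spinlock offers neither behaviour, which is why Jamie warns against calling schedule() while holding one.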

31. Fixing The SCSI Layer

4 Sep 1999 - 6 Sep 1999 (13 posts) Archive Link: "Fixing the SCSI layer"

Topics: Disks: SCSI

People: Alan Cox, Matthew Jacob

Alan Cox posted some rough patches, and reported:

I've been trying to debug a high performance fibrechannel HA under Linux. I'm now in a situation where its 50/50 whether the bugs remaining are in the scsi layer or the adapter. The scsi code however is so hard to read its going to be quicker to clean it up and fix bits than to try and debug it further. Is anyone else trying to clean up the scsi mess currently?

I'm not going to write some new scsi layer, sorry someone crazier can do that just to clean up the cruft, including stuff like banging it all through indent and extracting common code into functions, moving long complex conditional code into functions etc.

Terry Hardie was struggling with a driver for the OnStream 50GB tape drives, but looking over the SCSI midlayer was getting him nowhere. He asked if he should wait for that code to get tidied up, or whether that would be too far in the future. Alan replied, "2.3.17 is when the cleanup starts."

Elsewhere, Matthew Jacob replied to Alan's original post, asking what specific problem was being addressed. Alan replied, "Firstly to make it debuggable by cleaning the code up." Later, he added, "Until the code has been cleaned up, who knows. Linus certainly wants the midlayer structure to die so that scsi disks become ordinary block I/O devices and themselves call into scsi helper routines if they wish."

32. Race Conditions In File Creation In 2.3.x

4 Sep 1999 - 5 Sep 1999 (7 posts) Archive Link: "Race conditions in file creation in 2.3."

Topics: FS: NFS, FS: ext2

People: Manfred Spraul, Alan Cox, David S. Miller

Alan Cox reported that under 2.3.16, the command, "cp -vrf /usr/bin /mnt & cp -vrf /usr/bin /mnt & cp -vrf /usr/bin /mnt" would give a lot of 'file not found' errors when 'cp' issued the create() system call. David S. Miller asked if this was going over NFS, but Alan replied that it was a straight ext2 to ext2 copy. Manfred Spraul suggested that write() and truncate() might be the culprits instead of create(), since they weren't synchronized. Alan replied that he now thought the error was coming from open(). Manfred posted a small program to reproduce the error, and added that this had been a known problem since approximately 2.3.7; and gave this explanation:

Unfortunately, it seems difficult to fix it properly:

  1. affected files: several places outside VFS access i_sem directly: [at least]
  2. different types of files have completely different synchronzation requirements, so I think flags should be added to VFS which sync calls are required.
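
The concurrent-copy scenario Alan described can be approximated with a short stress test. The Python sketch below is not Manfred's program (which is not reproduced here); it is a hypothetical harness, hammer_creates, in which several threads create and truncate the same set of files at once and record every "file not found" (ENOENT) error that open() returns. On a kernel without the race, the list of failures should come back empty.

```python
import errno
import os
import threading


def hammer_creates(directory, names, workers=3, rounds=50):
    """Have `workers` threads repeatedly create/truncate the same files.

    Returns the paths for which open() failed with ENOENT."""
    failures = []
    guard = threading.Lock()          # protects the failures list

    def worker():
        for _ in range(rounds):
            for name in names:
                path = os.path.join(directory, name)
                try:
                    # Roughly what cp does for each destination file.
                    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
                except OSError as exc:
                    if exc.errno != errno.ENOENT:
                        raise
                    with guard:
                        failures.append(path)
                else:
                    os.write(fd, b"x")
                    os.close(fd)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

Pointing it at a scratch directory on the filesystem under test, with three workers to mirror the three background cp commands, gives a quick yes/no answer without copying all of /usr/bin.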

33. Assembler Bug

4 Sep 1999 - 5 Sep 1999 (7 posts) Archive Link: "[ix86 only] as86 bug?"

People: Riley Williams, Alan Cox

Riley Williams reported:

I've discovered a bug in the as86 assembler which was causing my attempts to make the ix86 boot code correctly handle kernels larger than 1M to play havoc with my system. Since I've now isolated the effects of the bug, I'm reporting it for those interested therein.

The problem is with as86's handling of the .org directive, and specifically with the case where the directive specifies an address lower than what would be used in the absence of the .org directive. When this occurs, as86 does one of two things, with there being no apparent pattern to its choice of which it does:

Since I was tweaking the i386 boot sector when this bug hit me, I was less than amused...

Alan Cox warned, "Please use dev86. The old as86 tools vendors ship are about 3 years out of date and unmaintained." Later, he added, "All of them, every vendor."


Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.