Kernel Traffic #333 For 2005/11/13

By Zack Brown

I'm definitely only interested in machines that are out there, not some theoretical issues.

--Linus Torvalds

Table Of Contents

Mailing List Stats For This Week

We looked at 1656 posts in 10MB.

There were 580 different contributors. 229 posted more than once. The average length of each message was 104 lines.

The top posters of the week were:

41 posts in 181KB by andrew morton
40 posts in 165KB by linus torvalds
36 posts in 129KB by jeff garzik
36 posts in 128KB by greg kh
36 posts in 282KB by hugh dickins

The top subjects of the week were:

42 posts in 200KB for "vfs: file-max limit 50044 reached"
42 posts in 206KB for "[patch] use of getblk differs between locations"
40 posts in 176KB for "2.6.14-rc4-rt1"
36 posts in 221KB for "[discuss] re: x86_64: 2.6.14-rc4 swiotlb broken"
35 posts in 219KB for "[patch 0/8] nesting class_device patches that actually work"

These stats generated by mboxstats version 2.8

1. HT And Multi Core Detection Cleanup

Wed Oct  5  - Thu Oct 13  (14 posts) Archive Link: "[Patch] x86, x86_64: Intel HT, Multi core detection code cleanup"

People: Suresh B. Siddha, Andi Kleen

Suresh B. Siddha said:

This patch cleans up the x86 and x86_64 Intel HT and Multi Core detection code. These are the areas that this patch touches.

a) Cleanup and merge HT and Multi Core detection code for x86 and x86_64

b) Fields obtained through cpuid vector 0x1 (ebx[16:23]) and vector 0x4 (eax[14:25], eax[26:31]) indicate the maximum values and might not always be the same as what is available and what OS sees. So make sure "siblings" and "cpu cores" values in /proc/cpuinfo reflect the values as seen by OS instead of what cpuid instruction says. This will also fix the buggy BIOS cases (for example where cpuid on a single core cpu says there are "2" siblings, even when HT is disabled in the BIOS).

c) Fix the cache detection code assumption that number of threads sharing the cache will either be equal to number of HT or core siblings.

Andi Kleen noticed some #ifdefs that had to go. He said Suresh was trying to share too much code between architectures. They went back and forth a bit on what exactly Andi wanted, and at one point Andi said, "I would prefer if the Intel CPU detection support wasn't distributed over so many small files. If you prefer to share it put it all into a single file and share that. But please only for code that can be cleanly shared without ifdefs." He added, "Also in general it would be better if you first did the cleanup and then as separate patches the various functionality enhancements. That makes the changes easier to be reviewed and it helps in binary search when something goes wrong." Suresh replied, "Let's defer this code sharing to some other time. I want to make sure that -mm tree (and finally 2.6.15) picks up these enhancements first, before I start my vacation :)" He posted a patch, and he and Andi continued to discuss the technical details.

2. Support For Upcoming MPCBL0010 Single Board Computer

Thu Oct  6  - Fri Oct 14  (19 posts) Archive Link: "Fwd: Telecom Clock Driver for MPCBL0010 ATCA computer blade"

People: Mark Gross, Alexey Dobriyan, Greg KH

Mark Gross said:

Attached is a simple character driver for possible inclusion in your MM tree.

This driver is specific to the MPCBL0010 that will start shipping this fall.

The telecom clock is a special circuit, line card PLL, that provides a mechanism for synchronization of specialized hardware across the backplane of a chassis of multiple computers with similar special circuits. In this case the synchronization signals get routed to multiple places, typically to pins on expansion slots for hardware that knows what to do with this signal. (SONET, G.813, stratum 3...) and similar signaling applications found in telecom sites can use this type of thing.

The actual device is hidden behind the FPGA on the motherboard, and is connected to the FPGA via I2C. This driver only talks to the FPGA registers.

Alexey Dobriyan noticed that Mark's driver provided both an ioctl and SysFS interface. He said, "Can you drop ioctl part of interface and leave only sysfs one?" Mark replied, "I would like to keep it for a little while because the hardware validation guys are still using test harnesses written for the 2.4 kernel driver version. However, I am willing to pull this block if that would help in getting this driver into the kernel."

Elsewhere Greg KH had a lot of small suggestions, which Mark addressed, at which point Greg said, "This patch looks good, I have no objections to it."

3. Ruminations On Pathname Encodings For git

Fri Oct  7  - Fri Oct 14  (33 posts) Archive Link: "[RFC] embedded TAB and LF in pathnames"

People: Linus Torvalds, Paul Eggert, Daniel Barkalow, H. Peter Anvin

On the git mailing list, the question came up of how git should deal with tabs and line-feeds in path names. At one point in the discussion, Paul Eggert started to speculate on how to support systems using different character encodings for filenames; and Linus Torvalds replied:

Please don't. Use filenames as if they are just binary blobs of data, that's the only thing that has a high chance of success. Yes, it too can break in the presence of something _else_ doing character translation and/or people moving a patch from one encoding to another, but that's just true of anything.

Eventually everybody will hopefully use UTF-8, and nothing else really matters, but the thing is, if you see filenames as just blobs of data, that works with UTF-8 too, so it's not "wrong" even in the long run. And until everybody has one single encoding, you simply won't be able to tell, and the likelihood that you'd screw up is pretty high.

The happy part of the "binary blob" approach is that users _understand_ it. People who actively use different encoding formats are (painfully) aware of conversions, and they may curse you for not doing the random encoding format of the day, but they will be able to handle it.

In contrast, if you start doing conversions, I guarantee you that people will _not_ be able to handle it when you do something strange - you've changed the data.

Personally, I'd like the normal C quoting the best. Leave space as-is, and quote TAB/NL as \t and \n respectively. It's pretty universally understood in programming circles even outside of C, and it's not like a very uncommon patch format like that really needs to be well-understood outside of those circles.

It also has a very obvious and ASCII-safe format for other characters (ie just the normal octal escapes: \377 etc.).

That said, I personally don't think it's necessarily even worth it. If somebody wants to use names with tabs and newlines, is he really going to work with diffs? Or is it just a driver error?

Paul Eggert replied, "Thanks for thinking those things through. I agree mostly, but there's still a technical problem, in that we have to decide what a "funny byte" is if we are using C-style quoting. For example, the simplest approach is to say a byte is funny if it is space, backslash, quote, an ASCII control character, or is non-ASCII. But this will cause perfectly-reasonable UTF-8 file names to be presented in git format using unreadable strings like "a\293\203\257b" or whatever." Linus replied:

I think the simplest question to ask is "what are we protecting against?"

There's only two characters that are _really_ special to diff itself: \n and \t. The former is obvious, the latter just because the regular gnu diff format puts a tab between the name and the date (and if you _knew_ the date was always there you could just work backwards, but since not all diffs even put a date, \t ends up being special in practice).

So what else would you want to protect against? I hope not 8-bit cleanness: if some stupid protocol still isn't 8-bit clean, it should be fixed.

And \0 is already impossible, at least on sane systems.

So arguably you don't need to quote anything else than \n and \t (and that obviously means you have to quote \ itself). That means that any filename always shows "sanely" in its own byte locale, and everything is readable, regardless of whether it's UTF-8 or just plain byte-encoded Latin1, or anything else.

So I don't think you should quote invalid UTF-8: it's invalid UTF-8 whether it is quoted or not.

Paul said, "we don't have to come up with something that's perfect in all cases, just something that's good enough to handle cases that we expect will be common in practice, in a world where UTF-8 is the preferred encoding for non-ASCII characters." And Linus replied:

The thing is, I can almost guarantee you that any quoting in the high characters is going to be _worse_ than no quoting at all.

Exactly because quoting as UTF-8 is the wrong thing when it isn't actually UTF-8, and quoting as non-UTF-8 is the wrong thing when it _is_.

Not quoting at all, on the other hand, is unambiguous. If you have a mailer that corrupts your text stream (which-ever type it is), then it's clearly the mailer's problem. The _mailer_ at least has a chance in hell to know what character set it is getting mailed as.

The other alternative is to quote _everything_ non-ASCII. That's definitely reliable, but it's also unquestionably ugly as hell, especially in the long run.

Yes, there are some complex quoting approaches you can do, which quote things "correctly" (ie at a byte stream level) _and_ keep it valid UTF-8 at the same time.

For example, you can read it as a UTF-8 stream, but then quote things at a byte level (ie if you quote one "character", you quote _all_ bytes in that character). And you quote if:

but quite frankly, that's a pretty painful thing to write. The upside is that it's easy to decode: you can _unquote_ it just as a byte stream.

Paul raised an eyebrow at the second item in that list, saying, "Why quote the raw bytes? Is this for terminal escapes on older xterm (or xterm-like) implementations that don't understand UTF-8? If so, I'm not sure I'd bother, as it would introduce a lot of annoying quoting with perfectly reasonable UTF-8, and (if we assume the world is moving to UTF-8) it addresses a problem that is going away." Linus said:

UTF-8 is only _now_ getting really widespread, and I think it's because RedHat bit the bullet and made UTF-8 the default locale a few years ago.

These things take _decades_.

I don't know if you realize it, but it's only within the last couple of years that the old 7-bit "finnish ASCII" went away. Finnish and Swedish have three extra characters: åäö (latin1) and Ã¥Ã¤Ã¶ (utf-8). But only within the last few years has the really _old_ ASCII representation really gone away so much that I don't see it at all (the characters '{' '}' and '|' were taken over, so that if you had a Finnish ASCII font, programming in C was really funky - but it was common enough that I could do it without thinking much about it ;)

So lots of people still use the byte-wide encodings. Whether really old ASCII only or some special locale-dependent one (of which latin1 and the "win-latin1" thing are obviously the most common by far). And in that locale, it's not the UTF-8 control characters that matter, it's the _byte_ control characters that do.

So if you want to support any other locale than UTF-8, you need to escape them. Assuming you want to escape control characters at all, of course (I still think it's perfectly fine to just let the raw mess through and depend on escaping at higher levels).

Daniel Barkalow remarked, "I think it's actually sufficient to escape 0x00-0x1f and 0x7f; those ranges are both easy and, as far as I can tell, include all of the control characters that do annoying things. I think escape, backspace, delete, and bell are the only ones we'd rather the terminal not get; beyond that, patches with screwy filenames look screwy, but don't screw up anything outside of the filename." Linus agreed that 0x00-0x1f and 0x7f would be easy; but he said those didn't cover all annoying cases. He said:

The traditional vt100 escape sequence is "ESC" followed by a character to indicate the type of sequence (the most common one is '['). That's all 7-bit and fine.

HOWEVER, they made the 8-bit extension be such that any of these vt100 begin sequences where the second character is in the appropriate range can be instead shortened by one character, by instead using a single 8-bit character of "0x80+(char-0x40)". Ie the traditional "ESC + '['" (\x1b\x5b) can also be written as a single '\x9b' character, aka CSI.

In other words, 0x80-0x9f are _all_ just vt100 shorthand for ESC+'@' through ESC+'_'.

(I guess it's not strictly "vt100" any more - it's the extended vt220 format).

H. Peter Anvin added:

Actually, it's even trickier than that.

CSI is character 0x1b of control code set C1; there are two "windows" for control codes -- CL (0x00-0x1f) and CR (0x80-0x9f). Normally CL is mapped to C0 and CR is mapped to C1, but ESC will temporarily map C1 into CL.

VT1xx didn't support this since they didn't support 8-bit anything.

Anyway, a *lot* of character sets -- not just UTF-8 -- use the CR range of bytes for printables.

4. git On OpenBSD

Mon Oct 10  - Thu Oct 13  (17 posts) Archive Link: "openbsd version?"

People: Linus Torvalds

On the git mailing list, Randal L. Schwartz noticed that the git Makefile listed OpenBSD as a supported platform; however, when he tried to compile for that target he got errors. Linus Torvalds said he should use make NO_STRCASESTR=1, "or add that explicitly to the makefile in the OpenBSD rules and send Junio a tested patch ;)" Junio C. Hamano said this had been fixed. Randal tried it, and successfully got git working under OpenBSD. He also volunteered to help write some git documentation.

5. man-pages Version 2.08 Released

Wed Oct 12  - Thu Oct 13  (4 posts) Archive Link: "man-pages-2.08 is released"

People: Michael Kerrisk, Jesse Barnes

Michael Kerrisk said:

I recently released man-pages-2.08, which contains sections 2, 3, 4, 5, and 7 of the manual pages. These sections describe the following:

2: (Linux) system calls
3: (libc) library functions
4: Devices
5: File formats and protocols
7: Overview pages, conventions, etc.

As far as this list is concerned the most relevant parts are: all of sections 2 and 4, which describe kernel-userland interfaces; in section 5, the proc(5) manual page, which attempts (it's always catching up) to be a comprehensive description of /proc; and various pages in section 7, some of which are overview pages of kernel features (e.g., networking protocols).

This is a request to kernel developers: if you make a change to a kernel-userland interface, or observe a discrepancy between the manual pages and reality, would you please send me (at [email protected] ) one of the following (in decreasing order of preference):

  1. An in-line "diff -u" patch with text changes for the corresponding manual page. (The most up-to-date version of the manual pages can always be found at or
  2. An email describing the changes, which I can then integrate into the appropriate manual page.
  3. A message alerting me that some part of the manual pages does not correspond to reality. Eventually, I will try to remedy the situation.

Obviously, as we get further down this list, more of my time is required, and things may go slower, especially when the changes concern some part of the kernel that I am ignorant about and I can't find someone to assist.

To give an idea of the kinds of things that are desired as manual page additions/improvements, I've shown extracts from the man-pages-2.08 Changelog below.

Elsewhere, he also added, "the greatest part of credit must go to Andries, the maintainer for nearly 10 years. I'm shortly coming up to my first anniversary..."

Jesse Barnes replied, "Would it make sense for some of the man pages (or maybe all of them) that correspond directly to kernel interfaces (e.g. syscalls, procfs & sysfs descriptions) to be bundled directly with the kernel? Andrew is generally pretty good about asking people to update the stuff in Documentation/ when necessary, so maybe the man pages would be kept more up to date if developers were forced to deal with them more directly." Michael replied, "Recently, I was just wondering the same thing. However, there are complexities to consider. C libraries (okay, glibc is the main one I concern myself with) sometimes add some functionality in the wrapper function for a particular system call. This also needs to be documented in the Section 2 page." But he added, "Nevertheless, I think the idea of binding the kernel sources and Sections 2 and 4 of the manual pages a bit more tightly bears some consideration. In the ideal world, when a change is made to the kernel, the patch could include adjustments to the man pages (if relevant) -- then the changes could follow the patch through the -mm tree and then into Linus's tree."

6. Support For Sharp SL-5500 Touchscreen

Thu Oct 13  (1 post) Archive Link: "Sharp sl-5500 touchscreen support"

People: Pavel Machek

Pavel Machek posted a patch, saying, "This adds support for sharp zaurus sl-5500 touchscreen. It introduces some not-too-nice ifs, but I guess copying whole ucb1x00-ts.c would be bad idea..."

7. Support For Sharp SL-5500's PCMCIA Slot

Thu Oct 13  (1 post) Archive Link: "Support pcmcia slot on sharp sl-5500"

People: Pavel Machek

Pavel Machek posted a patch, saying, "This adds support for pcmcia slot on sharp zaurus sl-5500. pxa2xx_sharpsl.c thus becomes quite mis-named, but I guess that is not worth fixing"

8. Support For SGI Atomic Memory

Fri Oct 14  - Tue Oct 18  (23 posts) Archive Link: "[Patch 0/3] SGI Altix and ia64 special memory support."

People: Robin Holt

Robin Holt said:

SGI hardware supports a special type of memory called fetchop or atomic memory. This memory does atomic operations at the memory controller instead of using the processor.

This patch set introduces a driver so user land can map the devices and fault pages of the appropriate type. Pages are inserted on first touch. The reason for that was hashed out earlier on the lists, but can be distilled to node locality, node resource limitation, and application performance.

Since a typical ia64 uncached page does not have a page struct backing it, we first modify do_no_page to handle a new return type of NOPAGE_FAULTED. This indicates to the nopage handler that the desired operation is complete and should be treated as a minor fault. This is a result of a discussion which Jes Sorenson started on the ia64 mailing list and Christoph Hellwig carried over to the linux-mm mailing list.

The second patch introduces the mspec driver.

I am reposting these today. The last version went out in a rush last night and I did not take the time to notify the people that were part of the earlier discussion.

Additionally, the version which Jes posted last April was using remap_pfn_range(). This version uses set_pte(). I realize that is probably the wrong thing to do. Unfortunately, we need this to be thread-safe. With remap_pfn_range() there is a BUG_ON(!pte_none(*pte)); in remap_pte_range() which would trip if there were multiple threads faulting at the same time. To work around that, I started looking at breaking remap_pfn_range() into an _remap_pfn_range() which assumed the locks were already held. At that point, it became apparent I was stretching the use of remap_pfn_range beyond its original intent. For this driver, we are inserting a single pte, the page tables have already been put in place by the caller's chain, why not just insert the pte directly. That is what I finally did.

9. git 0.99.8e Released

Sat Oct 15  - Mon Oct 17  (7 posts) Archive Link: "GIT 0.99.8d"

People: Junio C. Hamano

Junio C. Hamano announced:

GIT 0.99.8d is available as usual at:

RPMs and tarball:
Debs and tarball:

In addition to accumulated bugfixes, there is one important futureproofing change.

The "master" branch has changes to git-upload-pack (which would affect what git-fetch-pack/git-clone-pack see) and git-update-server-info (which would affect what fetch and clone over http:// transport see) to send extra information about the available references, so that the clients can find out what objects are referenced by remote tags before downloading them. They take the form of "tagname^{}". The "git ls-remote $repository" command would show something like this:

    7a3ca7d2b5ec31b2cfa594b961d77e68075e33c7        refs/heads/master
    5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c        refs/tags/v2.6.11-tree
    c39ae07f393806ccf406ef966e9a15afc43cc36a        refs/tags/v2.6.11-tree^{}
    c2bbf523f1d454649897b3e4bcd71778e4fa5913        refs/tags/v2.6.14-rc2
    676d55ae30ea3b688f0386f70553489f25f24d55        refs/tags/v2.6.14-rc2^{}
    f92737b18abac90af30ac26a050fda879c9b238b        refs/tags/v2.6.14-rc3
    1c9426e8a59461688bb451e006456987b198e4c0        refs/tags/v2.6.14-rc3^{}

when the server side updates to the version in the "master" branch. These "^{}" entries describe the SHA1 of the object the tag object points at (so v2.6.11-tree tag, whose object name is 5dc01c... points at a tree object whose object name is c39ae0...).

The downloading clients (git-clone and git-fetch) in the "master" branch have been taught to recognize these entries; after all, these are not real refs and you cannot give them to git-http-fetch to fetch from. GIT 0.99.8d clients have the same change, so that people staying with the maintenance branch can download from the server that already runs the "master" version and sends these fake references without getting confused.

upload-pack and update-server-info in GIT 0.99.8d would not show these extra "fake refs" when used on the server side. In other words, 0.99.8d is to keep the maintenance branch working with newer servers.

There will be GIT 0.99.8e at around the time "master" branch will get the updated "git-diff-*", for similar purposes. The updated "git-diff-*" commands deal with pathnames with funny characters (most importantly tabs and newlines) in a way compatible with the proposed change to GNU patch, which was outlined in:

The change to "git-diff-*", and corresponding change to "git-apply" are cooking in the proposed updates branch right now. When people start generating diffs with them, patches that touch paths that have double-quotes '"' or spaces ' ' in them need to be applied with the updated git-apply that knows how new "git-diff-*" encodes these funny pathnames. GIT 0.99.8e is planned to backport the necessary git-apply changes, in case we do not bump the major release number by then.

Later, Junio said:

GIT 0.99.8e is available as usual at:

RPMs and tarball:
Debs and tarball:

The "master" branch has updated "git-diff-*" commands, that deal with pathnames with funny characters (most importantly tabs and newlines) in a way compatible with the proposed change to GNU patch, which was outlined in:

When people start generating diffs with them, patches that touch paths that have double-quotes '"' or spaces ' ' in them need to be applied with the updated git-apply that knows how new "git-diff-*" encodes these funny pathnames. GIT 0.99.8e contains the necessary backport of the git-apply changes.

This will hopefully be the last 0.99.8 maintenance release.

10. Load Issues With gitweb.cgi On

Mon Oct 17  - Wed Oct 19  (15 posts) Archive Link: "gitweb.cgi"

People: H. Peter AnvinBrian Gerst

On the git mailing list, H. Peter Anvin said to Kay Sievers:

It is increasingly clear that gitweb.cgi is producing an unacceptable load on the servers. Most of the hits we get are either the gitweb front page or the gitweb rss feeds, and it's eating I/O bandwidth like crazy.

This has become particularly painful during the current one-server outage.

Kay, gitweb really needs to be able to do caching, or be run behind a caching proxy. Otherwise I will have to turn it off until we can come up with a dedicated piece of server hardware for it.

Kay suggested Apache's mod_cache, and Peter replied, "I set up mod_cache (which I didn't know about, silly me) and so far it seems to work and has produced a tremendous decrease in load and improvement in response time. I do have, however, a request. There are some gitweb pages which are more likely to change than others; in particular, some gitweb pages will *never* change (because they directly reflect immutable git data.) If gitweb could produce Last-Modified and Expires headers where appropriate, it should improve caching performance." Kay did this, and Brian Gerst added, "Some other areas for improvement would be to separate out the git icon and the style sheet into separate static files."

11. git 0.99.8f Released

Wed Oct 19  (1 post) Archive Link: "GIT 0.99.8f"

People: Junio C. Hamano

Junio C. Hamano announced:

GIT 0.99.8f is available as usual at:

RPMs and tarball:
Debs and tarball:

Sorry, I said 0.99.8e was going to be the last 0.99.8 maintenance release, but it turns out that there was a flurry of updates to git-daemon and rev-list (which matters to gitweb) yesterday. So here it is.

Now, this _is_ going to be the last 0.99.8 maintenance release, I promise ;-).

12. Denial Of Service Attacks Against The git Protocol

Wed Oct 19  - Thu Oct 20  (12 posts) Subject: "The git protocol and DoS"

People: H. Peter AnvinJunio C. HamanoLinus Torvalds

On the git mailing list, H. Peter Anvin said:

I've been concerned for a while that the git protocol may be inherently vulnerable to a "SYNful DoS" attack (spraying raw TCP SYN packets with enough data to start substantial server activity.) Although SYN cookies protect against this to some degree, it makes me wonder if something should be added to the protocol itself.

One way to do this would be to start the transaction by having the server transmit a cookie to the client, and to require the client to send a SHA1 of the (cookie + request) together with the request. This would be done with a fairly short timeout.

It would, however, require a protocol change; I would like to hear what people think about this at this stage.

Junio C. Hamano replied:

Well, it is full two days since a majorly visible git protocol enabled server has been announced, and you probably know what kind of hits you are getting (and please let us know if you have numbers, I am curious). If we do a protocol change, the earlier the better. You already said that git is experimental. Does anybody run git daemons and rely on the current protocol?

I suspect it would not make *any* sense to have a backward compatible server that optionally allows this cookie exchange -- attackers can just say "I am an older client". OTOH, it probably makes sense to have an option on the client side to skip the cookie exchange stage. I do not think autodetecting new/old server on the client side in connect.c is possible.

They started to discuss the various possibilities, when Linus Torvalds broke in, with:

Hey guys, I actually planned for the protocol to be extensible.

The client always starts out by sending the "command" first, and if you want to add a challenge-response thing, I really think you should make it a nice compatible upgrade (and then later on, you can have a server option that says "if the client doesn't do the challenge-response version, I won't talk to him").

Basically, right now the client sends a

"git-upload-pack /absolute/pathname/to/repo"

over the protocol, and the whole point of this was that (a) it's extensible and (b) the server knows what to expect, and can close the socket if it doesn't get a valid packet.

So if you add some extra challenge-response thing, please just do so by changing the string. Teach the server to also accept

"git-upload-pack --challenge /absolute/pathname/to/repo"

for example. Then later, add a "secure server" mode that refuses to do the old non-challenge response.

HOWEVER. The server _already_ has some of this logic: if you start it outside of inetd, it will start killing its own children when there are too many of them, but it will start by sending them a SIGTERM. And the git-daemon code is set up so that a SIGTERM will kill any daemon that hasn't seen the proper handshake yet.

Once it's seen the proper handshake, the daemon will block SIGTERM. Exactly so that if there is a SYN attack, people who use a non-git-aware SYN generator will be second-class citizens. So there's not a real challenge-response thing, but at least it's set up so that real git clients (or something that looks like one) can be recognized, and get preferred treatment over people who just open a connection.

Of course, this part doesn't work with the setup, since that uses inetd, but we could easily add a timeout too, and do the same exact thing for SIGALRM (and just do an "alarm(timeout)" at the head of "execute()" before we start really trying to read from the socket).

In other words, git-daemon _already_ has support to help fight SYN attacks, although it currently only works when stand-alone. It could be extended to work with inetd, though.

NOTE! Right now, a git-aware SYN-flooder could send a SYN + "git-upload-pack /valid/directory" thing in the proper packed-line format, and _then_ just go away. But once you're talking to a git-aware SYN-flooder, I don't think a challenge-response makes it any better, since a git-aware SYN-flooder would just be written to give the right response.

So unless you actually have _passwords_, and make the response something that the other end has to figure out some other way, I don't see what else we could do..

To this last point, Junio said, "I think Peter's point is that the one that can give the right response needs to read from the server to compute it, and at that point it is not a "SYN-flooder" anymore." And Peter said, "Right. It has been shown that requiring some effort on the part of the client before the server spends work on it can greatly reduce the capabilities of a limited-resource client to execute a DoS."

13. Attempting To Revise The git Protocol

Wed Oct 19  - Fri Oct 21  (10 posts) Archive Link: "Revamping the git protocol"

People: H. Peter AnvinLinus Torvalds

On the git mailing list, H. Peter Anvin said:

Okay, so I've started thinking about what it would take to revamp the git protocol. What I came up with seems a little complex, but all it really does is take the framework that most successful Internet protocols have used and apply it to git.

Something else that I've noticed is that there is functionality overlap between git-daemon and git-send-pack, such as the namespace management (DWIM functionality.) Additionally, even when using git over ssh there is the potential for version skew, so it might be worthwhile to run the full protocol over ssh as well.

Anyway, here is a strawman. Items I feel unsure about I've put in brackets.


1. "Strings" are sequences of bytes prefixed with a length. The length is encoded as four lower-case hexadecimal digits. [Why not as 2 or 4 bytes of network byte order binary?] When represented in this text as "foo", this means the sequence of bytes on the wire is <0003foo>.

2. Upon connection, the server will issue a sequence of strings, terminated by a null string. The first string will be of the format:

"git <x.y>[ <hostname>]"

x.y is protocol revision (currently 1.0) with the following semantics:

For protocol version 1.0, subsequent strings are of the form:

"<R|O|I> option[ <parameters...>]"

... where the letter indicates REQUIRED, OPTIONAL or INFORMATIVE. If a server specifies a REQUIRED option which the client does not understand or support, the CLIENT should terminate with an "unable" command (see below). An OPTIONAL option is available to the client should it choose to accept it. An INFORMATIVE option has no protocol function, but may be used to tune the client, inform the client of server policies (such as timeouts) or display to the end user if the client is in verbose mode.

Note that the addition of options does not require a new protocol revision. It is generally believed that the protocol revision will rarely, if ever, be changed.

2a. Option "challenge":

"R challenge <seed>"

... where 'seed' is any sequence of bytes means that the client should compute the SHA-1 of the seed and issue a "response" command with the SHA1 in hexadecimal form before issuing any other command.
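The client-side computation is just a SHA-1 over the raw seed bytes, returned in hex form; a sketch (the function name is illustrative):

```python
import hashlib

def challenge_response(seed: bytes) -> str:
    """Compute the hex SHA-1 of the server's challenge seed, as the
    client would send back in a "response" command."""
    return hashlib.sha1(seed).hexdigest()
```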

3. After receiving the list of options, the client can issue commands. Commands are strings beginning with a command, one space, and any arguments as appropriate to the command.

4. The response to a command is a string beginning with a dot-separated sequence of numbers, one space, and an optional human-readable text string. Each part of the dot-separated sequence refines the response; if a client receives a code such as "3.1.1 foo" and doesn't know what it is, but knows what a "3.1" response is, it should treat the response as a 3.1 response.

If the server is closing the connection, the response is prefixed with the letter 'C':

"C5.0.1 Incorrect response"

Future versions of the protocol might define new prefix letters; a client that encounters an unknown prefix letter should ignore it.

2       - successful completion, closing connection
3       - successful initiation, begin transaction
4       - transient error
4.1     - server resource exhaustion errors
4.1.1   - load too high
5       - permanent error
5.1     - protocol errors
5.2     - authentication error
5.2.1   - invalid response to challenge option
5.3     - permission errors
5.3.1   - repository access denied
5.4     - data integrity error
5.4.1   - invalid or corrupt repository
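The prefix-refinement rule in item 4 means a client can resolve any code against the set of codes it understands by trimming components from the right; a hypothetical sketch:

```python
def best_match(code, known):
    """Resolve a dot-separated response code against the set of
    codes a client understands, falling back prefix by prefix:
    an unknown "3.1.1" is treated as "3.1" if the client knows
    that one, or as "3" failing that."""
    parts = code.split(".")
    while parts:
        candidate = ".".join(parts)
        if candidate in known:
            return candidate
        parts.pop()  # drop the most specific component and retry
    return None      # no known prefix at all
```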

5. Commands, and their responses:

"response <sha1>"

... response to a "challenge" option. Responses:

"2.0 OK" - response accepted
"C5.2.1 Invalid response" - invalid response

"unable <human error message>"

... error message from the client to the server due to an unsupported R option. Sending this message can inform the server administrator of version skew problems.


"C5.1.1 Too bad"

"send-pack <path>"

... begin synchronization of the repository at <path>. Responses:

"3.1.1 Begin"
Any 4.1 response
Any 5.3 or 5.4 response

Clearly this needs to be fleshed out a bit more... is this total insanity on my part, or is this something worth doing?

Some discussion started up, with Petr Baudis (Cogito maintainer) offering his own suggestions; but Linus Torvalds said:

Neither your nor Petr's scheme seems to be at all worried about backwards compatibility, and I just don't see _why_.

Even if you can upgrade all servers (there aren't that many of them), why force a client upgrade when the protocol is designed to be extensible?

Especially for something that doesn't even _buy_ you anything right now.

In fact, I'm not even sure it buys you anything in the future. The thing is, SYN-flooding depends on overwhelming you with lots of simple packets. And since in the git protocol, the expense is not in the _packets_ but in the server-side packing and data transfer, I don't see the point.

If you want to DoS a git pack server, you open a hundred _real_ git connections to it, carefully selected so that they get unique packs (so that the server can't cache them). You don't need to have some distributed denial-of-service attack with lots of magic packets.

This is why the git daemon already limits the clients to 25 by default or something like that - it doesn't want to put too much strain on the server.

A much more important thing the git daemon could do is to kill connections from the same IP address when there's more than 25 pending ones. The daemon actually has the infrastructure for that - it's why it doesn't just count its children, it actually saves child information away (it just doesn't _use_ it for anything right now).

Similarly, git-upload-pack can be future-proofed by having it have some data transfer timeout: if it doesn't make any progress at all in <n> seconds, just kill itself. Things like _that_ are likely to be a lot more important, I suspect.

And no, I don't think the git protocol should do authentication. It's hard. If you want to do authentication, you need to do encryption too, and then you should do something else (but the git protocol _does_ work fine over an encrypted channel, so the "something else" might be to have some secure web interface tunnel protocol or similar, and then just support "git over https" or something ;).

14. ktimers Subsystem Update

Fri Oct 21  (1 post) Archive Link: "[ANNOUNCE] [PATCH] ktimers subsystem, reworked"

People: Thomas Gleixner

Thomas Gleixner said:

This is a new, much-cleaned up version of the ktimers subsystem.

We reworked the patch thoroughly and we hope to have addressed all points raised on lkml. Special thanks go to Andrew Morton and Arjan van de Ven for detailed code-review.

The new patch can be downloaded from:



The high-resolution timer combo patch, including John Stultz's generic time-of-day work and the clockevents framework, is available too, along with the broken-out version.

The text below is from Documentation/ktimers.txt, which will hopefully clarify most of the remaining conceptual issues raised on lkml. Comments, reviews, reports welcome!

ktimers - subsystem for high-precision kernel timers

This patch introduces a new subsystem for high-precision kernel timers.

Why two timer subsystems? After a lot of back and forth trying to integrate high-precision and high-resolution features into the existing timer framework, and after testing various such high-resolution timer implementations in practice, we came to the conclusion that the timer wheel code is fundamentally not suitable for such an approach. We initially didn't believe this ('there must be a way to solve this'), and we spent a considerable effort trying to integrate things into the timer wheel, but we failed. There are several reasons why such integration is impossible:

The primary users of precision timers are user-space applications that utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel users like drivers and subsystems with a requirement for precise timed events can benefit from the availability of a separate high-precision timer subsystem as well.

The ktimer subsystem is easily extended with high-resolution capabilities, and patches for that exist and are maturing quickly. The increasing demand for realtime and multimedia applications along with other potential users for precise timers gives another reason to separate the "timeout" and "precise timer" subsystems.

Another potential benefit is that such separation allows for future optimizations of the existing timer wheel implementation for the low resolution and low precision use cases - once the precision-sensitive APIs are separated from the timer wheel and are migrated over to ktimers. E.g. we could decrease the frequency of the timeout subsystem from 250 Hz to 100 Hz (or even lower).

ktimer subsystem implementation details

The basic design considerations were:

From our previous experience with various approaches of high-resolution timers another basic requirement was the immediate enqueueing and ordering of timers at activation time. After looking at several possible solutions such as radix trees and hashes, the red-black tree was chosen as the basic data structure. Rbtrees are available as a library in the kernel and are used in various performance-critical areas of e.g. memory management and file systems. The rbtree is solely used for the time sorted ordering, while a separate list is used to give the expiry code fast access to the queued timers, without having to walk the rbtree. (This separate list is also useful for high-resolution timers where we need separate pending and expired queues while keeping the time-order intact.)
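The "enqueue in time order, expire from the head" idea can be sketched in a few lines of Python (a toy model only, not kernel code: a bisect-sorted list stands in for the rbtree plus its companion list, and names like `KtimerQueue` are invented for illustration):

```python
import bisect
import itertools

class KtimerQueue:
    """Toy model of a per-CPU timer base: timers are sorted by
    expiry at enqueue time, so the expiry code only ever inspects
    the head of the queue."""

    def __init__(self):
        self._timers = []              # (expiry_ns, seq, callback)
        self._seq = itertools.count()  # tie-breaker for equal expiries

    def enqueue(self, expires_ns, callback):
        # Ordered insertion at activation time, as ktimers do.
        bisect.insort(self._timers, (expires_ns, next(self._seq), callback))

    def run_expired(self, now_ns):
        # Fire every timer whose expiry has passed, in time order.
        while self._timers and self._timers[0][0] <= now_ns:
            _, _, cb = self._timers.pop(0)
            cb()
```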

The time-ordered enqueueing is not purely for the purposes of the high-resolution timers extension though, it also simplifies the handling of absolute timers based on CLOCK_REALTIME. The existing implementation needed to keep an extra list of all armed absolute CLOCK_REALTIME timers along with complex locking. In case of settimeofday and NTP, all the timers (!) had to be dequeued, the time-changing code had to fix them up one by one, and all of them had to be enqueued again. The time-ordered enqueueing and the storage of the expiry time in absolute time units removes all this complex and poorly scaling code from the posix-timer implementation - the clock can simply be set without having to touch the rbtree. This also makes the handling of posix-timers simpler in general.
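Why absolute-time storage makes settimeofday cheap can be shown with a toy model (hypothetical names; this treats CLOCK_REALTIME as a monotonic count plus a settable offset, which is a simplification):

```python
NSEC_PER_SEC = 1_000_000_000

class RealtimeBase:
    """Toy model: realtime = monotonic + offset. Because timers
    store their expiry in absolute realtime units, setting the
    clock only changes the offset - the queue is never touched."""

    def __init__(self):
        self.offset_ns = 0
        self.timers = []  # absolute realtime expiry values, in ns

    def realtime(self, mono_ns):
        return mono_ns + self.offset_ns

    def settimeofday(self, new_realtime_ns, mono_ns):
        # No dequeue/fix-up/re-enqueue pass over the timers.
        self.offset_ns = new_realtime_ns - mono_ns

    def expired(self, mono_ns):
        now = self.realtime(mono_ns)
        return [t for t in self.timers if t <= now]
```

Jumping the clock forward simply makes previously pending absolute timers expire on the next check, with no per-timer work at set time.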

The locking and per-CPU behavior of ktimers was mostly taken from the existing timer wheel code, as it is mature and well suited. Sharing code was not really a win, due to the different data structures. Also, the ktimer functions now have clearer behavior and clearer names - such as ktimer_try_to_cancel() and ktimer_cancel() [which are roughly equivalent to del_timer() and del_timer_sync()] - and there's no direct 1:1 mapping between them on the algorithmic level.

The internal representation of time values (ktime_t) is implemented via macros and inline functions, and can be switched between a "hybrid union" type and a plain "scalar" 64bit nanoseconds representation (at compile time). The hybrid union type exists to optimize time conversions on 32bit CPUs. This build-time-selectable ktime_t storage format was implemented to avoid the performance impact of 64-bit multiplications and divisions on 32bit CPUs. Such operations are frequently necessary to convert between the storage formats provided by kernel and userspace interfaces and the internal time format. (See include/linux/ktime.h for further details.)
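The conversion cost the hybrid union avoids is essentially the split and join between scalar nanoseconds and a (sec, nsec) pair; a sketch of that arithmetic (function names are illustrative, not the kernel's):

```python
NSEC_PER_SEC = 1_000_000_000

def ns_to_timespec(ns):
    """Split a scalar 64-bit nanosecond value into a (sec, nsec)
    pair - the 64-bit division that is expensive on 32-bit CPUs
    and that the hybrid union layout keeps off hot paths."""
    return divmod(ns, NSEC_PER_SEC)

def timespec_to_ns(sec, nsec):
    """Combine a (sec, nsec) pair back into scalar nanoseconds."""
    return sec * NSEC_PER_SEC + nsec
```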

We used the high-resolution timer subsystem on top of ktimers to verify the ktimer implementation details in practice, and we also ran the posix timer tests in order to ensure specification compliance.

The ktimer patch converts the following kernel functionality to use ktimers:

The conversion of nanosleep and posix-timers enabled the unification of nanosleep and clock_nanosleep.

The code was successfully compiled for the following platforms:

i386, x86_64, ARM, PPC, PPC64, IA64

The code was run-tested on the following platforms:

i386(UP/SMP), x86_64(UP/SMP), ARM, PPC

ktimers were also integrated into the -rt tree, along with a ktimers-based high-resolution timer implementation, so the ktimers code got a healthy amount of testing and use in practice.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.