When “slower” doesn’t necessarily mean slower

Last week, when I had a bit of free time (well, as free as time gets when you’re looking for something interesting to do at 1 a.m.), I randomly glanced at a bug report somebody had filed saying that strstr was slow on Mac OS X.

This post isn’t about strstr performance, as interesting as that would be. Instead, it’s about how people (OK, programs) decide that things are slow.

The strstr bug report was filed because someone was trying to build a copy of procmail on Mac OS X. procmail has been around for a long time, and originally it ran on systems far less powerful than the slowest system supported by Mac OS X (a 233MHz iMac, for those keeping score at home). A long time ago (1994, actually, according to the procmail history file), the procmail developers apparently noticed that procmail’s performance was very dependent on the performance of the system’s strstr API. They sat down, wrote their own that they believed to be much faster, and all was good in the world.

They also added a test to their configure script to see if their strstr is faster than the one on the local system. If the local one is faster, they smartly use that one instead. It’s a nice bit of forward thinking — perhaps someday someone would write a better strstr, and they’d be ready.

The configure script prints out something like “Your system’s strstr is 1.20 times SLOWER than ours” or “Your system’s strstr is 1.60 times FASTER than ours”. The bug was filed because the person building procmail noticed that it printed out the “slower” line and figured that we probably needed a better strstr.

So far, so good. A quick glance shows that at least one other system found the procmail work to be useful — glibc’s strstr implementation is copied from procmail.

Wondering what about procmail’s implementation made it faster than Mac OS X’s strstr, I sat down to look at procmail’s configure test to see what it was doing. I figured it was probably doing a series of tests of strings with different lengths, including some search strings that don’t occur in the target at all, some that are at the beginning, some that are at the end, and so on.

Not so. procmail’s test builds a large block consisting of the letter ‘a’ over and over again with a few carriage returns thrown in, and then searches it for the string “From:\n”. It then tries that test a number of times with the system strstr and a single time with the procmail strstr.

Two interesting results jumped out here. First, the test is simply a single variant — a large target string and a short search string whose first letter doesn’t show up anywhere in the target string. That’s basically pointless for most strstr usage, including most of procmail’s typical usage. In fact, it means that about 90% of the code in the procmail strstr implementation isn’t even invoked as part of the test. Second, I was having trouble figuring out how the tests were run so I tried a little trick. I made a second (unmodified) copy of the procmail strstr implementation and modified the script to use that copy instead of the system strstr. The first copy won almost every time. Then I reversed the two, so my copy ran second and the first copy was used instead of the system strstr. Now my copy won almost every time.

In other words, the test doesn’t test anything useful, and even if the results were useful, they’re very dependent on the order in which the strstr implementations are run (at least on Mac OS X on my PowerBook). In other other words, the configure script doesn’t actually tell us which implementation is faster. All I could conclude from all of this was that the test was broken. From this test alone, it’s simply impossible to reach any conclusion about the performance of Mac OS X’s strstr…or of procmail’s strstr.

I guess the takeaway points are these: If you’re doing a test that compares the performance of two algorithms, be sure to use a data set that is at least vaguely representative of what the algorithms would see when run as part of your code. And make sure that your test doesn’t return different results when you change the order in which the algorithms are used. (I’d call that a “stable test”, like a stable sort, but I’m not sure if that’s a term already in use elsewhere.) And if you have a free minute and enjoy reading really complicated C code, go read the procmail strstr implementation. Performance aside, it’s pretty nifty.

1 Comment

  1. Louis Gerbarg Said,

    September 30, 2003 @ 8:46 am

    I have had a few discussions about strstr performance, but only ever in passing. The order issue your seeing probably has to do with cache state of the data. Of course testing the algorithm is tricky, since if you force the data set out of cache the speed will likely be dominated by the cache miss time (depending on how good the processors prefetch engine is). If your data set fits inside your L2 cache I recommend you write a loop to read through it a cacheline at a time to warm the cache, then run a testcase. You still will be susceptible to being contexted switched out during a test and loosing your cache, but if your test is relatively short (a fraction of the scheduling quanta) you should be able to run the test several times and throw out the outliers.

    Making a good routine that works well in all cases is often remarkably more difficult than making one test work well. I have more than once recommended against using a particular function that was supposed to be faster, because it had conditions underwhich it was slower, and the statistical mix showed those conditions were hit often enough that it was a net loss, even though the fast case was faster. In general my rule is to do it the simplest way you can, and if decide its a big enough deal to be clever you need to write a huge amount of performance and validation code. Often actually rebuild the part of the system that contains that routine with extra logging code just so I can record all the calls to it and perform analysis of how it is used. If your fiddling with a core routine and you don’t understand who is calling it and how, you will likely make bad assumptions.

    Yes I have had to do this sort of testing for this sort of routine. Does it show?

    Louis

RSS feed for comments on this post