Recently I found a YouTube video that propagates branchless programming. I’d heard of the concept before; this time I actually had enough time to watch it. If you don’t have time yourself; the author claims that branchless programming can significantly improve the performance of your code.

It’s an interesting claim, as I had mostly considered it a gimmick for programmers to brag about their skills of producing write-only code. Or perhaps this last statement just shows how little I understand programming myself anymore these days.

Here’s the video:

I particularly like how the author uses simple examples that anybody (including yours truly) can understand, even all the way down to the Assembly level.

The catch

Here’s the thing about the video: the author gives all examples in C code. It’s been forever since I did anything with that myself (and I admit that I do suffer from some C-envy).

Most of the stuff I did the last year was done in Python (Jupyter) or PHP (anything web-based).

So my question becomes: is branchless programming a universal concept that can be exploited to improve performance in all programming languages? Or does it only apply for die-hard, (almost) bare metal programming in languages like C (or perhaps Rust)?

Let’s find out

Branchless Python with Jupyter

You can’t get much more high-level than writing Python code in a Jupyter notebook. Well, you can, but then it’s no longer coding, but rather low-coding on something like Microsoft Power Platform.

So let’s start with the simplest example from the video:

int Smaller(int a, int b) {
    if (a < b) return a;
    else return b;
}

Is demonstrably slower than:

int Smaller_Branchless(int a, int b) {
    return a*(a<b)+b*(b<=a);
}

Firing up Jupyter and Python, we get:

def Smaller(a, b):
    if (a < b):
        return a
    else:
        return b

And

def Smaller_Branchless(a, b):
    return a*(a<b) + b*(b<a)

When we loop both functions 1 million times, the first one takes 247 milliseconds to complete; the second one takes 365 milliseconds. In other words: the branchless version takes 48% longer in computation time!

What if we added explicit type-casting to avoid the runtime to have to figure out on its own that we need a bool to int conversion?

def Smaller_Branchless_Typecast(a, b):
    return a*int(a<b) + b*int(b<a)

Any guesses? A million iterations now takes 519 milliseconds; or more than twice our original 247 measurementt. Explicit type-casting is clearly not a good idea.

Here’s a (Matplotlib, of course) visual to put things in perspective:

There’s one more thing I wanted to try: let’s just skip our own code altogether and rely on the (almost) standard numpy.min() function. Here’s the complete cell’s code:

import numpy as np

start4 = time()
for _ in range(1, 1000000):
    np.min([random(), random()])
stop4 = time()
elapsed4 = (stop4 - start4) * 1000

Well, now it’s just getting ridiculous: 6047 milliseconds for 1 million iterations. It makes sense to some extent: numpy.min() can do a lot more than just compare two numbers; it also does a lot more, whether you want to or not. One of the obvious differences is that we’re converting our two scalar variables into a list. Clearly not a good idea. But 24 times slower? Wow, that’s some harsh punishment right there!

What about PHP?

I got curious if the same trend occurs in other languages. What about PHP?

So I wrote the following simple functions:

// branched logic
function is_lower1($a, $b) {
    if ($a < $b) {
        return $a;
    } else {
        return $b;
    }
}

// branchless alternative
function is_lower2($a, $b) {
	return $a*($a<$b) + $b*($b<$a);
}

// branchless with bool-int typecasting
function is_lower3($a, $b) {
	return $a*intval($a<$b) + $b*intval($b<$a);
}

// directly invoke the PHP min() function
min(rand(), rand());

I embedded these into separate .php scripts that I could invoke through Microsoft’s IIS webserver. Right off the bat. As with Python, I invoked each function 1 millions times, But I noticed something interesting: refreshing a page could yield vastly different results. So instead of a simple vertical barchart, I’m presenting the results of 30 x 1 million runs here as a boxplot:

The results are a little more complicated, but nevertheless let us conclude the following:

  • Typecasting doesn’t really slow things down in PHP (in contrast with Python)
  • Branchless programming is not improving matters in PHP either (as is the case in Python)
  • The built-in min() function is the fastest performer this time around!

Python versus PHP

How do Python and PHP stack up against each other then? For that, we can take the means of the PHP boxplot and plot it against the Python observations:

Buyer beware

I’m old-school. I learned programming in GW-BASIC in 1989; I learned assembler somewhere in the mid-nineties, when Ralph Brown’s Interrupt List was still around and we would manually compute how many clock-cycli we would need to execute our (admittedly simple) assembler-scripts.

It irks me that occasionally I run into people that say “oh, programming is just knowing which libraries to use, or looking things up from Stack Overflow“. Add ChatGPT to that list. Yeah right.

The simple experiment in this article I think illustrates that it still matters to pay attention to what you write, how you write it, and first and foremost think about what you’re writing.

In all high-level (interpreted) languages like PHP and Python (the same exercise can be done for Java(Script), Ruby, Rust, Go, Martian…), there are oftentimes multiple ways to come to a solution.

It still pays then to try out the different scenarios then and see which one is fastest. The compiler will not solve all your problems for you.

Now go optimize something in your own code!

By yves

Leave a Reply

Your email address will not be published. Required fields are marked *