Wednesday, August 1, 2012

Sorting improvements in PostgreSQL 9.2: the case for micro-optimisation

There has been much discussion of performance improvements in the upcoming 9.2 release of PostgreSQL. Recently, I noticed that Regina Obe and Leo Hsu's new book, "PostgreSQL: Up and Running", prominently listed "Sorting improvements that improve in-memory sorting operations by as much as 20%" as a performance feature of that release. While they get things about right there, I'm not sure that this improvement warrants such prominent placement, at least in sheer terms of its likely impact on the performance of production PostgreSQL systems - we packed a lot of great performance improvements into 9.2. The likely reason that it was picked up on in the book, and the real reason for this blogpost, is the story behind the development of the optimisation, which I for one find interesting and worth sharing. It's more interesting from the perspective of someone with a general interest in systems programming or PostgreSQL's design philosophy than from that of a casual user, though. If you're a casual user, the short version is that simple queries that perform in-memory sorting of integers and floats will be about 23% faster.


I wrote a rough prototype of the patch that contained a number of ideas, and it proved the viability of the approach. Principal among those ideas was specialisation of the quicksort code: structuring the code so that the compiler has compile-time knowledge of the comparator functions, with inlining used as an enabling optimisation, and with a few variant sort routines produced. So rather than using complex indirection involving function pointers, a macro infrastructure was used to generate multiple specialisations, allowing the compiler to optimise the code more effectively as a result of being able to integrate everything.
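
To give a feel for the technique, here is a minimal, standalone sketch of the idea - it isn't the actual tuplesort code, and it templates a toy insertion sort rather than quicksort for the sake of brevity. A single macro stamps out multiple sort routines, each with a comparator that the compiler can see at compile time and therefore inline:

#include <stddef.h>

/* Template: each expansion produces a sort routine whose comparator is
 * known at compile time, so the compiler is free to inline it. */
#define DEFINE_SORT_SPECIALISATION(NAME, TYPE, CMP)          \
    static void NAME(TYPE *a, size_t n)                      \
    {                                                         \
        for (size_t i = 1; i < n; i++)                        \
        {                                                     \
            TYPE v = a[i];                                    \
            size_t j = i;                                     \
            while (j > 0 && CMP(a[j - 1], v) > 0)             \
            {                                                 \
                a[j] = a[j - 1];                              \
                j--;                                          \
            }                                                 \
            a[j] = v;                                         \
        }                                                     \
    }

/* Comparators that are visible to the compiler at every call site */
#define INT_CMP(x, y)    (((x) > (y)) - ((x) < (y)))
#define DOUBLE_CMP(x, y) (((x) > (y)) - ((x) < (y)))

/* Two specialisations generated from the one template */
DEFINE_SORT_SPECIALISATION(sort_int_specialised, int, INT_CMP)
DEFINE_SORT_SPECIALISATION(sort_double_specialised, double, DOUBLE_CMP)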

A secondary problem was that comparators (i.e. the comparison functions that all sorting within Postgres currently needs) were accessed in a roundabout way.

Roughly speaking, tuplesort (the part of the code that deals with sorting tuples, perhaps as part of a query's execution, or perhaps as the first step in creating a new index) is very generic code. Code very similar to the C standard library's qsort() is used directly in 9.1 - our particular implementation was lifted from NetBSD, as it happens. This qsort function is passed a function pointer. That function pointer points to a "tupleclass encapsulating comparator" (i.e. the comparator differs for "heap" (table) tuples and index tuples). This comparator in turn calls the "comparator proper" for each and every sortkey (i.e. ORDER BY column) to be sorted. The indirection doesn't stop there, though.
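
In rough, illustrative C - the names and structures below are simplified stand-ins, not the real tuplesort.c code - the 9.1 arrangement looks something like this:

#include <stdlib.h>

typedef int (*comparator_proper) (int datum1, int datum2);

typedef struct SortKeySketch
{
    comparator_proper cmp;      /* "comparator proper" for this ORDER BY column */
    int         column;
} SortKeySketch;

typedef struct TupleSketch
{
    int         values[4];
} TupleSketch;

/* Global only because qsort()'s comparator takes no extra argument;
 * the real code threads this state through differently. */
static SortKeySketch *sort_keys;
static int  num_sort_keys;

/* "Tupleclass encapsulating comparator": reached through the function
 * pointer handed to qsort(), it calls the comparator proper for each
 * sort key in turn until one of them breaks the tie. */
static int
comparetup_heap_sketch(const void *a, const void *b)
{
    const TupleSketch *ta = a;
    const TupleSketch *tb = b;

    for (int i = 0; i < num_sort_keys; i++)
    {
        SortKeySketch *key = &sort_keys[i];
        int         cmp = key->cmp(ta->values[key->column],
                                   tb->values[key->column]);

        if (cmp != 0)
            return cmp;
    }
    return 0;
}

static void
sort_tuples(TupleSketch *tuples, size_t n)
{
    /* qsort() only sees an opaque element size and a function pointer,
     * so the compiler can inline nothing here. */
    qsort(tuples, n, sizeof(TupleSketch), comparetup_heap_sketch);
}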

Postgres comparators ("comparator propers" - the actual, datatype-specific functions that Postgres uses for sorting) can be called from SQL. While they are written in C in the case of all built-in types, in 9.1 we still accessed the comparators through the SQL function call machinery, rather than through a direct function call (or a call through a function pointer). This isn't ordinarily a big deal - we use the same machinery in some other performance-critical codepaths, such as the index access method code: PostgreSQL doesn't know anything about indexes other than what this "abstract interface" tells it, even when interacting with btree indexes. In this way, a module author can define a whole new index type (I don't just mean implementing a new indexing scheme based on GiST/GIN - I mean a whole new index type). Though it's not a big deal there, in a tight, frequently executed loop where the comparison being made is essentially just one or two CPU instructions, the overhead of this fmgr trampoline can start to matter a lot.

Tom Lane and Robert Haas fixed this problem. My work inspired their development of the SortSupport infrastructure as a first phase/patch towards realising the full potential of these ideas; this is essentially just a way of letting datatype authors provide alternative versions of their comparators that can be accessed more directly, through function pointers. Because Postgres has a long tradition of being highly extensible, it was necessary to generalise this entirely, so that even third-party module authors could use the infrastructure. It was likely to be immediately useful only for a few built-in datatypes like integers and floats, but it's hard to predict how people may choose to use this - there was some talk of novel applications of SortSupport.
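
Very roughly, the shape of the idea is something like the sketch below (simplified and standalone - the real SortSupport struct lives in sortsupport.h and has additional fields for things like collation, NULL ordering and per-sort state): the datatype's author fills in a plain C function pointer, which the sort code can then call without going anywhere near the SQL function call machinery.

#include <stdbool.h>

/* Simplified stand-in for the SortSupport struct; the real PostgreSQL
 * definition differs in detail. */
typedef struct SortSupportSketch
{
    bool        reverse;        /* sorting DESC? */
    /* direct C comparator supplied by the datatype's author */
    int         (*comparator) (long x, long y, struct SortSupportSketch *ssup);
} SortSupportSketch;

/* What an integer type's author might supply */
static int
int_cmp_sortsupport(long x, long y, SortSupportSketch *ssup)
{
    (void) ssup;                /* no per-sort state needed here */
    return (x > y) - (x < y);
}

/* The sort code can now compare two values with a single indirect
 * call, rather than a round trip through the fmgr trampoline. */
static inline int
apply_comparator_sketch(long x, long y, SortSupportSketch *ssup)
{
    int         cmp = ssup->comparator(x, y, ssup);

    return ssup->reverse ? -cmp : cmp;
}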

In the latter phase, we worked on producing the actual specialisations of the sort code. For heap tuples, we specialise on single sort keys and multiple sort keys, producing variant quicksort code for each. The "tupleclass encapsulating comparator" (the comparator for heap tuples in this case) is inlined. This is actually the more valuable of the two optimisations, counter-intuitive though that is.
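
Continuing with the same kind of illustrative sketch (again, invented names; the real specialisations are generated from a template rather than written out by hand), the single-sort-key variant drops the per-sortkey loop entirely, and the encapsulating comparator becomes an ordinary inline function baked into the body of the specialised sort routine:

#include <stddef.h>

typedef struct OneKeySortSketch
{
    int         (*cmp) (int x, int y);  /* comparator proper for the one key */
} OneKeySortSketch;

typedef struct HeapTupleSketch
{
    int         key;            /* the single sort column's value */
} HeapTupleSketch;

/* Takes the place of the generic encapsulating comparator: no loop
 * over sort keys, and nothing is reached through qsort()'s opaque
 * function-pointer interface. */
static inline int
comparetup_heap_onekey(const HeapTupleSketch *a, const HeapTupleSketch *b,
                       OneKeySortSketch *state)
{
    return state->cmp(a->key, b->key);
}

/* Specialised sort body (an insertion sort, for brevity) with the
 * encapsulating comparator inlined straight into its loop. */
static void
sort_heap_onekey(HeapTupleSketch *tuples, size_t n, OneKeySortSketch *state)
{
    for (size_t i = 1; i < n; i++)
    {
        HeapTupleSketch v = tuples[i];
        size_t      j = i;

        while (j > 0 && comparetup_heap_onekey(&tuples[j - 1], &v, state) > 0)
        {
            tuples[j] = tuples[j - 1];
            j--;
        }
        tuples[j] = v;
    }
}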

Why should simple inlining make such a large difference? Inlining certainly isn't just about eliminating the function call overhead (though there is that too); in fact, it's generally much more important as an enabling transformation. Inlining may enable loop-invariant code motion, dead code elimination, or induction variable elimination. As a broad principle, the more the compiler knows, the greater leeway it has to apply its optimisations. These optimisations are not to be sniffed at - for quicksorting, a simple, isolated test case can show sorting 2.5 - 3 times faster when comparators are inlined, rather than accessed through function pointers (this indirection is necessary to support the interface of the C standard library's qsort()).
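
If you want to see the effect in isolation, a standalone test along the following lines will usually show the specialised version winning by a wide margin. To be clear, this is only a sketch of such a test, not the program behind the 2.5 - 3 times figure; the array size, seed and quicksort details here are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 5000000

/* Comparator reached through a function pointer, as qsort() requires */
static int
cmp_int(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Specialised quicksort with the integer comparison written directly
 * into the code, where the compiler can see it. */
static void
quicksort_int(int *a, long lo, long hi)
{
    while (lo < hi)
    {
        int         pivot = a[(lo + hi) / 2];
        long        i = lo;
        long        j = hi;

        while (i <= j)
        {
            while (a[i] < pivot)
                i++;
            while (a[j] > pivot)
                j--;
            if (i <= j)
            {
                int         tmp = a[i];

                a[i] = a[j];
                a[j] = tmp;
                i++;
                j--;
            }
        }
        /* recurse into the smaller partition, loop on the larger */
        if (j - lo < hi - i)
        {
            quicksort_int(a, lo, j);
            lo = i;
        }
        else
        {
            quicksort_int(a, i, hi);
            hi = j;
        }
    }
}

int
main(void)
{
    int        *a = malloc(N * sizeof(int));
    int        *b = malloc(N * sizeof(int));
    clock_t     t;

    srand(1234);
    for (long i = 0; i < N; i++)
        a[i] = rand();
    memcpy(b, a, N * sizeof(int));

    t = clock();
    qsort(a, N, sizeof(int), cmp_int);
    printf("qsort, comparator through pointer: %.2fs\n",
           (double) (clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    quicksort_int(b, 0, N - 1);
    printf("specialised, comparison inlined:   %.2fs\n",
           (double) (clock() - t) / CLOCKS_PER_SEC);

    free(a);
    free(b);
    return 0;
}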

Just to see where things stand after all of this, I performed this benchmark, where we measure the transactions per second after 45 seconds of executing the same simple ORDER BY query as many times as possible within a single session. I do this both with and without client overhead to see how important a factor that is (basically, no tuples are actually returned when we don't measure the overhead, and yet the underlying execution costs are the same - we do this by appending "OFFSET 10001" to the query).
[peter@peterlaptop tests]$ # no client overhead, 9.1:
[peter@peterlaptop tests]$ pgbench -f sort_no.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 12159
tps = 270.180571 (including connections establishing)
tps = 270.205386 (excluding connections establishing)
[peter@peterlaptop tests]$ # no client overhead, 9.2:
[peter@peterlaptop tests]$ pgbench -f sort_no.sql -T 10
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 3687
tps = 368.683077 (including connections establishing)
tps = 368.890864 (excluding connections establishing)
[peter@peterlaptop tests]$ # client overhead, 9.1:
[peter@peterlaptop tests]$ pgbench -f sort_o.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 7304
tps = 162.301377 (including connections establishing)
tps = 162.316128 (excluding connections establishing)
[peter@peterlaptop tests]$ # client overhead, 9.2:
[peter@peterlaptop tests]$ pgbench -f sort_o.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 8977
tps = 199.470582 (including connections establishing)
tps = 199.494630 (excluding connections establishing)

Boiling that down further:

                      Postgres 9.1    Postgres 9.2    Difference
No client overhead    270.205 TPS     368.890 TPS     + 36.5%
Client overhead       162.316 TPS     199.494 TPS     + 22.9%

So that's an improvement in throughput of about 23% for this sympathetic, though representative, query.

To quote Linus Torvalds, "To some degree people say that you should not micro-optimise, but if you love micro-optimisation, that's what you should do". I don't know that I love micro-optimisation, but it can certainly be well worthwhile to perform re-jiggering, or specialisation, or lowering of cache miss rates, at least in the case of an important infrastructure project's innermost loops.

I was happy that I was ultimately able to overcome objections about the possible distributed costs of creating specialisations - doing this can result in "binary bloat". Each specialisation needs to independently justify the resulting increase in object file size, and it's often difficult to argue that the trade-off represented by any particular specialisation is worth it. There was a protracted discussion of the merits of doing all of this; that, I suspect, is the factor that made it stick out in people's minds.

6 comments:

  1. Why do you run the 9.2 no overhead test for only 10 seconds when all other tests are run for 45 seconds?

  2. I guess I ran that test first for 10 seconds, but then decided to run the other tests for 45 seconds. I editorialised for clarity, so this isn't apparent from the order you see the tests in. It's very unlikely to make any appreciable difference though, since a highly CPU-bound workload like this will see very consistent inter-run performance. We're interested in the TPS figure.

  3. Isn't inlining the reason why std::sort is faster in C++?

  4. Yes, it is, and as you might have guessed, I used to write C++ for a living. I think a better way to explain it is that unlike std::sort, qsort() throws away information about the underlying type being sorted in order to provide its generic interface, where one argument is the sizeof() of the elements being sorted.

  5. Why not use radix sort for integers, strings, dates, and other things that have a natural correspondence to integers?

  6. One of the proposed "novel uses" of the SortSupport infrastructure that I refer to was to support non-comparison based sorting, which could be worthwhile in some cases, but isn't possible to do today with the current infrastructure.

    I don't think that radix sort is going to be all that compelling for in-memory sorting of integers, and all of those types that you mention are compared as integers within Postgres. Furthermore, it isn't clear how we could continue to support things like "SELECT ... ORDER BY foo DESC NULLS FIRST" with a non-comparison-based sort. That handling happens within what I've called the "tupleclass encapsulating comparator" - as that comparator calls the "comparator proper" for each sort key, the returned values are interpreted through yet another inline function (even in 9.1) that takes the raw (-1, 0, 1) result and adjusts it so that those clauses are respected.
