ProjectPlan  
Plans for optimizing Python
Updated Dec 15, 2009 by collinw

NB: all cited papers are linked on RelevantPapers. A version of this document is also available in Chinese, though we can't vouch for the translation.

Goals

We want to make Python faster, but we also want to make it easy for large, well-established applications to switch to Unladen Swallow.

  1. Produce a version of Python at least 5x faster than CPython.
  2. Python application performance should be stable.
  3. Maintain source-level compatibility with CPython applications.
  4. Maintain source-level compatibility with CPython extension modules.
  5. We do not want to maintain a Python implementation forever; we view our work as a branch, not a fork.

Overview

In order to achieve our combination of performance and compatibility goals, we opt to modify CPython, rather than start our own implementation from scratch. In particular, we opt to start working on CPython 2.6.1: Python 2.6 nestles nicely between 2.4/2.5 (which most interesting applications are using) and 3.x (which is the eventual future). Starting from a CPython release allows us to avoid reimplementing a wealth of built-in functions, objects and standard library modules, and allows us to reuse the existing, well-used CPython C extension API. Starting from a 2.x CPython release allows us to more easily migrate existing applications; if we were to start with 3.x, and ask large application maintainers to first port their application, we feel this would be a non-starter for our intended audience.

The majority of our work will focus on speeding the execution of Python code, while spending comparatively little time on the Python runtime library. Our long-term proposal is to supplement CPython's custom virtual machine with a JIT built on top of LLVM, while leaving the rest of the Python runtime relatively intact. We have observed that Python applications spend a large portion of their time in the main eval loop. In particular, even relatively minor changes to VM components such as opcode dispatch have a significant effect on Python application performance. We believe that compiling Python to machine code via LLVM's JIT engine will deliver large performance benefits.

Some of the obvious benefits:

  • Using a JIT will also allow us to move Python from a stack-based machine to a register machine, which has been shown to improve performance in other similar languages (Ierusalimschy et al, 2005; Shi et al, 2005).
  • Eliminating the need to fetch and dispatch opcodes should alone be a win, even if we do nothing else. See http://bugs.python.org/issue4753 for a discussion of CPython's current sensitivity to opcode dispatch changes.
  • The current CPython VM opcode fetch/dispatch overhead makes implementing additional optimizations prohibitive. For example, we would like to implement type feedback and dynamic recompilation ala SELF-93 (Hölzle, Chambers and Ungar, 1992), but we feel that implementing the polymorphic inline caches in terms of CPython bytecode would be unacceptably slow.
  • LLVM in particular is interesting because of its easy-to-use codegen available for multiple platforms and its ability to compile C and C++ to the same intermediate representation we'll be targeting with Python. This will allow us to do inlining and analysis across what is currently a Python/C language barrier.
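
As an illustration of the SELF-style technique mentioned above, a polymorphic inline cache can be sketched in pure Python. All names here are illustrative; a real PIC would live in the VM's generated machine code, not in Python:

```python
# Sketch of a polymorphic inline cache (PIC) for one call site.
# Hypothetical, for illustration only: a real PIC is emitted as machine code.

class InlineCache:
    """Caches type -> function lookups for a single method-call site."""
    MAX_ENTRIES = 4  # beyond this the site is megamorphic; stop caching

    def __init__(self, method_name):
        self.method_name = method_name
        self.entries = {}  # maps receiver type -> resolved function

    def call(self, receiver, *args):
        func = self.entries.get(type(receiver))
        if func is None:
            if len(self.entries) >= self.MAX_ENTRIES:
                # Megamorphic site: perform the generic (slow) lookup.
                return getattr(receiver, self.method_name)(*args)
            # Cache miss: do the full lookup once, then remember it.
            func = getattr(type(receiver), self.method_name)
            self.entries[type(receiver)] = func
        return func(receiver, *args)

site = InlineCache("upper")
print(site.call("spam"))  # cold: generic lookup, result cached for str
print(site.call("eggs"))  # warm: hits the cache, no attribute lookup
```

The point of the structure is that the common, monomorphic case reduces to one type check plus a direct call, which is exactly what is prohibitively slow to express as CPython bytecode today.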

With the infrastructure to generate machine code comes the possibility of compiling Python into a much more efficient implementation than what would be possible in the current bytecode-based representation. For example, take the snippet

for i in range(3):
  foo(i)

This currently desugars to something like

$x = range(3)
while True:
  try:
    $y = $x.next()
  except StopIteration:
    break
  i = $y
  foo(i)

Once we have a mechanism to know that range() means the range() builtin function, we can turn this into something more akin to

for (i = 0; i < 3; i++)
  foo(i)

in C, possibly using unboxed types for the math. We can then unroll the loop to yield

foo(0)
foo(1)
foo(2)
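
The desugared form above mirrors the bytecode CPython actually emits, which is easy to inspect with the standard dis module (exact opcode names vary between CPython versions):

```python
import dis

def f():
    for i in range(3):
        foo(i)  # 'foo' would be resolved at run time; dis never executes it

# The disassembly shows the generic iterator protocol: a global lookup
# for range, GET_ITER / FOR_ITER driving the loop, and a dynamic call
# each iteration -- nothing here knows that range() is the builtin.
dis.dis(f)
```

It is precisely this generality that the proposed optimization removes once range() is known to be the builtin.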

We intend to structure Unladen Swallow's internals to assume that multiple cores are available for our use. Servers are only going to acquire more and more cores, and we want to exploit that to do more and more work in parallel. For example, we would like to have a concurrent code optimizer that applies increasingly-expensive (and -beneficial!) optimizations in parallel with code execution, using another core to do the work. We are also considering a concurrent garbage collector that would, again, utilize additional cores to offload work units. Since most production server machines are shipping with between 4 and 32 cores, we believe this avenue of optimization is potentially lucrative. However, we will have to be sensitive to the needs of highly-parallel applications and not consume extra cores blindly.

Note that many of the areas we will need to address have been considered and developed by other dynamic language implementations like MacRuby, JRuby, Rubinius and Parrot, and in particular other Python implementations like Jython, PyPy, and IronPython. We're looking at these implementations for ideas on debug information, regex performance, and generally useful performance techniques for dynamic languages. This is all fairly well-trodden ground, and we want to avoid reinventing the wheel as much as possible.

Milestones

Unladen Swallow will be released every three months, with bugfix releases in between as necessary.

2009 Q1

Q1 will be spent making relatively minor tweaks to the existing CPython implementation. We aim for a 25-35% performance improvement over our baseline. Our goals for this quarter are conservative, and are aimed at delivering tangible performance benefits to client applications as soon as possible, that is, without waiting until the completion of the project.

Ideas for achieving this goal:

  • Re-implement the eval loop in terms of vmgen.
  • Experiment with compiler options such as 64 bits, LLVM's LTO support, and gcc 4.4's FDO support.
  • Replace rarely-used opcodes with functions, saving critical code space.
  • Improve GC performance (see http://bugs.python.org/issue4074).
  • Improve cPickle performance. Many large websites use this heavily for interacting with memcache.
  • Simplify frame objects to make frame alloc/dealloc faster.
  • Implement one of the several proposed schemes for speeding lookups of globals and builtins.
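
One such scheme can be sketched in pure Python: pair the namespace with a version counter and let each call site cache the resolved value until the version changes. Names here are hypothetical; the real implementation would work at the dict level inside the VM:

```python
# Illustrative sketch of cached global/builtin lookup, assuming a
# versioned namespace. Not the actual proposed implementation.

class VersionedDict(dict):
    """A dict that bumps a version counter on every mutation."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 0

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.version += 1

    def __delitem__(self, key):
        super().__delitem__(key)
        self.version += 1
    # A real implementation would also bump the version in update(),
    # setdefault(), pop(), etc.

class GlobalCache:
    """Per-call-site cache, valid while the namespace version is unchanged."""
    def __init__(self, namespace, name):
        self.namespace = namespace
        self.name = name
        self.cached_version = -1
        self.cached_value = None

    def load(self):
        if self.namespace.version != self.cached_version:  # cache miss
            self.cached_value = self.namespace[self.name]  # full lookup
            self.cached_version = self.namespace.version
        return self.cached_value   # hit: no dict probing at all

ns = VersionedDict(len=len)
site = GlobalCache(ns, "len")
print(site.load())   # miss: full lookup, then cached
ns["len"] = max      # any mutation bumps the version...
print(site.load())   # ...so the cache notices and re-resolves
```

The fast path is a single integer comparison, versus two dict probes (globals, then builtins) for every LOAD_GLOBAL today.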

The 2009Q1 release can be found in the release-2009Q1-maint branch. See Release2009Q1 for our performance relative to CPython 2.6.1.

2009 Q2

Q2 focused on supplementing the Python VM with a functionally-equivalent implementation in terms of LLVM. We anticipated some performance improvement, but that was not the primary focus of the 2009Q2 release; the focus was just on getting something working on top of LLVM. Making it faster will come in subsequent quarters.

Goals:

  • Addition of an LLVM-based JIT.
  • All tests in the standard library regression suite pass when run via the JIT.
  • Source compatibility maintained with existing applications and C extension modules.
  • 10% performance improvement.
  • Stretch goal: 25% performance improvement.

The 2009Q2 release can be found in the release-2009Q2-maint branch. See Release2009Q2 for our performance relative to 2009Q1.

2009 Q3

Q3's development process did not go as originally expected. We had planned that, with the Python->LLVM JIT compiler in place, we could begin optimizing aggressively, exploiting type feedback in all sorts of wonderful ways. That proved somewhat optimistic. We found serious deficiencies in LLVM's just-in-time infrastructure that required a major detour away from our earlier, performance-centric goals. The two most serious problems were a) a hard limit of 16MB of machine code over the lifetime of the process, and b) a bug in LLVM's x86-64 code generation that led to difficult-to-diagnose segfaults (LLVM PR5201).

Given those obstacles, we were relatively happy with the outcome of the 2009Q3 release:

  • Unladen Swallow 2009Q3 uses as little as one tenth of the memory of the 2009Q2 release.
  • Execution performance has improved by 15-70%, depending on benchmark.
  • Unladen Swallow 2009Q3 integrates with gdb 7.0 to better support debugging of JIT-compiled code.
  • Unladen Swallow 2009Q3 integrates with OProfile 0.9.4 and later to provide seamless profiling across Python and C code, if configured with --with-oprofile=<oprofile-prefix>.
  • Many bugs and restrictions in LLVM's JIT have been fixed. In particular, the 2009Q2 limitation of 16MB of machine code has been lifted.
  • Unladen Swallow 2009Q3 passes the tests for all the third-party tools and libraries listed on the Testing page. Significantly for many projects, this includes compatibility with Twisted, Django, NumPy and Swig.

The 2009Q3 release can be found in the release-2009Q3-maint branch. See Release2009Q3 for our performance relative to 2009Q2. Hearty congratulations go out to our intern, Reid Kleckner, for his vital contributions to the Q3 release.

2009 Q4

Given the relative immaturity of LLVM's just-in-time infrastructure, we anticipate spending more time fixing fundamental problems in this area of LLVM. (We note that the rest of LLVM is a paradise by comparison.) We anticipate a modest performance increase over Q3, though most of our time will go toward ensuring a high-quality, ultra-stable product so that we have a solid footing for merger with CPython in 2010. We are choosing to shift our focus in this way -- stability now, performance later -- so that all our necessary changes can be incorporated into LLVM 2.7, which will then form the baseline for an LLVM-based CPython 3.x.

Areas for performance improvement (non-exhaustive):

  • Binary operations via type feedback.
  • Attribute/method lookup via type feedback.
  • Moving compilation to a non-blocking background thread.
  • Import optimizations (surprisingly important to some benchmarks).

We intend to tag the 2009Q4 release in early January 2010 (to allow for holidays in the United States).

Long-Term Plans

The plan for Q3 onwards is simply to iterate over the literature. We aspire to do no original work, instead using as much of the last 30 years of research as possible. See RelevantPapers for a partial list of the papers we plan to implement or draw upon.

We plan to address performance in the regular expression engine, as well as in any other extension modules found to be bottlenecks. Regular expressions are already known to be a good target for our work and will be considered first for optimization.

Our long-term goal is to make Python fast enough to start moving performance-important types and functions from C back to Python.

Global Interpreter Lock

From an earlier draft of this document:

In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM's Recycler (Bacon et al, 2001).

Our original plans for dealing with the GIL centered around Recycler, a garbage collection scheme proposed by IBM researchers. This appeared at first blush to be an excellent match for Python, and we were excited about prospects for success. Further investigation and private communications revealed that safethread, an experimental branch of CPython, had also implemented Recycler, but with minimal success. The author relayed that he had demonstrated a speed-up on a dual-core system, but performance degraded sharply at four cores.

Accordingly, we are no longer as optimistic about our chances of removing the GIL completely. We now favor a more incremental approach improving the shortcomings pointed out by Dave Beazley, among others. In any case, work on the GIL should be done directly in mainline CPython, or on a very close branch of Python 3.x: the sensitive nature of the work recommends a minimal delta, and doing the work and then porting it from 2.x to 3.y (as would be the case for Unladen Swallow) is a sure-fire way of introducing exceedingly-subtle bugs.

Longer-term, we believe that CPython should drop reference counting and move to a pure garbage collection system. There is a large volume of classic literature and ongoing research into garbage collection that could be more effectively exploited in a pure-GC system. Even without the GIL, the current volume of refcount update operations will make scaling effectively to many-core systems a challenge; we believe a pure garbage collection scheme would alleviate these pressures somewhat.

Detailed Plans

JIT Compilation

We plan to start with a simple, easy-to-implement JIT compiler, then add complexity and sophistication as warranted. We will start by implementing a simple, function-at-a-time compiler that takes the CPython bytecode and converts it to machine code via LLVM's internal intermediate representation (IR). The initial implementation of this bytecode-to-machine code compiler will be done by translating the code in CPython's interpreter loop to calls to LLVM's IRBuilder. We will apply only a small subset of LLVM's available optimization passes (current passes), since it's not clear that, for unoptimized Python IR, the extra compilation time spent in the optimizers will be compensated with increased execution performance. We've experimented with using LLVM's fast instruction selector (FastISel), but it fails over to the default, slower instruction selection DAG with our current IR; we may revisit this in the future.

We've chosen a function-at-a-time JIT instead of a tracing JIT (Gal et al, 2007) because we believe whole-function compilation is easier to implement, given the existing structure of CPython bytecode. That is not to say we are opposed to a tracing JIT; on the contrary, implementing a whole-function compiler will provide a large amount of the necessary infrastructure for a tracing JIT, and a whole-function JIT will serve as a valuable baseline for comparing the performance of any proposed tracing implementations. LLVM's analysis libraries already have some support for specially optimizing hot traces (in Trace.cpp) that we may be able to take advantage of if we pursue a tracing JIT. We can also get some of the benefits of tracing by having the instrumentation for the planned feedback-directed optimization system track taken/not-taken branches, then pruning the not-taken branches from the generated machine code.

Since we only wish to spend time optimizing code that will benefit the most -- the program's hot spots -- we will need a way to model hotness. Like in the rest of the JIT compiler, we plan to start simple and add complexity and sophistication as the benchmark results warrant. Our initial model will be very simple: if a function is called 10000 times, it is considered hot and is sent to LLVM for compilation to machine code. This model is obviously deficient, but will serve as a baseline for improvement. Enhancements we're considering: using interpreter loop ticks instead of function calls; aggregating the hotness level of leaf functions up the call tree. In the case of long-running loops, we probably won't try to replace those loops mid-execution, though it should be possible to do this if we desire.
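
The call-count model is simple enough to sketch in a few lines of Python. The decorator, threshold constant and `compile_to_native` hook below are all illustrative stand-ins; in Unladen Swallow the counting happens in the eval loop and compilation goes through LLVM:

```python
import functools

HOTNESS_THRESHOLD = 10000  # the simple model described above

def compile_to_native(func):
    """Stand-in for handing the function's bytecode to LLVM."""
    # A real JIT would return a compiled entry point; we just reuse func.
    return func

def jit(func):
    """Count calls; once the function crosses the threshold, 'compile' it."""
    state = {"calls": 0, "native": None}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if state["native"] is not None:
            return state["native"](*args, **kwargs)  # run "machine code"
        state["calls"] += 1
        if state["calls"] >= HOTNESS_THRESHOLD:
            state["native"] = compile_to_native(func)
        return func(*args, **kwargs)  # still "interpreting"

    wrapper.state = state  # exposed only so the model can be inspected
    return wrapper

@jit
def add(a, b):
    return a + b

for i in range(10001):
    add(i, i)
# add is now hot and would run as native code from here on
```

The deficiencies noted above are visible even in the sketch: a long-running loop inside a cold function never trips the counter, which is why interpreter-tick counting and on-stack replacement are listed as possible enhancements.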

Even at its most naive, the generated machine code differs from its analogue in the interpreter loop in important ways:

  • The machine code does not support the line-level tracing used by Python's pdb module and others. If the machine code detects that tracing has turned on, it will bail to the interpreter loop, which picks up execution at the same place the machine code left off.
  • In the machine code, thread switching and signal handling are done at function calls and loop backedges (see r699), rather than every 100ish opcodes as in the interpreter loop. There's plenty of room to optimize this: we can eliminate thread switching/signal handling code on loop backedges where the loop contains function calls, or where we have multiple function calls back-to-back. Supporting threads and signals in the machine code imposes fairly low overhead (2-4%), so we probably won't start optimizing this unless we find that it inhibits additional optimizations.

In the initial draft of the JIT compiler, we will block execution while compiling hot functions. This is expensive. We will shift compilation to a separate worker thread using FIFO work queues, including instrumentation for work unit throughput, execution pause times, and temporal clustering so that we can measure the impact on execution time. Rubinius developed this technique independently and has seen success using it to reduce pause times in execution. We believe this strategy will allow us to perform potentially more expensive optimizations than if we always had to block execution on compilation/optimization (ala Self). This work is being tracked in issue 40.
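
The shape of the background-compilation scheme can be sketched with the standard queue and threading modules. `compile_to_native` and `mark_hot` are hypothetical stand-ins for the LLVM pipeline and the eval loop's hotness check:

```python
import queue
import threading

compile_queue = queue.Queue()   # FIFO: hot functions awaiting compilation
compiled = {}                   # func -> "machine code" (stand-in)

def compile_to_native(func):
    """Stand-in for bytecode -> LLVM IR -> machine code."""
    return f"<native code for {func.__name__}>"

def compiler_worker():
    """Drain the queue; execution threads never block on compilation."""
    while True:
        func = compile_queue.get()
        if func is None:        # sentinel: shut the worker down
            break
        compiled[func] = compile_to_native(func)
        compile_queue.task_done()

worker = threading.Thread(target=compiler_worker, daemon=True)
worker.start()

def mark_hot(func):
    """Called when a function crosses the hotness threshold."""
    compile_queue.put(func)     # returns immediately; no execution pause

def spam():
    pass

mark_hot(spam)
compile_queue.join()            # only for this demo; real code keeps running
print(compiled[spam])
```

Until the worker finishes, the function keeps executing in the interpreter, which is what makes the more expensive optimizations affordable.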

Feedback-Directed Optimization

We wish to make use of the wealth of information available at runtime to optimize Python programs. Sophisticated implementations of Self, Java and JavaScript have demonstrated the real-world applicability of these techniques, and Psyco has demonstrated some applicability to Python. Accordingly, we believe it will be profitable to make use of runtime information when optimizing Unladen Swallow's native code generation.

Background

Unladen Swallow compiles Python code to a bytecode representation that is amenable to reasonably-performant execution in a custom virtual machine. Once a piece of code has been deemed hot, we compile the bytecode to native code using LLVM. It is this native code that we seek to optimize. Optimizing the execution of the generated bytecode is less interesting since the system should be selecting the most performance-critical functions for compilation to native code. However, modifications to the bytecode to enable easier compilation/profiling are fair game.

Points

  • We wish to gather as much information at runtime as possible, not merely type information (as implied by the more specific name "type feedback"). The representation used should allow for sufficient flexibility to record function pointer addresses, branch-taken statistics, etc, potentially any and every piece of information available at runtime.
  • The gathered information should live in the code object so that it lasts as long as the relevant bytecode.
  • Hölzle 1994 includes an analysis showing that the vast majority of Self call sites are monomorphic (see section 3). Recent analysis of Ruby programs has observed a similar distribution of call site arity in Ruby. We believe that Python programs are sufficiently similar to Self and Ruby in this regard. Based on these findings from other languages, we will want to limit our optimization attempts to call sites with arity < 3. Our implementation of feedback-directed optimization should gather enough data to conduct a similar analysis for Python.
  • Due to the nature of Python's bytecode format, we believe it would be unprofitable to implement the desired level of data gathering inline, that is, as separate opcodes. Instead, the bodies of interesting opcodes in the Python VM will be modified to record the data they use. This will be faster (avoids opcode fetch/dispatch overhead) and easier to reason about (no need to track multi-byte opcode sequences).
  • We will optimize for the common case, as determined by the data-gathering probes we will add to interesting opcodes. Guards will detect the uncommon case and fail back to the interpreter. This allows our assumptions about the common case to propagate to later code. Again, see Hölzle 1994.

The initial round of data-gathering infrastructure (for types and branches) was added in r778.
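
The kind of per-site data the probes gather can be sketched in pure Python. The class and site below are illustrative only; as described above, the real recording happens inside the opcode bodies in the VM:

```python
from collections import Counter

class SiteFeedback:
    """Runtime feedback for one bytecode site: observed operand types."""
    def __init__(self):
        self.types = Counter()

    def record(self, *operands):
        self.types[tuple(type(o) for o in operands)] += 1

    def is_monomorphic(self):
        return len(self.types) == 1

    def common_case(self):
        return self.types.most_common(1)[0][0]

# Feedback for a hypothetical BINARY_ADD site:
site = SiteFeedback()
for a, b in [(1, 2), (3, 4), (5, 6)]:
    site.record(a, b)
site.record("x", "y")  # one rare string case

print(site.is_monomorphic())  # two type pairs have been seen
print(site.common_case())     # (int, int) dominates, so specialize for it
```

The same Counter-style representation extends naturally to branch-taken statistics and function pointer addresses, per the first bullet above.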

Places we would like to optimize (non-exhaustive)

  • Operators. If both operands are ints, we would like to inline the math operations into the generated machine code, rather than going through indirect calls. If both operands are strings, we would like to call directly to the appropriate function (possibly compiled with Clang and using the fastcc calling convention) rather than going through the indirection in PyNumber_Add(). This optimization is being tracked in issue 73.
  • UNPACK_SEQUENCE. Knowing the type of the sequence being unpacked could allow us to inline the subsequent STORE_FOO opcodes and avoid a lot of stack manipulation.
  • Calls to builtins. We would like to inline calls to builtin functions where that is deemed profitable, or at the very least, avoid looking the function up via LOAD_GLOBAL. For example, inlining len() could save not only the LOAD_GLOBAL lookup but also the layers of indirection incurred in PyObject_Size(). In the best case, a call to len() on lists or tuples (or other builtin types) could be turned into ((PyVarObject *)(ob))->ob_size. This optimization is being tracked in issue 67 (LOAD_GLOBAL improvements) and issue 75 (inlining simple builtins).
  • Branches. If a branch is always taken in a given direction, we can omit the machine code for the uncommon case, falling back to the interpreter instead. This can be used to simplify the control-flow graph and thus allow greater optimization of the common case and the code that follows it. This optimization is being tracked in issue 72.
  • Method dispatch. If we know the most-likely receiver types for a given method invocation, we can potentially avoid the method lookup overhead or inline the call entirely. Note that in Python 2.6 and higher, method lookups are cached in the type objects so the potential savings of skipping some steps in the cache check process may be minimal. Better to reuse this information for possible inlining efforts.
  • Function calls. If we know the parameter signature of the function being invoked, we can avoid the overhead of taking the arguments and matching them up with the formal parameters. This logic can be fairly expensive, since it is designed to be as general as possible to support the wide variety of legal function call/definition combinations. If we don't need to be so general, we can be faster. This optimization is being tracked in issue 74.
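
The guard-and-bail pattern behind the operator case above can be shown as a pure-Python stand-in (the real version emits this shape in LLVM IR, and "bailing" means returning to the interpreter rather than calling a generic helper):

```python
def generic_add(a, b):
    """Stand-in for the slow path through PyNumber_Add()."""
    return a + b

def specialized_add(a, b):
    """Add specialized on feedback saying both operands are ints.

    The exact type checks mirror the guards emitted in machine code;
    failing a guard falls back to the generic implementation.
    """
    if type(a) is int and type(b) is int:   # guard: the common case
        return int.__add__(a, b)            # direct call, no dynamic dispatch
    return generic_add(a, b)                # bail: the uncommon case

print(specialized_add(2, 3))       # takes the guarded fast path
print(specialized_add("a", "b"))   # guard fails; generic path
```

Because the guard establishes that both operands are ints, that fact can propagate to later code in the compiled function, which is the point made about common-case assumptions above.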

Regular Expressions

While regexes aren't a traditional performance hotspot, we've found that most regex users expect them to be faster than they are, and are surprised when regex-heavy code becomes a bottleneck. For this reason, we'd like to invest some resources in speeding up CPython's regex engine.

CPython's current regex engine is a stack-based bytecode interpreter. It does not take advantage of any form of modern techniques to improve opcode dispatch performance (Bell 1973; Ertl & Gregg, 2003; Berndl et al, 2005) and is in other respects a traditional, straightforward virtual machine. We believe that many of the techniques being applied to speed up pure-Python performance are equally applicable to regex performance, starting at improved opcode dispatch all the way through JIT-compiling regexes down to machine code.

Recent work in the Javascript community has confirmed our belief. Google's V8 engine now includes Irregexp, a JIT regex compiler, and the new SquirrelFish Extreme includes a new regex engine based on the same principle: trade JIT compilation time for execution time. Both of these show impressive gains on the regex section of the various Javascript benchmarks. We would like to replicate these results for CPython.

We also considered using Thompson NFAs for very simple regexes, as advocated by Russ Cox. This would create a multi-engine regex system that could choose the fastest way of implementing any given pattern. The V8 team also considered such a hybrid system when working on Irregexp but rejected it, saying

The problem we ran into is that not only backreferences but also basic operators like | and * are defined in terms of backtracking. To get the right behavior you may need backtracking even for seemingly simple regexps. Based on the data we have for how regexps are used on the web and considering the optimizations we already had in place we decided that the subset of regexps that would benefit from this was too small.

One problem that needs to be overcome before any work on the CPython regex engine begins is that Python lacks a regex benchmark suite. We might be able to reuse the regexp.js component of the V8 benchmarks, but we would first need to verify that these are representative of the kind of regular expressions written in Python programs. We have no reason to believe that regexes used in Python programs differ significantly from those written in Javascript, Ruby, Perl, etc programs, but we would still need to be sure.
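
In the absence of a dedicated suite, a minimal regex microbenchmark can be assembled from the standard timeit and re modules. The patterns below are arbitrary placeholders, not a workload anyone has verified as representative:

```python
import re
import timeit

# Placeholder patterns only -- a real suite would need patterns verified
# to be representative of regexes written in Python programs.
CASES = [
    (r"[a-z]+@[a-z]+\.[a-z]+", "contact us at spam@example.com today"),
    (r"(\d+)-(\d+)", "ranges 10-20 and 30-40"),
]

for pattern, text in CASES:
    compiled = re.compile(pattern)
    seconds = timeit.timeit(lambda: compiled.search(text), number=100_000)
    print(f"{pattern!r}: {seconds:.3f}s per 100k searches")
```

A real suite would also need to separate compile time from match time, since the JIT approaches described above deliberately trade the former for the latter.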

Start-up Time

In talking to a number of heavy Python users, we've gotten a lot of interest in improving Python's start-up time. This comes from both very large-scale websites (who want faster server restart times) and from authors of command-line tools (where Python start time might dwarf the actual work done).

Start-up time is currently dominated by imports, especially for large applications like Bazaar. Python offers a lot of flexibility by deferring imports to runtime and providing a lot of hooks for configuring exactly how imports will work and where modules can be imported from. The price for that flexibility is slower imports.

For large users that don't take advantage of that flexibility -- in particular servers, where imports shouldn't change between restarts -- we might provide a way to opt in to stricter, faster import semantics. One idea is to ship all required modules in a single, self-contained "binary". This would both a) avoid multiple filesystem calls for each import, and b) open up the possibility of Python-level link-time optimization, resulting in faster code via inter-module inlining and similar optimizations. Self-contained images like this would be especially attractive for large Python users in the server application space, where hermetic builds and deployments are already considered essential.

A less invasive optimization would be to speed up Python's marshal module, which is used for .pyc and .pyo files. Based on Unladen Swallow's work speeding up cPickle, similarly low-hanging fruit probably exists in marshal.
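
marshal's role in start-up is easy to see: it is the serializer for code objects, which is exactly what .pyc files contain. A round-trip sketch:

```python
import marshal

# marshal handles code objects (which pickle does not), plus the simple
# scalar and container types -- exactly what .pyc files need.
code = compile("x = 40 + 2", "<example>", "exec")
blob = marshal.dumps(code)        # bytes, as stored in a .pyc file
restored = marshal.loads(blob)

namespace = {}
exec(restored, namespace)
print(namespace["x"])             # prints 42
```

Every import of a cached module pays the marshal.loads cost, which is why shaving it matters for start-up time.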

We already have benchmarks tracking start-up time in a number of configurations. We will probably also add microbenchmarks focusing specifically on imports, since imports currently dominate CPython start time.
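
Start-up can be measured crudely from the outside by timing a trivial script in a subprocess; a sketch (this times whatever interpreter sys.executable points at, and best-of-N only partly controls for caching and scheduling noise):

```python
import subprocess
import sys
import time

def startup_time(runs=5):
    """Wall-clock time to start the interpreter, run 'pass', and exit."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        best = min(best, time.perf_counter() - start)
    return best  # best-of-N reduces scheduling noise

print(f"interpreter start-up: {startup_time():.3f}s")
```

Replacing "pass" with an import-heavy snippet turns the same harness into the kind of import microbenchmark mentioned above.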

Testing and Measurement

Performance

Unladen Swallow maintains a directory of interesting performance tests under the tests directory. perf.py is the main interface to the benchmarks we care about, and will take care of priming runs, clearing *.py[co] files and running interesting statistics over the results.

Unladen Swallow's benchmark suite is focused on the hot spots in major Python applications, in particular web applications. The major web applications we have surveyed have indicated that they bottleneck primarily on template systems, and hence our initial benchmark suite focuses on them:

  • Django and Spitfire templates. Two very different ways of implementing a template language.
  • 2to3. Translates Python 2 syntax to Python 3. Has an interesting, pure-Python kernel that makes heavy use of objects and method dispatch.
  • Pickling and unpickling. Large-scale web applications rely on memcache, which in turn uses Python's pickle format for serialization.

There are also a number of microbenchmarks, for example, an N-Queens solver, an alphametics solver and several start-up time benchmarks.

Apart from these, our benchmark suite includes several crap benchmarks like Richards, PyStone and PyBench; these are only included for completeness and comparison with other Python implementations, which have tended to use them. Unladen Swallow does not consider these benchmarks to be representative of real Python applications or Python implementation performance, and does not run them by default or make decisions based on them.

For charting the long-term performance trend of the project, Unladen Swallow makes use of Google's standard internal performance measurement framework. Project members will post regular performance updates to the mailing lists. For testing individual changes, however, using perf.py as described on the Benchmarks page is sufficient.

Correctness

In order to ensure correctness of the implementation, Unladen Swallow uses both the standard Python test suite, plus a number of third-party libraries that are known-good on Python 2.6. In particular, we test third-party C extension modules, since these are the easiest to break via unwitting changes at the C level.

As work on the JIT implementation moves forward, we will incorporate a fuzzer into our regular test run. We plan to reuse Victor Stinner's Fusil Python fuzzer as much as possible, since it a) exists, and b) has been demonstrated to find real bugs in Python.

Unladen Swallow will come with a --jit option that can be used to control when the JIT kicks in. For example, --jit=never would disable the JIT entirely, while --jit=always would skip the warm-up interpreted executions and jump straight into native code generation; --jit=once would disable recompilation. These options will be used to test the various execution strategies in isolation. Our goal is to avoid JIT bugs that are never encountered because the buggy function isn't hot enough, as have been observed in the JVM (and likewise, bugs visible only in interpreted mode).

Unladen Swallow maintains a BuildBot instance that runs the above tests against every commit to trunk.

Complexity

One of CPython's virtues is its simplicity: modifying CPython's VM and compiler is relatively simple and straightforward. Our work with LLVM will inevitably introduce more complexity into CPython's implementation. In order to measure the productivity trade-offs that may result from this extra machinery, the Unladen Swallow team will periodically take ideas from the python-dev and python-ideas mailing lists and implement them. If implementation is significantly more difficult than the corresponding change to CPython, that's obviously something we'll need to address before merger. We may also get non-team members to do the implementations so that we get a less biased perspective.

Risks

  • May not be able to merge back into mainline. There are vocal, conservative senior members of the Python core development community who may oppose the merger of our work, since it will represent such a significant change. This is a good thing! Resistance to change can be very healthy in situations like this, as it will force a thorough, public examination of our patches and their possible long-term impact on the maintenance of CPython -- this is open source, and another set of eyes is always welcome. We believe we can justify the changes we're proposing, and by keeping in close coordination with Guido and other senior members of the community we hope to limit our work to only changes that have a good chance of being accepted. However: there is still the chance that some patches will be rejected. Accordingly, we may be stuck supporting a de facto separate implementation of Python, or as a compromise, not being as fast as we'd like. C'est la vie.
  • LLVM comes with a lot of unknowns: Impact on extension modules? JIT behaviour in multithreaded apps? Impact on Python start-up time?
  • Windows support: CPython currently has good Windows support, and we'll have to maintain that in order for our patches to be merged into mainline. Since none of the Unladen Swallow engineers have any/much Windows experience or even Windows machines, keeping Windows support at an acceptable level may slow down our forward progress or force us to disable some performance-beneficial code on Windows. Community contributions may be able to help with this.
  • Specialized platforms: CPython currently runs on a wide range of hardware and software platforms, from big iron server machines down to Nokia phones. We would like to maintain that kind of hard-won portability and flexibility as much as possible. We already know that LLVM (or even a hand-written JIT compiler) will increase memory usage and Python's binary footprint, possibly to a degree that makes it prohibitive to run Unladen Swallow on previously-supported platforms. To mitigate this risk, Unladen Swallow will include a ./configure flag to disable LLVM integration entirely, forcing the use of the traditional eval loop.
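As a sketch, that escape hatch would be used at build time roughly like this (the flag name --without-llvm matches the one described in PEP 3146; treat the exact invocation as an assumption):

```sh
# Build the tree with the LLVM-based JIT compiled out entirely,
# falling back to the traditional bytecode eval loop. This trades
# peak performance for a much smaller binary and memory footprint
# on constrained platforms.
./configure --without-llvm
make
```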

Lessons Learned

This section attempts to list the ways that our plans have changed as work has progressed, as we've read more papers and talked to more people. This list is incomplete and will only grow.

  • Early in our planning, we had considered completely removing the custom CPython virtual machine and replacing it with LLVM. The benefit would be that there would be less code to maintain, and only a single encoding of Python's semantics. The theory was that we could either a) generate slow-to-run but fast-to-compile machine code, or b) generate LLVM IR and run it through LLVM's interpreter. Both of these turned out to be impractical: even at its fastest (using no optimization passes and LLVM's FastISel instruction selector), compiling to native code was too slow. LLVM's IR interpreter was both too slow and did not support all of LLVM's IR. Preserving the CPython VM also allows Unladen to keep compatibility with the unfortunate number of Python packages that parse CPython bytecode.
  • There are a number of Python programs that parse or otherwise interact with CPython bytecode. Since the exact opcode set, semantics and layout are considered an implementation detail of CPython, we were inclined to disregard any breakage we may inflict on these packages (a pox upon their houses, etc). However, some packages that are too important to break deal with CPython bytecode, among them Twisted and setuptools. This has forced us to be more cautious than we would otherwise like when changing the bytecode.
  • Initially we generated LLVM IR at the same time as the Python bytecode, using hooks from the Python compiler. The idea was that we would move to generating the LLVM IR from the AST instead, allowing us to generate more efficient IR. The overhead of generating LLVM IR and the decision not to get rid of the CPython VM pre-empted that move, however, and instead we now generate LLVM IR from the Python bytecode. Besides allowing us to generate LLVM IR for any code object (including hand-crafted ones) and not requiring us to keep the AST for a code block around, it also means the existing (bytecode) peephole optimizer ends up optimizing the bytecode before it is turned into LLVM IR. The peephole optimizer, with its intimate knowledge of Python semantics, does a good job optimizing things that would be a lot harder to do in LLVM optimization passes.
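The effect of that pipeline ordering is easy to observe from plain CPython using the standard dis module: the bytecode peephole optimizer runs inside compile(), so any later bytecode-to-IR translation starts from the already optimized opcode stream. A minimal sketch using constant folding as the example (in recent CPython releases this particular folding has moved into the AST optimizer, but the observable result is the same):

```python
import dis

# compile() applies the compile-time optimizer, so the constant
# expression "2 * 3600" is folded to 7200 before the code object is
# ever produced. A bytecode-based LLVM IR generator would therefore
# see a single constant load, never the multiplication.
code = compile("seconds = 2 * 3600", "<example>", "exec")

print(7200 in code.co_consts)   # the folded constant is stored directly
dis.dis(code)                   # the disassembly shows LOAD_CONST 7200
```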

Communication

All communication about Unladen Swallow should take place on the Unladen Swallow list. This is where design issues will be discussed, as well as notifications of continuous build results, performance numbers, code reviews, and all the other details of an open-source project. If you add a comment below, on this page, we'll probably miss it. Sorry. Please mail our list instead.

Comment by chinbill...@gmail.com, Mar 25, 2009

Why not apply this work to Python 3? If you are successful it may encourage the adoption of Python 3 more quickly (and you obviously won't have to port this work to Python 3). --Nick

Comment by project member collinw, Mar 25, 2009

chinbillybilbo: the applications we are trying to speed up all use Python 2.x. If we required all these applications to first port to Python 3 in order to get any performance benefit, we feel that would be a non-starter for the applications we're focusing on.

Comment by ronaldou...@mac.com, Mar 26, 2009

Very interesting.

W.r.t. the GC: Apple released their GC under the Apache license (http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/). Its finalizer semantics seem slightly incompatible with that of Python, in that a finalizer may not revive an object, but it might be possible to work around that.

Comment by chphilli, Mar 26, 2009

Reading through all of this got me really excited about this project, and then I read the killer: "Will probably kill Python Windows support (for now).". For integration reasons, my company is currently stuck on Windows!

Anyways, I'm still excited about this project! Making Python faster can only be good for the community!

Comment by karl.1mi...@gmail.com, Mar 26, 2009

The creator of Ruby is tweeting about this: http://twitter.com/yukihiro_matz

Comment by teja...@yahoo.com, Mar 26, 2009

Today's ars technica article http://arstechnica.com/open-source/news/2009/03/google-launches-project-to-boost-python-performance-by-5x.ars calls this a Google project. But I see no such claim, just the fact that you have access to Google's performance measurement framework. And of course, anyone can start a code.google.com project. Can you clarify for those curious?

Comment by jason....@gmail.com, Mar 26, 2009

I think a mainline branch of Python will need to continue to provide an interpreter, at least as a compile-time option. You can't beat plain old C for portability, and LLVM has a non-trivial memory footprint (if you load up all the optimization modules) which might not work so well for embedded systems.

Maintaining the C API compatibility will also be an interesting problem. Removing the GIL, creating an advanced garbage collector, and moving away from reference counting will probably go hand-in-hand, and the refcount-oriented single-threaded C API is not going to make things easy. You may want to leave open the possibility of designing a new C API (sans refcounting) while supporting the legacy API using proxy objects or object handles -- you can't have a copying/compacting GC if C code has direct pointers to Python objects.

Comment by 75.blueb...@gmail.com, Mar 27, 2009

Is it possible to have a bit of background on who is financing the project and the history of the people doing it?

The plan looks very real, but it sounds strange to see a "let's make Python 5x faster in less than one year" without even Guido participating.

Comment by johndenn...@gmail.com, Mar 27, 2009

Here at Red Hat we use Python for a lot of things. What we've observed is that execution performance is not the main issue (although improving it would be greatly appreciated); rather, it's the memory footprint which is the problem we most often encounter. If anything can be done to reduce the massive amount of memory Python uses it would be a huge win. I would encourage you to consider memory usage as just as important a goal as execution speed if you're going to tackle optimizing CPython.

Comment by project member collinw, Mar 27, 2009

This project is Google-financed, but not Google-owned. The two engineers working on this are full-time Google engineers in the compiler optimization team, working on this as their main project. We have other Googlers contributing patches in their 20% time.

Despite that, this is not Google's property. We are pushing changes upstream as quickly as we can, some of which are already in CPython mainline. Google cares a great deal about performance, and because we realize that other people do too, we want to contribute our work back to the open-source world so that everyone can reap the benefit.

Comment by ngri...@gmail.com, Mar 27, 2009

This is wonderful! I was hoping Google would start such a project, since Python is one of Google's official languages. I'll be following this during the next months. A big thanks for sharing your work!

Comment by virtualc...@gmail.com, Mar 27, 2009

Awesome project. LLVM is great for VMs, and Python needs this. Apple uses LLVM in their OS architecture and graphics with OpenGL. It could be that Python will replace Java in the future.

Comment by dreaming...@gmail.com, Mar 27, 2009

There was another "Python3k" project to use a 64-bit architecture on top of the Apache Portable Runtime for the VM called Prothon. Check out this comp.lang.python thread:

http://groups.google.com/group/comp.lang.python.announce/browse_thread/thread/1e6ebaa7b2c98994/acb0e1edb2ca449a?lnk=st&q=comp.lang.python+prothon#acb0e1edb2ca449a

(or google? search: comp.lang.python prothon hahn collins)

...also check out this follow up which includes suggestions that should have made it into python3k but didn't:

http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-03/4822.html (search: comp.lang.python prothon hahn zipher python3k)

Comment by ctis...@gmail.com, Mar 30, 2009

frikker: I think I wasn't sarcastic, but upset, because the project seemed to completely ignore PyPy. But I wish them luck, too. Would love to see more collaboration between projects, actually.

Comment by a.bad...@gmail.com, Apr 3, 2009

I strongly second the need for a smaller memory footprint. This is especially problematic on x86_64 hosts where python typically needs 2x the RAM of the ix86 interpreter to run an application. In contrast, C applications seem to take 1.25x the RAM on x86_64.

Comment by connelly...@gmail.com, Apr 3, 2009

It's good to see Python optimization efforts that focus on compatibility, specifically with CPython extensions. The language forks such as PyPy, Stackless, Jython, IronPython, PyVM, etc. are, I'm sure, useful to some people, but a large class of Python users relies on CPython extensions.

Comment by giovanni...@gmail.com, Apr 5, 2009

Shameless plug: there will be an Unladen Swallow talk at PyCon Italy: http://www.pycon.it/conference/talks/improving-air-speed-velocity-python

PyCon Italy will see 400 people attending and real-time ita->eng translations for the main track (so speaking Italian is not a showstopper to attend!) http://www.pycon.it/pycon3/non-italians/

The whole website is in English too. Early bird is still open.

Comment by kirill.k...@gmail.com, Apr 17, 2009

Exactly which benefits do you get from using LLVM instead of libJIT? libJIT has already been tested in the Portable.NET JIT; it reaches and beats Mono's JIT performance. Is Python that different from .NET and the Common Intermediate Language?

Comment by m113...@gmail.com, Apr 20, 2009

Regarding concerns about mem usage, current and future: Are there thoughts about splitting the runtime into a lightweight (handheld) client and a heavyweight (Google) backend?

Comment by hrfe...@gmail.com, May 1, 2009

chphilli: I don't see where they are going to 'kill Python Windows support for now'. Page changed? That would be a bummer because as of May 09, most kids have windows still, and python is a great language for learning. It'd be a shame to lose that demographic. Amongst the massive list of technical challenges, navigation of competing demands, and admirable goals, saying "oh we don't even have windows machines" sticks out as kind of a bubble-ish 1337 cop-out. I call you on it out of love.

Comment by svetli...@gmail.com, Aug 12, 2009

hrfeels: don't worry! I'm convinced the official Python implementation will continue to support all the major platforms, including Windows. What they meant is that their currently developed optimized implementation of Python is not Windows-compatible (or at least not tested there).

Comment by gianni...@gmail.com, Jan 29, 2010

It sounds like a great idea. Why don't you run a contest or something in order to attract more people to work on the project? All this will be great when it is ready.

Comment by wayne.da...@gmail.com, Mar 1, 2010

A for Awesome.

Sounds complicated stuff. The project plan oozes enthusiasm, brilliant. Hope it's a success!

Comment by elff...@gmail.com, Mar 3, 2010

Q4 was scheduled for January, and now it is March... If something is delayed or there are changes to the plan, please inform the community.

Comment by spyro...@gmail.com, Mar 4, 2010

effikk,

Q4 was released back in January, you can find it here: http://code.google.com/p/unladen-swallow/source/browse/#svn/branches/release-2009Q4-maint

It was a silent, unannounced release. For additional information consult the google groups and python mailing list. (For the record, I am not a dev of or a contributor to this project - just a dedicated python user).

Comment by leon.mat...@gmail.com, Mar 14, 2010

Does the recently announced RE2 regular expression library (http://code.google.com/p/re2/) from Google figure into future plans for speeding up the regex engine?

We use Python regexes to parse a lot of little log files, and would love a speed bump. If it meant a significant performance boost, I could learn to live without back-references... :-)

Comment by garen.pa...@gmail.com, Apr 25, 2010

What happened to 2009 Q4?

Comment by gbatm...@gmail.com, Apr 28, 2010

What is the state of this ambitious project at this time? We want news! Please!

Comment by j0gat...@gmail.com, May 22, 2010

Would love to hear more about the status of this terrific project. It seems to have gone a bit silent - has Google reduced commitment or funding?

Comment by khame...@gmail.com, Jun 26, 2010

Gurus, it is a pity to have such a poor out-of-date plan and home page for such a huge project. This plan is de facto a history log. Please, update!

Comment by ruslan.u...@gmail.com, Aug 3, 2010

What's happening with Unladen? Why is its development so slow?

Comment by etolle...@gmail.com, Aug 4, 2010

It appears to have been decided to push the unladen swallow JIT into core python (http://www.python.org/dev/peps/pep-3146/) - I suspect this is why it has basically stopped - most of the people are probably working on py3k-jit (the unladen swallow merge into py 3.2/3.3) instead.

Comment by ruslan.u...@gmail.com, Sep 29, 2010

After studying some internet resources, I have come to the conclusion: unladen-swallow is dead, and Google has switched its developers to the new language Go.

Comment by lesiuk@gmail.com, Nov 13, 2010

yeah unladen-swallow

Comment by siv...@gmail.com, Nov 19, 2010

Looks like ruslan.usifov was right. Google's going with Go. But Go still has very few libraries.

I think the Unladen Swallow project was not a big success.

I have read an article in the google groups which can be seen here:

http://webcache.googleusercontent.com/search?q=cache:i2k9KDkpOO0J:groups.google.com/group/unladen-swallow/browse_thread/thread/4edbc406f544643e+The+case+against+python+google+groups&cd=1&hl=en&ct=clnk&gl=in

Is this an alarming thing for developers depending on Python??

Well, anyway, since Ruby also does not have good libraries, I think I need to resort back to Perl (and occasionally C).

C/C++ cumbersome to write!? Python and Ruby slow, with the GIL issue!? Java's uncertain future!? Erlang, Groovy, Scala, Clojure half-baked!?

Perl is the only one left. So, I am going to the Perl ship.

Cheers.

Comment by mich.mierzwa@gmail.com, Nov 19, 2010

I think they realized they were not able to optimize it as they claimed. They did what they could; that's it. Google Go, despite its mission, is neither as easy as Python nor as quick as Java (not to mention C). And there is still no version for Windows. The plugin for Eclipse is also at a very early stage. Today there is no language which fulfills those claims (as quick to develop in as Python and as fast as Java, or in the worst case as V8). My little and dirty app runs slower under Unladen Swallow than under pure CPython. I am waiting for PyPy to make progress (ver. 2.6 at least). HotPy is interesting as well, as it promises to solve the GIL problem once and forever. And by now I think that Cython may be my choice (but it is not the same as regular Python). Unladen Swallow let down my hopes.

Comment by lost.goblin, Nov 23, 2010

mich.mierzwa: A port to Windows has been part of the main Go distribution for months now, and also there is the new erGo distribution specially for windows: http://ergolang.com/

As for Go's performance, there has been some optimizations over the last year, and it matches Java in most benchmarks (and is better and worse in others) while using considerably less memory and having a tiny fraction of the startup time.

Still, obviously there is plenty of room for improvement in Go's performance which so far has not been a focus in its development.

Also there are plenty of Go libraries already (and the std lib is quite comprehensive), see: http://go-lang.cat-v.org/pure-go-libs and http://go-lang.cat-v.org/library-bindings

Comment by mich.mierzwa@gmail.com, Nov 24, 2010

lost.goblin:

The Go Windows port works under MinGW, which is not as useful, and it still does not have all the features. erGo is not free, and as a hobbyist who writes programs mostly for myself, I cannot afford to pay for it.

When I wrote about performance, I based that on the alioth shootout, where Go is about 3 times slower (and it is not faster in any test).

It seems that you have used both Python and Go. I would like to ask how, in your opinion, they compare as regards speed and ease of development. Among the few languages I know (C, Java, C#, VB, JS), Python is my favorite. If only it could be faster... :(

Comment by ahmetnov...@gmail.com, Jul 4, 2011

Still alive? :)

Comment by lost.goblin, Oct 14, 2011

mich.mierzwa:

The Go Windows port is pretty much complete, and you can download pre-packaged binaries here: http://code.google.com/p/gomingw/downloads/list

Go is already faster than Java in many benchmarks, even if it has not been optimized much, and it uses many times less memory than Java, see: http://shootout.alioth.debian.org/u64/benchmark.php?test=all&lang=go&lang2=java

As for comparing Go and Python, Go is not only much faster, it is also much simpler; its type system is clean, elegant, and free of black magic like metaclasses. Python code tends to be too smart for its own good; Go code just does what it says and says what it does. It is easy to write and easy to read.

Comment by mich.mierzwa@gmail.com, Oct 15, 2011

@lost.goblin Thank you for the information. I have seen that they plan to release the first non-beta version, and I am happy to see that Windows is one of the platforms it will run on. I will go to alioth in a minute, but according to what you said they have improved it a lot. That's really good. I can hardly believe it can be simpler, probably because I am used to Python and know Go only from learning materials (the last interactive tour they released was also fantastic). But it's good to hear that for you it is more convenient; it means it is at least as good as Python's approach. I still dream about a language which is as clean, easy and flexible as Python and which would be able to replace C++ and Java. Do you think Go is capable of that? If I find the time and there is a working Windows version along with a comfortable environment (Eclipse), I will gladly give it a chance. By the way, I have not used it yet, but the guys working on PyPy have made significant progress too and it looks very promising. Thanks again, Michal
