2010-01-08

Emulating Stackless Python using greenlet

This blog post documents why and how to emulate Stackless Python using greenlet.

Stackless Python is an extended version of the Python language (and of its CPython reference implementation). Its new features include lightweight coroutines (called tasklets), message-passing communication primitives (called channels), manual and/or automatic coroutine scheduling, Python function calls that don't consume the C stack, and serialization of coroutines (so they can be reloaded in another process). Stackless Python could not be implemented as a Python extension module: the core of the CPython compiler and interpreter had to be patched.
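
In Stackless, tasklets are created with stackless.tasklet(func)(*args), channels with stackless.channel(), and stackless.run() drives the scheduler. Since Stackless itself may not be installed, here is a deliberately simplified, stdlib-only toy model of the tasklet + channel idea, built on generators (note one important difference: a real Stackless channel is an unbuffered rendezvous that also blocks the sender, which this toy version does not model):

```python
from collections import deque

class ToyChannel(object):
    """Buffered stand-in for stackless.channel() (illustration only)."""
    def __init__(self):
        self.items = deque()

    def send(self, value):
        self.items.append(value)

    def receive(self):
        # A sub-generator: yield (i.e. cooperatively block) until a
        # value is available, then return it to the caller.
        while not self.items:
            yield
        return self.items.popleft()

class ToyScheduler(object):
    """Round-robin driver, a stand-in for stackless.run()."""
    def __init__(self):
        self.runnable = deque()

    def tasklet(self, gen):
        self.runnable.append(gen)

    def run(self):
        while self.runnable:
            gen = self.runnable.popleft()
            try:
                next(gen)            # run until the tasklet yields
            except StopIteration:
                continue             # tasklet finished, drop it
            self.runnable.append(gen)

def producer(ch):
    for i in range(3):
        ch.send(i)
        yield                        # give other tasklets a turn

def consumer(ch, out):
    for _ in range(3):
        out.append((yield from ch.receive()))

received = []
ch = ToyChannel()
sched = ToyScheduler()
sched.tasklet(producer(ch))
sched.tasklet(consumer(ch, received))
sched.run()
# received is now [0, 1, 2]
```

The real API is richer (tasklet.insert, tasklet.remove, channel preference, etc.), but the shape of the program — tasklets that block on channel operations while a scheduler rotates the runnable ones — is the same.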

greenlet is an extension module for CPython providing coroutines and low-level (explicit) scheduling. The most important advantage of greenlet over Stackless Python is that it could be implemented as a Python extension module, so the whole Python interpreter doesn't have to be recompiled in order to use it. Disadvantages of greenlet include speed (Stackless Python can be 10%, 35% or 900% faster, depending on the workload); possible memory leaks if coroutines hold references to each other; and that the functionality it provides is low-level (i.e. only manual coroutine scheduling, and no message passing).
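
With greenlet there is no scheduler at all: each coroutine is a greenlet object, and control moves only when code explicitly calls some_greenlet.switch(). Since the extension module may not be installed, the following stdlib-only sketch mimics that manual, target-directed switching with plain generators, where each coroutine names which one should run next (the names and the trampoline are my own; real greenlet code would construct greenlet.greenlet(run) objects and call .switch() on them):

```python
def coroutine_a(log):
    log.append('a1')
    yield 'b'          # explicitly hand control to coroutine b
    log.append('a2')
    yield None         # nothing left to run

def coroutine_b(log):
    log.append('b1')
    yield 'a'          # explicitly hand control back to coroutine a

def run(coros, start):
    """Trampoline: run whichever coroutine the current one names next.

    This is the essence of manual scheduling: nobody picks the next
    coroutine for you; every switch names its target.
    """
    current = start
    while current is not None:
        try:
            current = next(coros[current])
        except StopIteration:
            current = None

log = []
coros = {'a': coroutine_a(log), 'b': coroutine_b(log)}
run(coros, 'a')
# log is now ['a1', 'b1', 'a2']
```

Emulating the Stackless API on top of this means writing the scheduler and the channels yourself — which is exactly what greenstackless (below) does.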

greenstackless, the Python module I've recently developed, provides most of the (high-level) Stackless Python API on top of greenlet, so it eliminates greenlet's disadvantage of being low-level. See the source code and some tests (the latter with tricky corner cases). Please note that although greenstackless is somewhat optimized, it can be much slower than Stackless Python, and it doesn't fix the memory leaks either. Using greenstackless in production environments is thus not recommended; but it can serve as a temporary, drop-in replacement for Stackless Python when replacing the Python interpreter is not feasible.
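
Because greenstackless mirrors the stackless module's API, a program can fall back on the emulation when the real Stackless interpreter is unavailable. A minimal sketch of such a guarded import (the final attribute check is just an illustration of using the common API subset):

```python
try:
    import stackless                        # real Stackless Python
except ImportError:
    try:
        import greenstackless as stackless  # greenlet-based emulation
    except ImportError:
        stackless = None                    # neither is available

if stackless is not None:
    # The rest of the program sticks to the common subset of the API:
    # stackless.tasklet, stackless.channel, stackless.run, ...
    pass
```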

Some other software that emulates Stackless using greenlet:

  • part of Concurrence: doesn't support stackless.main, tasklet.next, tasklet.prev, tasklet.insert, tasklet.remove or stackless.schedule_remove, and doesn't send exceptions properly. (Because these features are missing, it doesn't pass the unit test above.)
  • part of PyPy: doesn't support stackless.main, tasklet.next or tasklet.prev, and doesn't pass the unit test above.

For the other way round (emulating greenlet using Stackless), see greenlet_fix (source code). Although Stackless is a bit faster than greenlet, the emulation in greenlet_fix makes it about 20% slower than native greenlet.

My original purpose for this emulation was to use gevent with Stackless and see if it becomes faster (than the default greenlet). It turned out to become slower. Then I benchmarked Concurrence (with libevent and Stackless by default), pyevent with Stackless, pyevent with greenlet, gevent (with libevent and greenlet by default), Syncless (with epoll and Stackless by default), eventlet and Tornado (using callbacks), and found out the following:

  • Syncless was the fastest in my nonscientific benchmark, which was surprising since its event loop was a clumsy pure Python implementation.
  • Wrapping libevent's struct evbuffer in Pyrex for line buffering is slower than manual buffering.
  • Using input buffering (for readline()) is much slower than manual buffering (reading until '\n\r\n' and splitting the input on '\n').
  • Providing a proper WSGI environment is much slower than ad-hoc parsing of the first line of the HTTP request.
  • Wrapping both libevent's struct evbuffer and the coroutine switching in Pyrex, however, made it about 5% faster than manual buffering.
  • Syncless does too much unnecessary communication (using Stackless channels) between a worker and the main loop. This could be simplified and made faster using stackless.schedule_remove(). So the current Syncless implementation is a dead end; it should be rewritten from scratch.
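
The manual-buffering approach from the notes above can be sketched as follows: accumulate raw chunks and scan for the header terminator yourself, instead of paying for a readline() call per header line (the function and variable names are my own; the b'\n\r\n' terminator is the one quoted above):

```python
def read_http_head(recv, bufsize=4096):
    """Read from recv() until the end of the HTTP head is seen.

    Manual buffering: append raw chunks to one buffer and scan for
    the terminator, rather than doing buffered per-line reads.
    Returns (list_of_header_lines, leftover_body_bytes).
    """
    buf = b''
    while True:
        i = buf.find(b'\n\r\n')     # end-of-head terminator
        if i >= 0:
            head, rest = buf[:i], buf[i + 3:]
            return [line.rstrip(b'\r') for line in head.split(b'\n')], rest
        chunk = recv(bufsize)
        if not chunk:
            raise EOFError('connection closed before end of headers')
        buf += chunk

# Usage with a fake recv() standing in for a socket:
chunks = iter([b'GET / HTTP/1.0\r\nHost: x\r', b'\n\r\nbody'])
recv = lambda n: next(chunks, b'')
lines, rest = read_http_head(recv)
# lines == [b'GET / HTTP/1.0', b'Host: x'], rest == b'body'
```

The terminator is re-scanned over the whole buffer on each iteration, which is fine for short HTTP heads; a production version would remember the scan offset.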

My conclusion was that in order to get the fastest coroutine-based, non-blocking, line-buffering-capable I/O library for Python, I should wrap libevent (including event registration and I/O buffering) and Stackless and the WSGI server in hand-optimized Pyrex, manually checking the .c file Pyrex generates for inefficiencies. I'll be doing so soon.

9 comments:

Denis said...

Did anyone do it the other way round? That is, implement the greenlet API on Stackless. That would be useful for running greenlet-based libraries, such as gevent, on Stackless.

pts said...

@Denis: I've just written a greenlet emulator using Stackless. gevent works. It seems to be about 20% slower than greenlet. Get it from here: http://code.google.com/p/syncless/source/browse/trunk/greenlet_fix.py

Denis said...

@Peter, Good job!

I've tried it and was able to run the examples on Stackless Python 2.6.4!

A few unit tests fail, but this really can become a good starting point for porting gevent to Stackless.

The code looks a bit complicated, which is reasonable, because the task is complicated. This gives me the idea that maybe the right way to go about running gevent on Stackless is not to emulate greenlet, but instead to re-implement some of the lower-level gevent modules using Stackless' channels.

I'm talking about gevent.hub and gevent.greenlet (and also wait_read()/wait_write() from gevent.socket). Once these two modules work on Stackless, the rest of gevent should too, since it doesn't use the greenlet API directly. Most of gevent internally uses gevent.hub.Waiter, which is like a poor man's channel. Plugging in the native channel here should be natural.
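
For example, a channel-backed Waiter could be as thin as this sketch (a hypothetical adapter: under Stackless the real stackless.channel() would be passed in; the one-shot switch()/throw()/get() names follow gevent.hub.Waiter):

```python
class ChannelWaiter(object):
    """One-shot rendezvous in the style of gevent.hub.Waiter,
    backed by any object with a channel's send()/receive() shape."""
    def __init__(self, channel):
        self._channel = channel   # e.g. stackless.channel()

    def switch(self, value):      # wake the waiter with a value
        self._channel.send((None, value))

    def throw(self, exc):         # wake the waiter with an exception
        self._channel.send((exc, None))

    def get(self):                # block until switch() or throw()
        exc, value = self._channel.receive()
        if exc is not None:
            raise exc
        return value

# A queue-backed test double, only to show the calls line up:
import queue

class QueueChannel(object):
    """Stand-in with send()/receive(), for illustration only."""
    def __init__(self):
        self._q = queue.Queue()

    def send(self, value):
        self._q.put(value)

    def receive(self):
        return self._q.get()

w = ChannelWaiter(QueueChannel())
w.switch(42)
# w.get() == 42
```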

What do you think?

pts said...

@Denis: Thanks for the insightful comments about how to make gevent faster with Stackless. It would be an interesting task for me to implement, but I don't have time for that till mid-February.

pts said...

@Denis: I've decided not to work on gevent + Stackless, because it would be too slow even then. I've extended the blog post, see at the end how to write a fast, coroutine-based, non-blocking I/O library for Python.

Denis said...

@Peter,

Did you know that gevent uses libevent's HTTP server for WSGI implementation?

Presumably, this does proper buffering under the hood, and it's done in C.

I know that gevent's sockets don't use libevent's evbuffers and bufferevents - that would be a cool feature to have - but gevent.wsgi does not care about that as it does not use the sockets implemented in Python.

pts said...

@Denis: Even though the WSGI server in gevent-0.11.2 used evhttp, in my tests it was about 10% slower than the pure Python WSGI server implementation I've written for Syncless.

Nevertheless, thanks for the tip on considering evhttp to speed up my new WSGI server.

Denis said...

Interesting! I wonder if it has something to do with the difference of hard switching vs. soft switching or there's something else to account for.

A related question I have is whether using an event loop coded in C, like libevent's, precludes soft switching. I've read somewhere on the Stackless mailing list that if there are C frames on the stack (which, in the case of a libevent-based event loop, is always the case) then only hard switching can be used. Can you confirm whether this is indeed the case?

pts said...

@Denis: Yes, I can confirm that if the event loop is not pure Python (i.e. a tasklet blocks in C code by calling PyStackless_Schedule or some channel communication function), Stackless will use hard switching for that tasklet. Syncless, as of now, has a pure Python event loop, but the new design I'm considering will have a C event loop with hard switching. At first I was concerned about the speed penalty of hard switching, but my preliminary measurements indicate that it is fast enough -- still faster than a pure Python event loop. Once I have the full implementation, I'll repeat the speed test. I expect the new design with hard switching to be faster than all of gevent, Concurrence, pyevent, eventlet and Syncless.

I was considering moving just the stackless.schedule_remove() call back to Python to get the advantage of soft switching, but that turned out to be slower.