Mark & Click - Python Interactive Tutorial
  • Beginner
  • Advance
  • Blog

Contents:

  • Useful tools and links
  • Boost performance by embedding C code.
  • Rabin-Karp Sub-String Search.
  • Python Caching & Memoization Decorators
    • Fibonacci
    • Subset Sum
  • Functions / Variables Run-time resolution (Performance optimization)


Useful Tools and Links

Here i will describe the most usable tools that needed in-order to optimize your python code.
  • It is always helpful to know your code, if you not sure what happens under the hood, you should look for the python source code. I am currently using Python 2.7 (cPython).
  • Disassemble your code, the dis module is the Python disassembler. An example use of this tool is given in my Run-time resolution example.
  • Profile your code. If you are interested in performance you should be familiar with timeit module. I use it frequently in my examples. Additionally you may want to check different existing profilers, like cProfiler.

Boost performance by embedding C code

There are many methods for embedding external C code (e.g. SWIG - C/C++ module wrapping into Python, PyInline - Inline code from other languages directly into Python, etc..). I will present "PyInline" as my favorite method to use C directly in Python.

Requirements:
  • PyInline download.
Next, i demonstrate a very simple function which computes the sum of all character's ASCII values in a string (without checking overflow - "double" is 8 bytes long).
I will also provide a python version for doing the same task. I'll finish with a small benchmark to emphasize the performance impact of embedding c code.
import PyInline,__main__
m = PyInline.build(code='''
    #include 

    double sum_of_chars(char *s, int m)
    {
        int i=0;
        double sum_c = s[0];
        for(i = 1; i < m; i++)
            sum_c += s[i];
        return sum_c;
    }
    ''',
  targetmodule=__main__, language="C")
Testing and using the C version together with additional interesting python versions.
import array
text = "a"*50
lentext = len(text)

print m.sum_of_chars(text,lentext) #4850.0
print sum(ord(ch) for ch in text)  #4850
print sum(array.array("B", text))  #4850
print sum(map(ord, text ))         #4850
print sum(bytearray(text))         #4850
Benchmark Testing:
import timeit
prepare = """
import array
import PyInline,__main__
m = PyInline.build(code='''
    #include <string.h>

    double sum_of_chars(char *s, int m)
    {
        int i=0;
        double sum_c = s[0];
        for(i = 1; i < m; i++)
            sum_c += s[i];
        return sum_c;
    }
    ''',
  targetmodule=__main__, language="C")
  
text = "a"*50
lentext = len(text)
"""
sum_c_code = """
   x = m.sum_of_chars(text,lentext) 
"""
sum_python_code = """
   x = sum(ord(ch) for ch in text) 
"""
sum_array_python_code = """
   x = sum(array.array("B", text))
"""
sum_map_python_code = """
   x = sum(map(ord, text ))
"""
sum_bytearray_python_code = """
   x = sum(bytearray(text))
"""


benchmark = timeit.Timer(sum_python_code, prepare)
print "Sum of chars Python code", benchmark.timeit(1000), "seconds"
benchmark = timeit.Timer(sum_map_python_code, prepare)
print "Sum of chars using map Python code", benchmark.timeit(1000), "seconds"
benchmark = timeit.Timer(sum_array_python_code, prepare)
print "Sum of chars using array Python code", benchmark.timeit(1000), "seconds"
benchmark = timeit.Timer(sum_bytearray_python_code, prepare)
print "Sum of chars using bytearray Python code", benchmark.timeit(1000), "seconds"
benchmark = timeit.Timer(sum_c_code, prepare)
print "Sum of chars - C code", benchmark.timeit(1000), "seconds"
Results:
It is clear that a C code version is faster by a factor of more then 3.
Sum of chars Python code 0.00469453051701 seconds
Sum of chars using map Python code 0.00384867686766 seconds
Sum of chars using array Python code 0.000912136417634 seconds
Sum of chars using bytearray Python code 0.00079881426542 seconds
Sum of chars - C code 0.000260854765474 seconds

Rabin-Karp Sub-String Search

Most of python users that look for a way to search a sub-string in text, are using "in" operator that invokes the __contains__ built-in function (find(sub-string,string) uses it also)
print 'hello' in text
If you check further in the source code of cPython, a c function named fastsearch is executed. This function is inspired by Boyer-Moore algorithm.

I will demonstrate a search based on Rabin-Karp algorithm that i have implemented in python.

For those of us familiar with Rabin-Karp algorithm, must know that a proper rolling hash-function need to be selected. I have selected a simple XOR function to demonstrate how the algorithm works.
 
def hashLongString(iterable,step):
    len_iter = len(iterable)-step
    """calculate initial hash for first prefix of text, using xor rolling function"""
    local_hash256 = reduce(int.__xor__, map(ord, iterable[:step]))
    yield 0,local_hash256
    for i in xrange(len_iter):
      local_hash256 = int.__xor__(int.__xor__(local_hash256,ord(iterable[i])),ord(iterable[i+step]))
      yield  i+1,local_hash256

def RabinKarp(keyword,text):
    """calculate initial hash for keyword, using xor rolling function"""
    keyword_len = len(keyword)  
    hash256 = reduce(int.__xor__, map(ord, keyword))
    """using a generator to iterate through the text
       the generator will return a hashed value of the next
       substring in text"""
    for i,h in hashLongString(text,keyword_len):
        if h==hash256:
            """if hash values are equal, compare substrings
            character by character"""
            for o_ind,ind in enumerate(xrange(i,i+keyword_len)):
                if text[ind] != keyword[o_ind]:
                    break
            else:
                return True
    return False

Python Caching & Memoization Decorators

"A Python decorator is a specific change to the Python syntax that allows us to more conveniently alter functions and methods (and possibly classes in a future version). This supports more readable applications of the DecoratorPattern but also other uses as well." cited from here.

I will use the decorator functionality to demonstrate a way to improve the performance and memory usage of some popular recursive functions.

Fibonacci

Fibonacci series described here, is a series of numbers that defined as such:
f(n) = f(n-1) + f(n-2)

The following python code is a naive implementation of a recursive fibonacci function (i will not describe the iterative approach since my main goal is to present how we may improve the recursive implementation by applying python decorator on it):
def naive_fib(n):
    if n in (0, 1): return n
    return naive_fib(n-1) + naive_fib(n-2)
Now, lets add a memoization. The goal of memoization is to remember the results of already computed f(x), and removing the unnecessary stack calls for the same computation of f(x) by returning the saved result from the first computation.
e.g.
f(n) = f(n-1)+f(n-2) = f(n-2) + f(n-3)   +  f(n-3) + f(n-4) = ...    notice that f(n-3) will be computed twice without the memoization optimization.
memo = {0:0, 1:1}
def fib(n):
    if not n in memo:
        memo[n] = fib(n-1) + fib(n-2)
    return memo[n]
We improved the function by adding the memo dictionary and altering the fibonacci function so that it will use this dictionary.
But what if the function will be more complicated (e.g. Subset Sum), and we wouldn't like to change the function it self???
Here is the answer, lets use a decorator "@":
import functools
# note that this decorator ignores **kwargs
def memoize(obj):
    cache = obj.cache = {}

    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        if args not in cache:
            cache[args] = obj(*args, **kwargs)
        return cache[args]
    return memoizer

@memoize
def fib(n):
    if n in (0, 1): return n
    return fib(n-1) + fib(n-2)
Lets see some benchmark time results for fib(20), measuring time after 1000 runs and clearing the cache for each run:
Naive implementation 3.89636282931 seconds
Memoization Decorator 0.03559726761 seconds
Important: In many cases Memoization will boost performance , but you have to take into account that memory is bounded , therefore in most of the cases you have to use a cache memory management policy, which will best economize your memory usage and still provide you good performance.

Many different recipes for LRU and LFU cache decorator exists, you may find good recipes here.
Additionally, starting from python 3.2 lru_cache decorator was added to the functools module.

Subset Sum

In computer science, the subset sum problem is an important problem in complexity theory and cryptography:
"given a set of integers and an integer s, does any non-empty subset sum to s".

We will generelize this problem to finding the number of all subsets that sum to s. The complexity of the next solution is O(2^n), later we will try to improve the performance by using memoization decorator with some tuning.

A trivial recursive solution for this problem which will return the number of such subsets that equal to the sum s:
def subset_sum(arr, i, S):
    if i >= len(arr): 
        return 1 if S == 0 else 0
    return subset_sum(arr, i + 1, S) + subset_sum(arr, i + 1, S - arr[i])

arr = [1, -3,20, 3, 10,-2]
print(subset_sum(arr, 0, 20)) #2
Subset problem solution with memoization (notice that we have slightly extended the memoization decorator by removing the reference to the arr-list parameter that shouldn't be searched in cache):
import functools
# note that this decorator ignores **kwargs
def memoize(obj):
    cache = obj.cache = {}

    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        nonlist_args = args[1:]        #arguments without list parameter
        if nonlist_args not in cache:
            cache[nonlist_args] = obj(*args, **kwargs)
        return cache[nonlist_args]
    return memoizer

@memoize
def subset_sum(arr, i, S):
    if i >= len(arr): 
        return 1 if S == 0 else 0
    return subset_sum(arr, i + 1, S) + subset_sum(arr, i + 1, S - arr[i])

arr = [1, -3,20, 3, 10,-2]
print(subset_sum(arr, 0, 20)) #2
Benchmark results for searching sum=0 in arr=[2,-1]*10, measuring time for 20 runs and clearing the cache for each run.
Naive 6.97739714839 seconds
With Memoization 0.00817843834377 seconds

Functions / Variables Run-time resolution (handy technique for optimizing performance)

I will describe here a very handy method to boost performance of your python code.

Here is an example for not optimized code, can you find some of the problems with this code?
s = list()
for i in xrange(10000): 
    s.append(i)
One of the problem is that additionally to calling append each time, "s" is loaded repeatedly and than the function attribute "append" is searched in "s" each time. 
If only save the function pointer locally you will gain in performance by allowing to function load locally without loading "s".

Here is more optimized code version:
s = list()
app = s.append
for i in xrange(10000): 
    app(i)
For a proof of concept i will also disassembly both function in order to demonstrate the difference (i advice to use disassembly from time to time when optimization is needed.
import dis

dis.dis(notOptimizaedFunction)
dis.dis(OptimizaedFunction)

Not Optimized

             0 LOAD_GLOBAL              0 (list)
             3 CALL_FUNCTION            0
             6 STORE_FAST               0 (s)

             9 SETUP_LOOP              33 (to 45)
            12 LOAD_GLOBAL              1 (xrange)
            15 LOAD_CONST               1 (10)
            18 CALL_FUNCTION            1
            21 GET_ITER
            22 FOR_ITER                19 (to 44)
            25 STORE_FAST               1 (i)

            28 LOAD_FAST                0 (s)
            31 LOAD_ATTR                2 (append)
            34 LOAD_FAST                1 (i)
            37 CALL_FUNCTION            1
            40 POP_TOP
            41 JUMP_ABSOLUTE           22
            44 POP_BLOCK
            45 LOAD_CONST               0 (None)
            48 RETURN_VALUE

Optimized

             0 LOAD_GLOBAL              0 (list)
             3 CALL_FUNCTION            0
             6 STORE_FAST               0 (s)

             9 LOAD_FAST                0 (s)
            12 LOAD_ATTR                1 (append)
            15 STORE_FAST               1 (app)

            18 SETUP_LOOP              30 (to 51)
            21 LOAD_GLOBAL              2 (xrange)
            24 LOAD_CONST               1 (10)
            27 CALL_FUNCTION            1
            30 GET_ITER
            31 FOR_ITER                16 (to 50)
            34 STORE_FAST               2 (i)

            37 LOAD_FAST                1 (app)
            40 LOAD_FAST                2 (i)
            43 CALL_FUNCTION            1
            46 POP_TOP
            47 JUMP_ABSOLUTE           31
            50 POP_BLOCK
            51 LOAD_CONST               0 (None)
            54 RETURN_VALUE

The loop assembly code in the Non optimized version (22-41) executes LOAD_FAST (s) code line 28 and than LOAD_ATTR (append) code line 31, vs the optimized version loop (31-47) where instead of those 2 commands only executes LOAD_FAST (app) code line 37.

Here are benchmark results:
import timeit

print "Non optimized %f" % \
    timeit.Timer('for i in xrange(10000): s.append(i)', 's = list()').timeit(1000)

print "Optimized %f" % \
    timeit.Timer('for i in xrange(10000): app(i)', 's = list(); app = s.append').timeit(1000)

Non optimized 1.246844
Optimized 0.727130

As a summary, i have given simple example for faster function resolution, the same technique is efficient for variables and objects

Submit