Contents:
Useful Tools and Links
Here i will describe the most usable tools that needed in-order to optimize your python code.
- It is always helpful to know your code, if you not sure what happens under the hood, you should look for the python source code. I am currently using Python 2.7 (cPython).
- Disassemble your code, the dis module is the Python disassembler. An example use of this tool is given in my Run-time resolution example.
- Profile your code. If you are interested in performance you should be familiar with timeit module. I use it frequently in my examples. Additionally you may want to check different existing profilers, like cProfiler.
Boost performance by embedding C code
There are many methods for embedding external C code (e.g. SWIG - C/C++ module wrapping into Python, PyInline - Inline code from other languages directly into Python, etc..). I will present "PyInline" as my favorite method to use C directly in Python.
Requirements:
Requirements:
Next, i demonstrate a very simple function which computes the sum of all character's ASCII values in a string (without checking overflow - "double" is 8 bytes long).
I will also provide a python version for doing the same task. I'll finish with a small benchmark to emphasize the performance impact of embedding c code.
I will also provide a python version for doing the same task. I'll finish with a small benchmark to emphasize the performance impact of embedding c code.
import PyInline,__main__ m = PyInline.build(code=''' #includedouble sum_of_chars(char *s, int m) { int i=0; double sum_c = s[0]; for(i = 1; i < m; i++) sum_c += s[i]; return sum_c; } ' '', targetmodule=__main__, language="C")
Testing and using the C version together with additional interesting python versions.
import array text = "a"*50 lentext = len(text) print m.sum_of_chars(text,lentext) #4850.0 print sum(ord(ch) for ch in text) #4850 print sum(array.array("B", text)) #4850 print sum(map(ord, text )) #4850 print sum(bytearray(text)) #4850
Benchmark Testing:
import timeit prepare = """ import array import PyInline,__main__ m = PyInline.build(code=''' #include <string.h> double sum_of_chars(char *s, int m) { int i=0; double sum_c = s[0]; for(i = 1; i < m; i++) sum_c += s[i]; return sum_c; } ''', targetmodule=__main__, language="C") text = "a"*50 lentext = len(text) """ sum_c_code = """ x = m.sum_of_chars(text,lentext) """ sum_python_code = """ x = sum(ord(ch) for ch in text) """ sum_array_python_code = """ x = sum(array.array("B", text)) """ sum_map_python_code = """ x = sum(map(ord, text )) """ sum_bytearray_python_code = """ x = sum(bytearray(text)) """ benchmark = timeit.Timer(sum_python_code, prepare) print "Sum of chars Python code", benchmark.timeit(1000), "seconds" benchmark = timeit.Timer(sum_map_python_code, prepare) print "Sum of chars using map Python code", benchmark.timeit(1000), "seconds" benchmark = timeit.Timer(sum_array_python_code, prepare) print "Sum of chars using array Python code", benchmark.timeit(1000), "seconds" benchmark = timeit.Timer(sum_bytearray_python_code, prepare) print "Sum of chars using bytearray Python code", benchmark.timeit(1000), "seconds" benchmark = timeit.Timer(sum_c_code, prepare) print "Sum of chars - C code", benchmark.timeit(1000), "seconds"
Results:
It is clear that a C code version is faster by a factor of more then 3.
It is clear that a C code version is faster by a factor of more then 3.
Sum of chars Python code 0.00469453051701 seconds Sum of chars using map Python code 0.00384867686766 seconds Sum of chars using array Python code 0.000912136417634 seconds Sum of chars using bytearray Python code 0.00079881426542 seconds Sum of chars - C code 0.000260854765474 seconds
Rabin-Karp Sub-String Search
Most of python users that look for a way to search a sub-string in text, are using "in" operator that invokes the __contains__ built-in function (find(sub-string,string) uses it also)
print 'hello' in text
If you check further in the source code of cPython, a c function named fastsearch is executed. This function is inspired by Boyer-Moore algorithm.
I will demonstrate a search based on Rabin-Karp algorithm that i have implemented in python.
For those of us familiar with Rabin-Karp algorithm, must know that a proper rolling hash-function need to be selected. I have selected a simple XOR function to demonstrate how the algorithm works.
I will demonstrate a search based on Rabin-Karp algorithm that i have implemented in python.
For those of us familiar with Rabin-Karp algorithm, must know that a proper rolling hash-function need to be selected. I have selected a simple XOR function to demonstrate how the algorithm works.
def hashLongString(iterable,step): len_iter = len(iterable)-step """calculate initial hash for first prefix of text, using xor rolling function""" local_hash256 = reduce(int.__xor__, map(ord, iterable[:step])) yield 0,local_hash256 for i in xrange(len_iter): local_hash256 = int.__xor__(int.__xor__(local_hash256,ord(iterable[i])),ord(iterable[i+step])) yield i+1,local_hash256 def RabinKarp(keyword,text): """calculate initial hash for keyword, using xor rolling function""" keyword_len = len(keyword) hash256 = reduce(int.__xor__, map(ord, keyword)) """using a generator to iterate through the text the generator will return a hashed value of the next substring in text""" for i,h in hashLongString(text,keyword_len): if h==hash256: """if hash values are equal, compare substrings character by character""" for o_ind,ind in enumerate(xrange(i,i+keyword_len)): if text[ind] != keyword[o_ind]: break else: return True return False
Python Caching & Memoization Decorators
"A Python decorator is a specific change to the Python syntax that allows us to more conveniently alter functions and methods (and possibly classes in a future version). This supports more readable applications of the DecoratorPattern but also other uses as well." cited from here.
I will use the decorator functionality to demonstrate a way to improve the performance and memory usage of some popular recursive functions.
I will use the decorator functionality to demonstrate a way to improve the performance and memory usage of some popular recursive functions.
Fibonacci
Fibonacci series described here, is a series of numbers that defined as such:
f(n) = f(n-1) + f(n-2)
The following python code is a naive implementation of a recursive fibonacci function (i will not describe the iterative approach since my main goal is to present how we may improve the recursive implementation by applying python decorator on it):
f(n) = f(n-1) + f(n-2)
The following python code is a naive implementation of a recursive fibonacci function (i will not describe the iterative approach since my main goal is to present how we may improve the recursive implementation by applying python decorator on it):
def naive_fib(n): if n in (0, 1): return n return naive_fib(n-1) + naive_fib(n-2)
Now, lets add a memoization. The goal of memoization is to remember the results of already computed f(x), and removing the unnecessary stack calls for the same computation of f(x) by returning the saved result from the first computation.
e.g.
f(n) = f(n-1)+f(n-2) = f(n-2) + f(n-3) + f(n-3) + f(n-4) = ... notice that f(n-3) will be computed twice without the memoization optimization.
e.g.
f(n) = f(n-1)+f(n-2) = f(n-2) + f(n-3) + f(n-3) + f(n-4) = ... notice that f(n-3) will be computed twice without the memoization optimization.
memo = {0:0, 1:1}
def fib(n):
if not n in memo:
memo[n] = fib(n-1) + fib(n-2)
return memo[n]We improved the function by adding the memo dictionary and altering the fibonacci function so that it will use this dictionary.
But what if the function will be more complicated (e.g. Subset Sum), and we wouldn't like to change the function it self???
Here is the answer, lets use a decorator "@":
But what if the function will be more complicated (e.g. Subset Sum), and we wouldn't like to change the function it self???
Here is the answer, lets use a decorator "@":
import functools # note that this decorator ignores **kwargs def memoize(obj): cache = obj.cache = {} @functools.wraps(obj) def memoizer(*args, **kwargs): if args not in cache: cache[args] = obj(*args, **kwargs) return cache[args] return memoizer @memoize def fib(n): if n in (0, 1): return n return fib(n-1) + fib(n-2)
Lets see some benchmark time results for fib(20), measuring time after 1000 runs and clearing the cache for each run:
Naive implementation 3.89636282931 seconds Memoization Decorator 0.03559726761 seconds
Important: In many cases Memoization will boost performance , but you have to take into account that memory is bounded , therefore in most of the cases you have to use a cache memory management policy, which will best economize your memory usage and still provide you good performance.
Many different recipes for LRU and LFU cache decorator exists, you may find good recipes here.
Additionally, starting from python 3.2 lru_cache decorator was added to the functools module.
Many different recipes for LRU and LFU cache decorator exists, you may find good recipes here.
Additionally, starting from python 3.2 lru_cache decorator was added to the functools module.
Subset Sum
In computer science, the subset sum problem is an important problem in complexity theory and cryptography:
"given a set of integers and an integer s, does any non-empty subset sum to s".
We will generelize this problem to finding the number of all subsets that sum to s. The complexity of the next solution is O(2^n), later we will try to improve the performance by using memoization decorator with some tuning.
A trivial recursive solution for this problem which will return the number of such subsets that equal to the sum s:
"given a set of integers and an integer s, does any non-empty subset sum to s".
We will generelize this problem to finding the number of all subsets that sum to s. The complexity of the next solution is O(2^n), later we will try to improve the performance by using memoization decorator with some tuning.
A trivial recursive solution for this problem which will return the number of such subsets that equal to the sum s:
def subset_sum(arr, i, S): if i >= len(arr): return 1 if S == 0 else 0 return subset_sum(arr, i + 1, S) + subset_sum(arr, i + 1, S - arr[i]) arr = [1, -3,20, 3, 10,-2] print(subset_sum(arr, 0, 20)) #2
Subset problem solution with memoization (notice that we have slightly extended the memoization decorator by removing the reference to the arr-list parameter that shouldn't be searched in cache):
import functools # note that this decorator ignores **kwargs def memoize(obj): cache = obj.cache = {} @functools.wraps(obj) def memoizer(*args, **kwargs): nonlist_args = args[1:] #arguments without list parameter if nonlist_args not in cache: cache[nonlist_args] = obj(*args, **kwargs) return cache[nonlist_args] return memoizer @memoize def subset_sum(arr, i, S): if i >= len(arr): return 1 if S == 0 else 0 return subset_sum(arr, i + 1, S) + subset_sum(arr, i + 1, S - arr[i]) arr = [1, -3,20, 3, 10,-2] print(subset_sum(arr, 0, 20)) #2
Benchmark results for searching sum=0 in arr=[2,-1]*10, measuring time for 20 runs and clearing the cache for each run.
Naive 6.97739714839 seconds With Memoization 0.00817843834377 seconds
Functions / Variables Run-time resolution (handy technique for optimizing performance)
I will describe here a very handy method to boost performance of your python code.
Here is an example for not optimized code, can you find some of the problems with this code?
Here is an example for not optimized code, can you find some of the problems with this code?
s = list() for i in xrange(10000): s.append(i)
One of the problem is that additionally to calling append each time, "s" is loaded repeatedly and than the function attribute "append" is searched in "s" each time.
If only save the function pointer locally you will gain in performance by allowing to function load locally without loading "s".
Here is more optimized code version:
If only save the function pointer locally you will gain in performance by allowing to function load locally without loading "s".
Here is more optimized code version:
s = list() app = s.append for i in xrange(10000): app(i)
For a proof of concept i will also disassembly both function in order to demonstrate the difference (i advice to use disassembly from time to time when optimization is needed.
import dis
dis.dis(notOptimizaedFunction)
dis.dis(OptimizaedFunction)Not Optimized
0 LOAD_GLOBAL 0 (list)
3 CALL_FUNCTION 0
6 STORE_FAST 0 (s)
9 SETUP_LOOP 33 (to 45)
12 LOAD_GLOBAL 1 (xrange)
15 LOAD_CONST 1 (10)
18 CALL_FUNCTION 1
21 GET_ITER
22 FOR_ITER 19 (to 44)
25 STORE_FAST 1 (i)
28 LOAD_FAST 0 (s)
31 LOAD_ATTR 2 (append)
34 LOAD_FAST 1 (i)
37 CALL_FUNCTION 1
40 POP_TOP
41 JUMP_ABSOLUTE 22
44 POP_BLOCK
45 LOAD_CONST 0 (None)
48 RETURN_VALUE
Optimized
0 LOAD_GLOBAL 0 (list)
3 CALL_FUNCTION 0
6 STORE_FAST 0 (s)
9 LOAD_FAST 0 (s)
12 LOAD_ATTR 1 (append)
15 STORE_FAST 1 (app)
18 SETUP_LOOP 30 (to 51)
21 LOAD_GLOBAL 2 (xrange)
24 LOAD_CONST 1 (10)
27 CALL_FUNCTION 1
30 GET_ITER
31 FOR_ITER 16 (to 50)
34 STORE_FAST 2 (i)
37 LOAD_FAST 1 (app)
40 LOAD_FAST 2 (i)
43 CALL_FUNCTION 1
46 POP_TOP
47 JUMP_ABSOLUTE 31
50 POP_BLOCK
51 LOAD_CONST 0 (None)
54 RETURN_VALUE
The loop assembly code in the Non optimized version (22-41) executes LOAD_FAST (s) code line 28 and than LOAD_ATTR (append) code line 31, vs the optimized version loop (31-47) where instead of those 2 commands only executes LOAD_FAST (app) code line 37.
Here are benchmark results:
Here are benchmark results:
import timeit print "Non optimized %f" % \ timeit.Timer('for i in xrange(10000): s.append(i)', 's = list()').timeit(1000) print "Optimized %f" % \ timeit.Timer('for i in xrange(10000): app(i)', 's = list(); app = s.append').timeit(1000)
Non optimized 1.246844 Optimized 0.727130
As a summary, i have given simple example for faster function resolution, the same technique is efficient for variables and objects