diff options
author | Peter Hinch <peter@hinch.me.uk> | 2016-03-11 10:50:46 +0000 |
---|---|---|
committer | Damien George <damien.p.george@gmail.com> | 2016-03-11 16:33:36 +0000 |
commit | 21b74604f936ef98b69ede4d089d0e7eabd4a25a (patch) | |
tree | 3d7ae87b4e048d9ff635f8afe11be2e6e5778eee | |
parent | 85d3b6165a00add1e0df004fc167eb81fd579f80 (diff) | |
download | micropython-21b74604f936ef98b69ede4d089d0e7eabd4a25a.tar.gz micropython-21b74604f936ef98b69ede4d089d0e7eabd4a25a.zip |
docs: Add Python speed optimisation guide, including minimal viper ref.
-rw-r--r-- | docs/reference/index.rst | 1 | ||||
-rw-r--r-- | docs/reference/speed_python.rst | 318 |
2 files changed, 319 insertions, 0 deletions
diff --git a/docs/reference/index.rst b/docs/reference/index.rst index 8155c5e29b..ae8d6530af 100644 --- a/docs/reference/index.rst +++ b/docs/reference/index.rst @@ -14,6 +14,7 @@ MicroPython are described in the sections here. repl.rst isr_rules.rst + speed_python.rst .. only:: port_pyboard diff --git a/docs/reference/speed_python.rst b/docs/reference/speed_python.rst new file mode 100644 index 0000000000..4792a6fa97 --- /dev/null +++ b/docs/reference/speed_python.rst @@ -0,0 +1,318 @@ +Maximising Python Speed +======================= + +This tutorial describes ways of improving the performance of MicroPython code. +Optimisations involving other languages are covered elsewhere, namely the use +of modules written in C and the MicroPython inline ARM Thumb-2 assembler. + +The process of developing high performance code comprises the following stages +which should be performed in the order listed. + +* Design for speed. +* Code and debug. + +Optimisation steps: + +* Identify the slowest section of code. +* Improve the efficiency of the Python code. +* Use the native code emitter. +* Use the viper code emitter. + +Designing for speed +------------------- + +Performance issues should be considered at the outset. This involves taking a view +on the sections of code which are most performance critical and devoting particular +attention to their design. The process of optimisation begins when the code has +been tested: if the design is correct at the outset optimisation will be +straightforward and may actually be unnecessary. + +Algorithms +~~~~~~~~~~ + +The most important aspect of designing any routine for performance is ensuring that +the best algorithm is employed. This is a topic for textbooks rather than for a +MicroPython guide but spectacular performance gains can sometimes be achieved +by adopting algorithms known for their efficiency. + +RAM Allocation +~~~~~~~~~~~~~~ + +To design efficient MicroPython code it is necessary to have an understanding of the +way the interpreter allocates RAM. When an object is created or grows in size +(for example where an item is appended to a list) the necessary RAM is allocated +from a block known as the heap. This takes a significant amount of time; +further it will on occasion trigger a process known as garbage collection which +can take several milliseconds. + +Consequently the performance of a function or method can be improved if an object is created +once only and not permitted to grow in size. This implies that the object persists +for the duration of its use: typically it will be instantiated in a class constructor +and used in various methods. + +This is covered in further detail :ref:`Controlling garbage collection <gc>` below. + +Buffers +~~~~~~~ + +An example of the above is the common case where a buffer is required, such as one +used for communication with a device. A typical driver will create the buffer in the +constructor and use it in its I/O methods which will be called repeatedly. + +The MicroPython libraries typically provide optional support for pre-allocated buffers. +For example the ``uart.readinto()`` method allows two options for its argument, an integer +or a buffer. If an integer is supplied it will read up to that number of bytes and +return the outcome: this implies that a buffer is created with a corresponding +memory allocation. Providing a pre-allocated buffer as the argument avoids this. See +the code fragment in :ref:`Caching object references <Caching>` below. + +Floating Point +~~~~~~~~~~~~~~ + +For the most speed critical sections of code it is worth noting that performing +any kind of floating point operation involves heap allocation. Where possible use +integer operations and restrict the use of floating point to sections of the code +where performance is not paramount. + +Arrays +~~~~~~ + +Consider the use of the various types of array classes as an alternative to lists. +The ``array`` module supports various element types with 8-bit elements supported +by Python's built in ``bytes`` and ``bytearray`` classes. These data structures all store +elements in contiguous memory locations. Once again to avoid memory allocation in critical +code these should be pre-allocated and passed as arguments or as bound objects. + +When passing slices of objects such as ``bytearray`` instances, Python creates +a copy which involves allocation. This can be avoided using a ``memoryview`` +object: + +.. code:: python + + ba = bytearray(100) + func(ba[3:10]) # a copy is passed + mv = memoryview(ba) + func(mv[3:10]) # a pointer to memory is passed + +A ``memoryview`` can only be applied to objects supporting the buffer protocol - this +includes arrays but not lists. + +Identifying the slowest section of code +--------------------------------------- + +This is a process known as profiling and is covered in textbooks and +(for standard Python) supported by various software tools. For the type of +smaller embedded application likely to be running on MicroPython platforms +the slowest function or method can usually be established by judicious use +of the timing ``ticks`` group of functions documented +`here <http://docs.micropython.org/en/latest/pyboard/library/time.html>`_. +Code execution time can be measured in ms, us, or CPU cycles. + +The following enables any function or method to be timed by adding an +``@timed_function`` decorator: + +.. code:: python + + def timed_function(f, *args, **kwargs): + myname = str(f).split(' ')[1] + def new_func(*args, **kwargs): + t = time.ticks_us() + result = f(*args, **kwargs) + delta = time.ticks_diff(t, time.ticks_us()) + print('Function {} Time = {:6.3f}ms'.format(myname, delta/1000)) + return result + return new_func + +MicroPython code improvements +----------------------------- + +The const() declaration +~~~~~~~~~~~~~~~~~~~~~~~ + +MicroPython provides a ``const()`` declaration. This works in a similar way +to ``#define`` in C in that when the code is compiled to bytecode the compiler +substitutes the numeric value for the identifier. This avoids a dictionary +lookup at runtime. The argument to ``const()`` may be anything which, at +compile time, evaluates to an integer e.g. ``0x100`` or ``1 << 8``. + +.. _Caching: + +Caching object references +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Where a function or method repeatedly accesses objects performance is improved +by caching the object in a local variable: + +.. code:: python + + class foo(object): + def __init__(self): + ba = bytearray(100) + def bar(self, obj_display): + ba_ref = self.ba + fb = obj_display.framebuffer + # iterative code using these two objects + +This avoids the need repeatedly to look up ``self.ba`` and ``obj_display.framebuffer`` +in the body of the method ``bar()``. + +.. _gc: + +Controlling garbage collection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When memory allocation is required, MicroPython attempts to locate an adequately +sized block on the heap. This may fail, usually because the heap is cluttered +with objects which are no longer referenced by code. If a failure occurs, the +process known as garbage collection reclaims the memory used by these redundant +objects and the allocation is then tried again - a process which can take several +milliseconds. + +There are benefits in pre-empting this by periodically issuing ``gc.collect()``. +Firstly doing a collection before it is actually required is quicker - typically on the +order of 1ms if done frequently. Secondly you can determine the point in code +where this time is used rather than have a longer delay occur at random points, +possibly in a speed critical section. Finally performing collections regularly +can reduce fragmentation in the heap. Severe fragmentation can lead to +non-recoverable allocation failures. + +Accessing hardware directly +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This comes into the category of more advanced programming and involves some knowledge +of the target MCU. Consider the example of toggling an output pin on the Pyboard. The +standard approach would be to write + +.. code:: python + + mypin.value(mypin.value() ^ 1) # mypin was instantiated as an output pin + +This involves the overhead of two calls to the ``Pin`` instance's ``value()`` +method. This overhead can be eliminated by performing a read/write to the relevant bit +of the chip's GPIO port output data register (odr). To facilitate this the ``stm`` +module provides a set of constants providing the addresses of the relevant registers. +A fast toggle of pin ``P4`` (CPU pin ``A14``) - corresponding to the green LED - +can be performed as follows: + +.. code:: python + + BIT14 = const(1 << 14) + stm.mem16[stm.GPIOA + stm.GPIO_ODR] ^= BIT14 + +The Native code emitter +----------------------- + +This causes the MicroPython compiler to emit ARM native opcodes rather than +bytecode. It covers the bulk of the Python language so most functions will require +no adaptation (but see below). It is invoked by means of a function decorator: + +.. code:: python + + @micropython.native + def foo(self, arg): + buf = self.linebuf # Cached object + # code + +There are certain limitations in the current implementation of the native code emitter. + +* Context managers are not supported (the ``with`` statement). +* Generators are not supported. +* If ``raise`` is used an argument must be supplied. + +The trade-off for the improved performance (roughly twices as fast as bytecode) is an +increase in compiled code size. + +The Viper code emitter +---------------------- + +The optimisations discussed above involve standards-compliant Python code. The +Viper code emitter is not fully compliant. It supports special Viper native data types +in pursuit of performance. Integer processing is non-compliant because it uses machine +words: arithmetic on 32 bit hardware is performed modulo 2**32. + +Like the Native emitter Viper produces machine instructions but further optimisations +are performed, substantially increasing performance especially for integer arithmetic and +bit manipulations. It is invoked using a decorator: + +.. code:: python + + @micropython.viper + def foo(self, arg: int) -> int: + # code + +As the above fragment illustrates it is beneficial to use Python type hints to assist the Viper optimiser. +Type hints provide information on the data types of arguments and of the return value; these +are a standard Python language feature formally defined here `PEP0484 <https://www.python.org/dev/peps/pep-0484/>`_. +Viper supports its own set of types namely ``int``, ``uint`` (unsigned integer), ``ptr``, ``ptr8``, +``ptr16`` and ``ptr32``. The ``ptrX`` types are discussed below. Currently the ``uint`` type serves +a single purpose: as a type hint for a function return value. If such a function returns ``0xffffffff`` +Python will interpret the result as 2**32 -1 rather than as -1. + +In addition to the restrictions imposed by the native emitter the following constraints apply: + +* Functions may have up to four arguments. +* Default argument values are not permitted. +* Floating point may be used but is not optimised. + +Viper provides pointer types to assist the optimiser. These comprise + +* ``ptr`` Pointer to an object. +* ``ptr8`` Points to a byte. +* ``ptr16`` Points to a 16 bit half-word. +* ``ptr32`` Points to a 32 bit machine word. + +The concept of a pointer may be unfamiliar to Python programmers. It has similarities +to a Python ``memoryview`` object in that it provides direct access to data stored in memory. +Items are accessed using subscript notation, but slices are not supported: a pointer can return +a single item only. Its purpose is to provide fast random access to data stored in contiguous +memory locations - such as data stored in objects which support the buffer protocol, and +memory-mapped peripheral registers in a microcontroller. It should be noted that programming +using pointers is hazardous: bounds checking is not performed and the compiler does nothing to +prevent buffer overrun errors. + +Typical usage is to cache variables: + +.. code:: python + + @micropython.viper + def foo(self, arg: int) -> int: + buf = ptr8(self.linebuf) # self.linebuf is a bytearray or bytes object + for x in range(20, 30): + bar = buf[x] # Access a data item through the pointer + # code omitted + +In this instance the compiler "knows" that ``buf`` is the address of an array of bytes; +it can emit code to rapidly compute the address of ``buf[x]`` at runtime. Where casts are +used to convert objects to Viper native types these should be performed at the start of +the function rather than in critical timing loops as the cast operation can take several +microseconds. The rules for casting are as follows: + +* Casting operators are currently: ``int``, ``bool``, ``uint``, ``ptr``, ``ptr8``, ``ptr16`` and ``ptr32``. +* The result of a cast will be a native Viper variable. +* Arguments to a cast can be a Python object or a native Viper variable. +* If argument is a native Viper variable, then cast is a no-op (i.e. costs nothing at runtime) + that just changes the type (e.g. from ``uint`` to ``ptr8``) so that you can then store/load + using this pointer. +* If the argument is a Python object and the cast is ``int`` or ``uint``, then the Python object + must be of integral type and the value of that integral object is returned. +* The argument to a bool cast must be integral type (boolean or integer); when used as a return + type the viper function will return True or False objects. +* If the argument is a Python object and the cast is ``ptr``, ``ptr``, ``ptr16`` or ``ptr32``, + then the Python object must either have the buffer protocol with read-write capabilities + (in which case a pointer to the start of the buffer is returned) or it must be of integral + type (in which case the value of that integral object is returned). + +The following example illustrates the use of a ``ptr16`` cast to toggle pin X1 ``n`` times: + +.. code:: python + + BIT0 = const(1) + @micropython.viper + def toggle_n(n: int): + odr = ptr16(stm.GPIOA + stm.GPIO_ODR) + for _ in range(n): + odr[0] ^= BIT0 + +A detailed technical description of the three code emitters may be found +on Kickstarter here `Note 1 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/664832>`_ +and here `Note 2 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/665145>`_ |