Thursday, August 30, 2018 7:53 am

Intro to Embedded Profiling and Performance Optimization

Written by David Linn

If you need to speed up computations on your NetBurner module but don’t know where to start, you’re in the right place. During my intern project developing a 1/8-scale autonomous vehicle, I hit a brick wall when computationally intensive tasks started to fail. Instead of accepting that I had reached the limit of the board’s performance, I buckled down and managed to decrease the runtime of the computationally intensive tasks by an order of magnitude. This article discusses the simplest things we can do to measure and improve performance. The provided code is designed to run on a NetBurner module, but the techniques discussed can be applied to any embedded development project.

Profiling and Prioritizing

Profiling and prioritizing should be the first step when embedded performance concerns arise. In a complex program, the things we think are using up the most computational power often aren’t. It’s a massive waste of effort to optimize a code block that takes nanoseconds when there’s code that’s taking milliseconds. Similarly, our effort is much better spent reducing execution time on code run 100 times a second vs. code run only once a second.

A good starting point is simply measuring the execution time of various sections of code. You can download a class that helps with methodically measuring execution time at my repo []. This profiler is easy to use and can help you get information on where and how to focus optimization efforts. However, it does not account for breaks in code execution from interrupts and switches to higher-priority tasks, so be mindful of that when interpreting results. For more advanced profiling, see the examples in Nburn\examples\StandardStack\Profiling.
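The class in the repo isn’t reproduced here, but the core idea can be sketched as a minimal timer built on std::chrono (the names below are my own, not the repo’s API); on embedded hardware you would typically read a high-resolution tick or cycle counter instead:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical minimal profiler (my own sketch, not the repo's class):
// measures the wall-clock execution time of a callable in microseconds.
// Like the repo's class, it does NOT subtract time spent in interrupts
// or in higher-priority tasks, so results for preempted code are inflated.
template <typename F>
long long measure_us(F&& block) {
    auto start = std::chrono::steady_clock::now();
    block();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(
               end - start).count();
}

// Example: time a busy loop and print the result.
inline void profile_demo() {
    volatile long sink = 0;
    long long us = measure_us([&] {
        for (int i = 0; i < 100000; ++i) sink += i;  // code under test
    });
    std::printf("block took %lld us\n", us);
}
```

Wrapping each suspect block in a call like this is usually enough to rank the blocks and decide where optimization effort should go.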

Below is the baseline for my autonomous vehicle project running on a NetBurner NANO 54415. Note that the percentages, which represent the time spent in a block of code divided by the total sampling time, add up to over 100%. This is because some blocks of code tested are nested in other blocks, and execution times are artificially inflated from not considering task switching and interrupts.






(Profiling results table; the per-block percentages could not be recovered. The blocks profiled were:)

  0: Steering
  1: Throttle
  2: Navigation Update
  3: Drive
  4: LCD Update
  5: IMU Update and Madgwick Filter
  6: Spinning LiDAR Read
  7: Side Lidar Read
  8: Madgwick Filter

The first thing I noticed was that “1: Throttle” and “6: Spinning LiDAR Read” didn’t run at all during the 15-second sampling. It appeared that “5: IMU Update and Madgwick Filter,” which calculates the vehicle’s orientation, was taking up so much processing power that other computationally intensive tasks failed. In the next part, I’ll discuss the various optimizations I performed to fix this issue.


Once we know which blocks of code we want to speed up, we should run experiments to see how much a hypothesized optimization actually decreases execution time. I did this by evaluating the improvement on a small block of test code before spending time changing a large block of project code. Remember that micro-optimization can lead us into a rabbit hole; we should constantly re-evaluate whether continuing to optimize is worth the programming effort. I compiled the results of the optimizations I made to my autonomous vehicle project below to give you a rough idea of what changes might be beneficial:
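This experiment-first approach can be sketched as a tiny benchmark harness (the harness and names here are my own illustration, not project code): time a baseline kernel and a candidate kernel under identical conditions before rewriting anything real.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical micro-benchmark harness: runs a small test kernel many
// times and returns its average execution time in microseconds, so a
// candidate optimization can be compared against a baseline before it
// is applied to real project code.
template <typename F>
double avg_us(F&& kernel, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) kernel();
    std::chrono::duration<double, std::micro> total =
        std::chrono::steady_clock::now() - start;
    return total.count() / iterations;
}

// Example experiment: float accumulate (baseline) vs. int accumulate
// (candidate). The volatile sinks keep the compiler from deleting the loops.
inline void compare_demo() {
    volatile float f_sink = 0.0f;
    volatile int   i_sink = 0;
    double base = avg_us([&] {
        for (int i = 0; i < 1000; ++i) f_sink = f_sink + 1.5f;
    }, 100);
    double cand = avg_us([&] {
        for (int i = 0; i < 1000; ++i) i_sink = i_sink + 1;
    }, 100);
    std::printf("baseline %.2f us vs. candidate %.2f us\n", base, cand);
}
```

Averaging over many iterations smooths out interrupt and task-switch noise, which matters far more on an embedded target than on a desktop.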



Optimization                              Max Pct. Change in Execution Time
Put ALL running tasks in SRAM             (value not recovered)
Declare global variables in SRAM*         (value not recovered)
Changing int16_t and int8_t to int**      (value not recovered)
Array of pre-calculated sin and cos       (value not recovered)

*ONLY for one block of code that used many variables; changes to other blocks were negligible or even saw an increase of up to 15%.
**This involved code that interfaced with a 16-bit and an 8-bit peripheral. Use the peripheral’s word size!

Although it is very specific to a single project, the data highlights three simple changes that could lead to significant performance improvements:

  1. Putting task stacks in SRAM (because we have on-die SRAM with single-cycle access, as opposed to the main DDR system memory)
  2. Declaring global variables in SRAM
  3. Pre-calculating commonly called functions

Here’s how to do these on a NetBurner module:

  1. To create tasks in SRAM, call OSSimpleTaskCreatewNameSRAM() instead of OSSimpleTaskCreatewName(). Because switching to and from the idle task is also significant, uncomment #define FAST_IDLE_STACK in constants.h, and rebuild the system libraries for the change to take effect.
  2. To declare user variables in SRAM, simply add the macro FAST_USER_VAR after the declaration (e.g. int foo FAST_USER_VAR;).
  3. To pre-calculate sin and cos, I created a global array of 4096 floats and populated it as the board booted using std::sin(), which took about half a second. I then wrote fastsin(), fastcos(), and fasttan() functions that take inputs in degrees, eliminating the repeated degree-to-radian conversions that had been occurring.

After all these changes, the performance of the autonomous vehicle looked much better, leaving plenty of idle time (sum of PctTotal < 100).
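The lookup-table scheme from step 3 can be sketched as follows. This is an illustrative reimplementation, not the project code: the 4096-entry size and the fastsin/fastcos names mirror mine, but this version does no interpolation, so it is accurate only to about a tenth of a degree.

```cpp
#include <cmath>

// Hypothetical sin lookup table: 4096 precomputed entries covering one
// full revolution. Filling it once at startup trades a little boot time
// and RAM for much cheaper trig calls at runtime. Inputs are in degrees
// to avoid repeated degree<->radian conversions.
constexpr int    kTableSize = 4096;
constexpr double kTwoPi     = 6.283185307179586;
static float g_sin_table[kTableSize];

void init_sin_table() {
    for (int i = 0; i < kTableSize; ++i) {
        g_sin_table[i] = static_cast<float>(std::sin(i * (kTwoPi / kTableSize)));
    }
}

float fastsin(float degrees) {
    // Map degrees onto a table index, wrapping into [0, kTableSize).
    int idx = static_cast<int>(degrees * (kTableSize / 360.0f)) % kTableSize;
    if (idx < 0) idx += kTableSize;
    return g_sin_table[idx];
}

float fastcos(float degrees) {
    return fastsin(degrees + 90.0f);  // cos(x) = sin(x + 90 deg)
}
```

A table of 4096 floats costs 16 KB, which fits comfortably alongside the task stacks moved into SRAM; a smaller table or linear interpolation between entries are the obvious knobs if that budget is too tight.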

One last note: floating-point adds and multiplies take about twice as long as integer adds and multiplies on the NANO. However, converting a value that should be represented as a float to an int to perform a few operations and then converting back to a float does not seem to be worth it because of type conversion time and decreased readability.

If you’re interested in the road performance of the NetBurner Autonomous Vehicle in addition to the computational performance, keep an eye out for some more articles in September! You can also check out the epic story from our 2017 Spark Fun Autonomous Vehicle Challenge on our blog.

