Debug like a Warrior. Part II


At NetBurner we do our best to deliver a turnkey experience that is easy, fast and as seamless as possible. Yet there are times when you need to sharpen your sword and drive into battle against hordes of… Bugs! This doesn’t happen in most simple out-of-the-box applications; however, one of the advantages of NetBurner is that we enable cavalier and experienced engineers to develop and tailor the embedded software to get customized functionality. This can sometimes introduce less predictable and difficult-to-eradicate bugs, gremlins, goblins or demons. In this sequel, we’ll pick up with a few more techniques that will further train and empower you to recognize and exterminate those digital foes like a true warrior. If you missed the first part of this article, you can catch up here.

“A warrior must only take care that his spirit is never broken.” - Chozan Shissai

The Network Died. Now What?

Having your network die is like a warhorse lying down just as you prepare to storm into battle, sword drawn. Let’s sort this dire situation out like a warrior. The very first step is to figure out what died. Here’s a handy block flow diagram to get you through this. The subsequent sections will address some of the less obvious blocks in the diagram.

Out of Buffers

The entire network stack works by passing around pre-allocated buffers. These are a limited resource, and misbehaving code can use them up. To check for this, add the following debugging code, which will report the free buffer count. It can be added in a number of different places, depending on the structure of the application. One option is to put it inside the infinite loop in UserMain() and use a timer so that it is called every so often.

#include 
// Exposes the function: WORD GetFreeCount();
// Then some place in your code...
iprintf("Free buffers = %d\r\n", (int)GetFreeCount());

Now run this and carefully watch the messages. If this number is decreasing every time the module reports it and the system stops responding when it goes to zero…. DING! This is your issue.
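Watching the console works, but the same check can be automated. Here is a minimal sketch, assuming you sample the free count at a fixed interval. `BufferLeakWatch` and its threshold are hypothetical names used for illustration, not part of the NetBurner API. It only flags a count that keeps falling sample after sample, which is the pattern described above.

```cpp
#include <cstdint>

// Hypothetical helper (not a NetBurner API): tracks successive free-buffer
// readings and flags a steady decline.
struct BufferLeakWatch {
    uint16_t lastCount = 0;    // previous reading
    int      dropsInARow = 0;  // consecutive readings lower than the one before
    bool     primed = false;   // false until the first reading is stored

    // Feed one reading. Returns true once the count has fallen on
    // `threshold` consecutive samples; any non-decreasing sample resets it.
    bool Update(uint16_t freeCount, int threshold) {
        if (primed && freeCount < lastCount) {
            ++dropsInARow;
        } else {
            dropsInARow = 0;
        }
        lastCount = freeCount;
        primed = true;
        return dropsInARow >= threshold;
    }
};
```

On the device you might call `watch.Update(GetFreeCount(), 5)` from the UserMain() loop and print a warning when it returns true. The value 5 is arbitrary; a busy but healthy system will see the count dip and recover, so pick a threshold that suits your traffic.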

Why buffers might go to zero

The primary cause of buffer loss (taking buffers from the available pool and failing to release them) is mishandled UDP sockets or UDP registered FIFOs. If you set up a UDP socket to receive, or if you use the UDP object class and register a FIFO to receive UDP packets, then you must read them. If you’re not reading them, unread UDP packets will accumulate in the UDP receive queue until you run out of buffers. Alternatively, you can exhaust all buffers by creating a UDP transmit loop that does not wait until the packet has completely gone out on the wire before creating another packet.

A secondary, less common cause of buffer loss is trying to do I/O within a user-written interrupt routine. There are several things that you may not do from within an interrupt routine. Calling any of the functions listed below inside an interrupt routine can get you into some misery:

  • All μC/OS critical section functions:
    • USER_ENTER_CRITICAL
    • USER_EXIT_CRITICAL
    • UCOS_ENTER_CRITICAL
    • UCOS_EXIT_CRITICAL
  • All μC/OS init and pend functions (all OSxxPendNoWait functions are okay):
    • OSxxInit
    • OSxxPend
    • OSCritEnter
    • OSChangePrio
    • OSTaskDelete
    • OSLock
    • OSUnlock
    • OSTaskCreate
    • OSTimeDly
  • I/O Functions:
    • write
    • writeall
    • read
    • printf
    • fprintf
    • iprintf
    • scanf
    • gets
    • puts
  • Memory management functions:
    • malloc
    • free
    • new
    • delete

As you can see from the list above, trying to use any sort of I/O functions can cause a lot of headaches. One of the possible failures from doing this is to interfere with the buffer handling queues, which will prevent used buffers from getting released.

From time to time, we have also found internal NetBurner bugs that lost packets on strange corner-cases. And because of our upcoming beta releases of the new WiFi and SSL drivers, it’s possible you found a path to lose buffers we did not find. If this is the case, thank you! Also, please report it to support@netburner.com.

Are the Network Address and Mask Correct?

The system can get network addresses with DHCP or it can use static settings. A full network address consists of four things:

  1. IP Address
  2. Network Mask
  3. Gateway address (only needed if you are routing to addresses off the local network)
  4. DNS address (only needed if you are addressing things with names, e.g. www.NetBurner.com instead of 34.199.141.96)

The validity of these settings is covered in detail in the NetBurner Programmer’s Guide.

Here’s an abridged version: the bitwise AND of the IP address and the network mask must equal the bitwise AND of the gateway address and the network mask. In other words:

(ip & mask) == (gateway & mask) // This must be true

Otherwise, the gateway is not on the local network. DNS must have a valid value.
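To make the mask rule concrete, here is a small sketch that applies it to addresses packed into 32-bit values. `MakeIp` and `GatewayIsLocal` are hypothetical helper names, not NetBurner functions, and the gateway addresses in the usage note are made up for illustration.

```cpp
#include <cstdint>

// Packs four dotted-quad octets into a single 32-bit value,
// most significant octet first. Hypothetical helper for illustration.
constexpr uint32_t MakeIp(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (uint32_t(a) << 24) | (uint32_t(b) << 16) | (uint32_t(c) << 8) | uint32_t(d);
}

// True when the gateway falls on the same network as the IP address
// under the given mask: (ip & mask) == (gateway & mask).
constexpr bool GatewayIsLocal(uint32_t ip, uint32_t mask, uint32_t gateway) {
    return (ip & mask) == (gateway & mask);
}
```

With an address of 10.1.1.170 and a mask of 255.255.255.0, a gateway of 10.1.1.1 passes the check while 10.1.2.1 fails it. Widen the mask to 255.255.0.0 and 10.1.2.1 passes as well.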

Look at Tasks

The previous article in this series gave you a way to look at what tasks are doing using the TaskScan utility. Unfortunately, if you have a user-created high-priority task that never yields or blocks, and so keeps the network from running, that method is not going to work. If you end up in this block of the flow diagram, you’ll want to create a dump of the TaskScan or textual task dump results, open a NetBurner support ticket with the details of what’s happening, and we will be happy to help you figure out what’s going on.

The Webserver Dies

Occasionally, the webserver running on your NetBurner may hit a bump in the road and get locked up. While it’s not quite as troubling as having your warhorse lie down on you right before a fight, it is a lot like showing up to the battlefield and realizing you left your sword with your other suit of armor. The chart below walks you through debugging the most common pitfalls encountered when this beast rears its ugly head.

Look at the TCP socket loss reports.

Add the following code to your project:

#include <../system/tcp_internal.h>
// Then call
socket_struct::ShowSocketInfo(false);

This is going to dump some information about each open socket. It will look something like this:

Socket[32]:
  State: ESTABLISED
  gpFlags: 0x80
  myIP: 10.1.1.170 - theirIP: 10.1.1.159
  myPort: 80 - theirPort: 56777
  max_listen: 0 - cur_listen: 0
  pNextActive: 0x0 - pPrevActive: 0x0
  NextTimeAction: 4361
  theirWindow: 3532485379 - myWindow: 3517094507
  TxBuff Depth: 0 - RxBuff Depth: 0
Socket[36]:
  State: LISTEN

The most interesting value is the “State”. The NetBurner system by default has 32 sockets. If all 32 sockets are listed in this report, you are out of TCP sockets and need to figure out why. Again, if you get to this point and nothing is obvious, contact NetBurner support with this data and we will help you dig deeper.

Your outbound TCP connection dies

Ok, we’re running out of medieval warfare metaphors here! Needless to say, having your TCP connection fail on you can be extremely frustrating. Not only will it impact your application, but several of the NetBurner tools rely on TCP network communication, and being stripped of those can often make you feel vulnerable and alone… kind of like showing up to battle with no horse, armor or sword (looks like we found one after all). The chart below takes you through the steps needed to address the most common causes for TCP connection failure to help you pinpoint exactly what’s going on.

Notes About TCP Keepalive

By design, a TCP connection does not send anything over the wire if there is nothing to be sent. This means that if you have a TCP connection to a remote device and the power goes off, the cable gets unplugged, or the building is overrun by the Night King and the army of the dead, the TCP connection has no way of knowing the other side is dead unless it tries to send data.

There are two ways to see if a socket is dead:

  1. Send it some data.
  2. Send it a keepalive request: The details of the keepalive are spelled out and working code is given as an example in nburn/examples/standardstack/tcp/TCP_simple_keepalive. Here’s a quick code snippet…
#include 
// Find out the last time we received any kind of packet from the other side of the connection.
DWORD TcpGetLastRxTime(int fd); // Returns the number of TimeTicks since the last RX
// Send a keepalive; the other side should respond.
void TcpSendKeepAlive(int fd);
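One way to combine the two calls is to note the receive time, send a keepalive, wait briefly, and see whether the receive time moved. The sketch below shows only that logic, with the platform calls injected as callables so it can be tried off-target. `KeepAliveProbe` is a hypothetical name, and the sketch assumes the value read from TcpGetLastRxTime changes only when a packet arrives; if it behaves differently on your platform, the comparison needs to change. The shipped TCP_simple_keepalive example remains the authoritative reference.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical wrapper: the three callables stand in for the platform calls
// so the decision logic can be exercised without hardware.
struct KeepAliveProbe {
    std::function<uint32_t()> lastRxTime;     // e.g. wraps TcpGetLastRxTime(fd)
    std::function<void()>     sendKeepAlive;  // e.g. wraps TcpSendKeepAlive(fd)
    std::function<void()>     waitForReply;   // e.g. a short task delay

    // Returns true if the receive time changed after the probe, meaning
    // some packet arrived from the peer while we waited.
    // Assumption: the value only changes when a packet is received.
    bool PeerAnswered() const {
        const uint32_t before = lastRxTime();
        sendKeepAlive();
        waitForReply();
        return lastRxTime() != before;
    }
};
```

If `PeerAnswered()` returns false, the usual next step is to close the socket and reconnect. How long to wait for the reply is a judgment call that depends on your network's round-trip time.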

One could write many volumes on how to debug network issues. This article is necessarily brief.

Further Support

If you are having network issues, the best path forward is to work down these flow charts until you get confused or the results are unclear. Then open a support request with as much information as possible about what is wrong, including the steps you have taken to debug this yourself. As part of your support request, please include as a minimum:

  • What is connected to what.
  • Did it ever work?
  • How long it ran before it died.
  • And as much of the diagnostic flow as you have accomplished.

As always, if there is something you’d like us to expand upon or you have questions, please feel free to comment below or contact us directly. We hope this has hardened your skills as you fight the good fight. From one embedded warrior to another: Tallyho! And remember:

“Warriors do not win victories by beating their heads against walls, but by overtaking the walls. Warriors jump over walls; they don’t demolish them.” - The Teachings of Don Juan, Carlos Castaneda
