Toothed whales frequently use echolocation clicks to communicate, coordinate, and forage. Because the timing between successive clicks is regular, hearing a dolphin's click in an audio recording implies that the next few hundred milliseconds have a higher-than-average probability of containing another click. Neural networks have also seen considerable success in speech and audio processing; state-of-the-art performance in human speech recognition has been achieved with the help of deep recurrent neural networks. This thesis presents a two-stage neural network architecture that exploits a dolphin click's contextual information. The first stage extracts potential clicks from an audio stream; its output is fed to the second-stage network, which uses the audio waveform of each potential click, together with contextual information about the preceding and following clicks, to refine the results. Comparing the two-stage network with the first-stage network alone shows that the presented architecture improves precision and recall. When applied to hours of data to automatically recognize dolphin clicks, the two-stage network also reduces the rate of false positives relative to the single-stage network.
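The two-stage idea can be sketched without the neural networks themselves: a first pass proposes candidate clicks from the raw signal, and a second pass uses inter-click timing context to accept or reject each candidate. The sketch below is a deliberately simplified stand-in, not the thesis architecture: stage one is an energy threshold rather than a learned detector, stage two applies a fixed plausible inter-click-interval window rather than a trained network, and all function names, frame sizes, and interval bounds are illustrative assumptions.

```python
import numpy as np

def stage_one_propose(signal, sr, frame=256, thresh=4.0):
    # Stage 1 (stand-in for the detection network): flag frames whose
    # energy exceeds `thresh` times the median frame energy and return
    # the start times of those frames as candidate clicks (in seconds).
    n = len(signal) // frame
    energies = np.array(
        [np.sum(signal[i * frame:(i + 1) * frame] ** 2) for i in range(n)]
    )
    med = np.median(energies) + 1e-12  # avoid division by zero on silence
    return [i * frame / sr for i in range(n) if energies[i] > thresh * med]

def stage_two_refine(candidates, min_ici=0.002, max_ici=0.5):
    # Stage 2 (stand-in for the refinement network): keep a candidate
    # only if its interval to the previous or next candidate falls in a
    # plausible inter-click range -- the kind of contextual cue the
    # second-stage network would learn from data.
    kept = []
    for i, t in enumerate(candidates):
        prev_ok = i > 0 and min_ici <= t - candidates[i - 1] <= max_ici
        next_ok = (i + 1 < len(candidates)
                   and min_ici <= candidates[i + 1] - t <= max_ici)
        if prev_ok or next_ok:
            kept.append(t)
    return kept
```

On a synthetic recording containing a regular click train plus one isolated noise burst, stage one proposes all high-energy events, while stage two discards the isolated burst because it has no neighbor at a plausible inter-click interval, illustrating how context can lower the false-positive rate.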