This document discusses three main techniques for image captioning using deep neural networks: CNN-RNN based, CNN-CNN based, and reinforcement learning based frameworks. CNN-RNN based frameworks typically use a CNN as an image encoder to extract visual features, and an RNN like LSTM as a decoder to generate captions word-by-word. CNN-CNN based frameworks replace the RNN decoder with a CNN, allowing for faster training times. Reinforcement learning based frameworks formulate image captioning as a sequential decision making problem to optimize caption quality. The document also covers challenges faced and recent improvements in these areas.