Towards Video Captioning with Naming: a Novel Dataset and a Multi-Modal Approach