A Novel Attention-based Aggregation Function to Combine Vision and Language