<?xml version="1.0" encoding="UTF-8"?>
<xml>
  <records>
    <record>
      <source-app name="Biblio" version="7.x">Drupal-Biblio</source-app>
      <ref-type>47</ref-type>
      <contributors>
        <authors>
          <author><style face="normal" font="default" size="100%">Sayed, Nawid</style></author>
          <author><style face="normal" font="default" size="100%">Brattoli, Biagio</style></author>
          <author><style face="normal" font="default" size="100%">Ommer, Björn</style></author>
        </authors>
      </contributors>
      <titles>
        <title><style face="normal" font="default" size="100%">Cross and Learn: Cross-Modal Self-Supervision</style></title>
        <secondary-title><style face="normal" font="default" size="100%">German Conference on Pattern Recognition (GCPR) (Oral)</style></secondary-title>
      </titles>
      <keywords>
        <keyword><style face="normal" font="default" size="100%">action recognition</style></keyword>
        <keyword><style face="normal" font="default" size="100%">cross-modal</style></keyword>
        <keyword><style face="normal" font="default" size="100%">image understanding</style></keyword>
        <keyword><style face="normal" font="default" size="100%">unsupervised learning</style></keyword>
      </keywords>
      <dates>
        <year><style face="normal" font="default" size="100%">2018</style></year>
      </dates>
      <urls>
        <web-urls>
          <url><style face="normal" font="default" size="100%">https://arxiv.org/abs/1811.03879v1</style></url>
        </web-urls>
      </urls>
      <pub-location><style face="normal" font="default" size="100%">Stuttgart, Germany</style></pub-location>
      <language><style face="normal" font="default" size="100%">eng</style></language>
      <abstract><style face="normal" font="default" size="100%">In this paper we present a self-supervised method to learn feature representations for different modalities. Based on the observation that cross-modal information carries high semantic meaning, we propose a method to effectively exploit this signal. For our method we utilize video data, since it is available on a large scale and provides easily accessible modalities in the form of RGB and optical flow. We demonstrate state-of-the-art performance on highly contested action recognition datasets in the context of self-supervised learning. We also show the transferability of our feature representations and conduct extensive ablation studies to validate our core contributions.</style></abstract>
    </record>
  </records>
</xml>