Would having surface normals simplify the depth estimation of an image? Do visual tasks have a relationship, or are they unrelated? Common sense suggests that visual tasks are interdependent, implying the existence of structure among tasks. However, a proper model is needed for the structure to be actionable, e.g., to reduce the supervision required by utilizing task relationships. We therefore ask: which tasks transfer to an arbitrary target task, and how well? Or, how do we learn a set of tasks collectively, with less total supervision?
These are some of the questions that can be answered by a computational model of the vision task space, as proposed in this paper. We explore the task structure using a sampled dictionary of 2D, 2.5D, 3D, and semantic tasks, and model their (first- and higher-order) transfer behaviors in a latent space. The product can be viewed as a computational taxonomy and a map of the task space. We study the consequences of this structure, e.g., the emerging task relationships, and exploit them to reduce supervision demand. For instance, we show that the total number of labeled datapoints needed to solve a set of 10 tasks can be reduced to 1/4 while keeping performance nearly the same by using features from multiple proxy tasks. Users can employ a provided Binary Integer Programming solver that leverages the taxonomy to find efficient supervision policies for their own use cases.
Process overview. The steps involved in creating the taxonomy.
How we trained each task-specific network. The data, architectures, and training regimen. (more soon)
The methodology behind transfers. What goes into the representations and how they are combined. (more soon)
Analyzing the transfer results and synthesizing them into the API. Methodology for the BIP solver. (more soon)
Zamir, Sax*, Shen*, Guibas, Malik, Savarese.
Taskonomy: Disentangling Task Transfer Learning.
In submission to CVPR, 2017.
Please cite the paper if you use the methods, models, database, or API.
The provided API uses our results to recommend a superior set of transfers. By using these transfers, we can get similar results to a fully supervised network trained on 4x the data.
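To make the idea of choosing transfers under a supervision budget concrete, here is a minimal sketch in Python. It is not the provided BIP solver or API: it brute-forces a toy version of the selection problem, and the function name `select_sources`, the task names, the scores, and the costs are all hypothetical values invented for illustration. It assumes each target task uses only its single best available source (first-order transfer), whereas the taxonomy also models higher-order transfers.

```python
from itertools import combinations


def select_sources(perf, cost, budget):
    """Pick the source tasks to supervise, by exhaustive search.

    perf[target][source]: hypothetical transfer quality (higher is better).
    cost[source]: hypothetical labeling cost of supervising that source.
    budget: maximum total labeling cost allowed.

    Returns (total_quality, chosen_sources, assignment), where
    assignment maps each target to the source it transfers from.
    """
    sources = sorted(cost)
    best = (float("-inf"), (), {})
    for k in range(1, len(sources) + 1):
        for subset in combinations(sources, k):
            # Skip source sets that exceed the supervision budget.
            if sum(cost[s] for s in subset) > budget:
                continue
            assignment = {}
            total = 0
            feasible = True
            for target, row in perf.items():
                candidates = [s for s in subset if s in row]
                if not candidates:
                    feasible = False
                    break
                # Each target transfers from its best selected source.
                chosen = max(candidates, key=lambda s: row[s])
                assignment[target] = chosen
                total += row[chosen]
            if feasible and total > best[0]:
                best = (total, subset, assignment)
    return best


# Toy, made-up numbers (not results from the paper).
perf = {
    "depth": {"normals": 90, "edges": 40, "segmentation": 50},
    "reshading": {"normals": 80, "edges": 30, "segmentation": 40},
    "object_class": {"normals": 30, "edges": 20, "segmentation": 85},
}
cost = {"normals": 2, "edges": 1, "segmentation": 2}

total, chosen, assignment = select_sources(perf, cost, budget=4)
print(total, chosen, assignment)
```

Exhaustive search is only workable for a handful of tasks, since the number of candidate source sets grows exponentially. Formulating the same choice as an integer program, with binary variables for selected sources and transfers, is what lets a solver scale beyond toy sizes.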
Example taxonomies. Generated from the API.
We have an extremely large and extremely high-quality dataset of varied indoor scenes.
Complete pixel-level geometric information via aligned meshes.
Globally consistent camera poses. Complete camera intrinsics.
3x as big as ImageNet.
* If you are interested in using the full dataset (12 TB), then please contact the authors.
We measure the effectiveness of our networks using two different metrics.
Taxonomy significance. The green line is our taxonomy, and the grey lines show the performance of a random feasible solution (error bars show standard deviation).