Visual understanding is the abstraction of high-dimensional visual signals such as images and videos. This process involves many problems, ranging from depth prediction and vision-language correspondence to classification and object grounding. These include tasks defined along the spatial and temporal axes as well as tasks defined along a coarse-to-fine granularity, such as object grounding. In light of…