April 28, 2017

Roadmap

  • What is computer vision?
  • Image interpretation
  • Feature detection
  • Motion & tracking
  • Machine learning

Computer Vision

"Computer vision is the science of endowing computers or other machines with vision, or the ability to see."

  • Erik Learned-Miller, UMass

You've used CV before

  • Snapchat filters
  • Facebook tagging
  • Google Street View face blurring

Applications of CV

Computer Vision in Sports

What even is an image?

An image is a function \(f(x,y): \mathbb{R}^2 \rightarrow \mathbb{R}\), giving an intensity value. In practice the domain and range are bounded: \(f: [a,b] \times [c,d] \rightarrow [min, max]\), where \(a, b, c, d \in \mathbb{R}\).

Color images: \(f: \mathbb{R}^2 \rightarrow \mathbb{R}^3\), \(f(x,y) = [r(x,y),g(x,y),b(x,y)]\)

Images are made up of pixels, or picture elements.

Images as functions

"Computer vision is about taking these functions and computing something from them." (statistics!)

Smoothing = Blurring

Smoothing an image function is equivalent to blurring the image
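
A minimal sketch of this with OpenCV's Gaussian blur (the file names are hypothetical; assumes the opencv-python package):

```python
import cv2

# Load a grayscale image (hypothetical path).
img = cv2.imread("lena.png", cv2.IMREAD_GRAYSCALE)

# Convolve with a 2D Gaussian kernel; a bigger kernel/sigma smooths
# the image function more aggressively, i.e. blurs more.
blurred = cv2.GaussianBlur(img, (9, 9), 2.0)

cv2.imwrite("lena_blurred.png", blurred)
```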

Edge detection

  • An edge is a place of rapid change in the image intensity function.
  • Sometimes edges are all we need to convey what's going on in the image.

Edges are intense

How do we measure change in function values? Derivatives!

Intensity gradient

We have a 2D function, so we get a 2D gradient vector. The gradient points in the direction of most rapid increase in intensity.
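
Written out (standard notation), the gradient, its magnitude (edge strength), and its orientation are:

\[ \nabla f = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right], \qquad \|\nabla f\| = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2}, \qquad \theta = \tan^{-1}\!\left(\frac{\partial f/\partial y}{\partial f/\partial x}\right) \]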

Tiger example

Lena's edges

How do we take an edge image?

Step 1. Smooth derivatives to suppress noise and compute gradient.

Step 2. Threshold to find regions of "significant" gradient.

Step 3. "Thin" to get localized edge pixels.

Step 4. Link/connect edge pixels.
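
These four steps are essentially the Canny detector, which OpenCV exposes directly. A minimal sketch (file names hypothetical):

```python
import cv2

img = cv2.imread("lena.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path

# Canny bundles the steps above: Gaussian-smoothed gradients,
# non-maximum suppression ("thinning"), and hysteresis thresholding,
# which links weak edge pixels to strong ones.
edges = cv2.Canny(img, 100, 200)  # low/high hysteresis thresholds

cv2.imwrite("lena_edges.png", edges)
```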

Human Vision

Human Vision Takeaway

  • structure and depth are ambiguous from single views
  • we need some way to figure out how "far away" things are in the world
  • what if we took multiple pictures of the scene from different viewpoints?

What is an image (now)?

  • until now: a function, 2D pattern of intensity values
  • now: a 2D projection of 3D points
  • a camera is a device that records a projection of light onto some medium (film, a sensor, etc.)

Stereo Vision

Goal: use relationship between multiple views to recover depth

We need to consider

  1. corresponding image points
  2. information from camera pose

Disparity Map
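
For a rectified stereo pair there is a textbook relation between disparity and depth (\(f\) is the focal length, \(B\) the baseline between the cameras, \(d\) the disparity):

\[ Z = \frac{fB}{d} \]

so nearby points have large disparity and far-away points have small disparity.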

Correspondence Problem

We want to know which points correspond between two images. Let's assume that

  • most points in the scene are visible in both images
  • image regions for the match are similar in appearance

Epipolar Constraint

The geometry constrains where the corresponding pixel for a point in the first view can occur in the second view.

(Dense) Correspondence Search

Correspondence considerations

  • angle between image planes
  • uniqueness of correspondence
  • size of window used
  • occlusions
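
A minimal dense-correspondence sketch using OpenCV's block matcher, which searches along the (horizontal) epipolar line with a fixed-size window; file names are hypothetical and the images are assumed rectified:

```python
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified pair

# Window size trades robustness for detail (see considerations above).
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)  # fixed-point, scaled by 16

# Rescale for viewing and save the disparity map.
view = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("disparity.png", view.astype("uint8"))
```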

Camera calibration

We would like to know the relationship between coordinates in the image and coordinates in the real world.

This involves 2 types of transformations:

  • Extrinsic: from 3D world coordinates to camera coordinates
    • parameters: translation and rotation
  • Intrinsic: from camera coordinates to image coordinates
    • parameters: focal length, skew, aspect ratio, offsets (x and y)
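
Chaining the two gives the standard pinhole projection (textbook notation, not from the slides): a pixel \((u,v)\) comes from a world point \((X,Y,Z)\) via

\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \underbrace{\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{K \text{ (intrinsic)}} \underbrace{\begin{bmatrix} R & t \end{bmatrix}}_{\text{extrinsic}} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

where \(f_x, f_y\) encode focal length and aspect ratio, \(s\) is the skew, and \((c_x, c_y)\) are the offsets.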

Plane Transformations

Different transformations need different numbers of point correspondences to solve for.

Back to correspondence

Our goal is to find points in an image that can be

  • found in other images
  • found precisely - well localized
  • found reliably - well matched

Matching features
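
A minimal sketch with ORB features and brute-force matching in OpenCV (image paths hypothetical); the detector aims for exactly the "well localized" and "well matched" goals above:

```python
import cv2

img1 = cv2.imread("a.png", cv2.IMREAD_GRAYSCALE)  # hypothetical paths
img2 = cv2.imread("b.png", cv2.IMREAD_GRAYSCALE)

# Detect corner-like keypoints and compute binary descriptors.
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance; crossCheck keeps only
# mutual best matches, a cheap reliability filter.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "matches")
```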

Panoramas

Motion

A video is a sequence of frames captured over time (quickly).

Our image data is now a function of space (\(x,y\)) and time (\(t\)).

Motion in Computer Vision

  • background subtraction
  • shot boundary detection
  • motion segmentation
  • recognizing events and activities
  • improving video quality (motion stabilization)

Motion Estimation

Feature-based methods

  • extract visual features (corners) and track them over multiple frames
  • sparse motion fields, but more robust tracking

Direct, dense methods

  • directly recover image motion at each pixel from spatio-temporal image brightness variations
  • dense motion fields, but sensitive to appearance variations

Optical Flow

A motion field is the actual, physical motion of an object or surface. Optical flow is its apparent motion.

Lucas-Kanade Method

Basic idea: impose local constraints to get more equations for a pixel

  • pretend the pixel's neighbors have the same motion
  • closely related to the eigenvalue ratio used in Harris corner classification
  • can estimate motion really well in "corners"

Issue: motion is usually large (larger than a pixel)

Example

Gaussian Pyramids
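
Coarse-to-fine estimation over a Gaussian pyramid handles motions larger than a pixel: solve for flow at a coarse level, then refine at finer levels. OpenCV's pyramidal Lucas-Kanade does this; a minimal tracking sketch (frame paths hypothetical):

```python
import cv2

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)  # hypothetical
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Pick "corner" points, where Lucas-Kanade is best conditioned.
pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                              qualityLevel=0.01, minDistance=7)

# maxLevel=3: estimate on a coarse pyramid level first, then refine,
# so motions larger than the 15x15 window can still be recovered.
nxt, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, pts, None, winSize=(15, 15), maxLevel=3)

good = nxt[status.ravel() == 1]
print("tracked", len(good), "points")
```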

Tracking

Detection is locating an object independently in each frame.

Tracking uses dynamics to predict the new location of the object in the next frame, then updates that prediction based on measurements.

Tracking Challenges

  • in many places, it's hard to compute optical flow
  • there can be large displacements since objects could move rapidly
  • errors can compound, or drift
  • occlusions, disocclusions

Tracking as Inference

  • hidden state \(x\): the true parameters we care about
  • measurement \(Y\): a noisy observation of the underlying state
  • at each time step \(t\), the state changes from \(x_{t-1}\) to \(x_t\) and we get a new observation \(Y_t\)

Goal: recover distribution of state \(x_t\) given

  • all observations seen so far
  • knowledge about dynamics of state transitions

Tracking steps

  • prediction: what is the next state of the object given past measurements? \(P(x_t | y_0,...,y_{t-1})\)
  • correction: compute an updated estimate of the state from the prediction and measurements \(P(x_t | y_0,...,y_t)\)
  • tracking: the process of propagating this posterior distribution of state given measurements across time

We use probability distributions to allow for noise and missing frames

Kalman Filter

  • a method of tracking linear dynamical systems with Gaussian noise
  • both the predicted states and the corrected states are Gaussian
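
In one standard textbook form (the symbols \(A\), \(C\), \(Q\), \(R\) are not from the slides): with linear dynamics \(x_t = A x_{t-1} + w\), \(w \sim N(0, Q)\), and linear measurements \(y_t = C x_t + v\), \(v \sim N(0, R)\), the filter alternates

\[ \text{predict: } \mu_t^- = A \mu_{t-1}, \qquad \Sigma_t^- = A \Sigma_{t-1} A^\top + Q \]

\[ \text{correct: } K_t = \Sigma_t^- C^\top (C \Sigma_t^- C^\top + R)^{-1}, \qquad \mu_t = \mu_t^- + K_t (y_t - C \mu_t^-), \qquad \Sigma_t = (I - K_t C)\, \Sigma_t^- \]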

Bayes Filters

Given

  1. prior probability of the system state \(p(x)\)
  2. action (dynamical system) model: \(p(x_t|u_{t-1},x_{t-1})\)
  3. sensor model (likelihood) \(p(z|x)\)
    • if object was really at a place, what is the likelihood of my measurement?
  4. stream of observations \(z\) and action data \(u\): \(\{u_1, z_2, \ldots, u_{t-1}, z_t\}\)

We want to estimate the state \(x\) at time \(t\)
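
Combining these gives the standard recursive Bayes filter update:

\[ Bel(x_t) = \eta \, p(z_t \mid x_t) \int p(x_t \mid u_{t-1}, x_{t-1}) \, Bel(x_{t-1}) \, dx_{t-1} \]

where \(\eta\) is a normalizing constant: predict with the action model, then reweight by the sensor likelihood.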

Particle Filters

  • maintains a probability distribution over the state
  • represented as a set of weighted samples (particles)
  • higher weight placed where object is more likely to be
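
A toy 1D particle filter sketch (all dynamics and noise parameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
particles = rng.uniform(0, 100, N)  # prior over a 1D position
weights = np.ones(N) / N

def step(particles, weights, z, motion=1.0, sigma_m=0.5, sigma_z=2.0):
    # Predict: push every particle through the (noisy) dynamics model.
    particles = particles + motion + rng.normal(0, sigma_m, len(particles))
    # Correct: reweight by the Gaussian measurement likelihood p(z | x).
    weights = weights * np.exp(-0.5 * ((z - particles) / sigma_z) ** 2)
    weights = weights / weights.sum()
    # Resample: concentrate particles where the object is likely to be.
    idx = rng.choice(len(particles), len(particles), p=weights)
    return particles[idx], np.ones(len(particles)) / len(particles)

particles, weights = step(particles, weights, z=42.0)
print("state estimate:", particles.mean())
```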

Person tracking

Semantics, Shemantics

Everything up until now has been semantic-free: no knowledge of what was in the image

Recognition means labeling objects in a way that humans would understand.

  • verification: is that thing a lamp?
  • detection: are there any people here?
  • identification: is that SAS Hall?
  • object categorization: labels like tree, banner, mountain
  • scene and context categorization: outdoor, city, market

Object Categorization

  • generic categorization: cars
  • instance-level recognition: Marschall's car
  • task: given a (small) # of training images of a category, recognize instances of that category and assign the correct category label
  • problem: a single object doesn't have a unique label

Categorization is tricky

  • realistic scenes are crowded, cluttered, and have overlapping objects
  • robustness: illumination, object pose, occlusions, viewpoint
  • complexity: volume of pixels, images, categories
  • importance of context

Supervised Classification

Given labeled examples, predict labels for new examples

Two general strategies:

  • Generative: model each class
  • Discriminative: model boundaries between classes

Generative Models

For a given measurement \(x\) and set of classes \(c_i\), choose \(c^*\) by \(c^* = \arg\max_c p(c|x) = \arg\max_c p(c)\,p(x|c)\)

If \(x\) is continuous, we need a likelihood density model of \(p(x|c)\)

This model is typically parametric and doesn't take much data to fit; think a Gaussian or a mixture of Gaussians.
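
A toy sketch of this generative recipe with Gaussian class-conditional densities (the classes and data are invented):

```python
import numpy as np

def fit_gaussian(X):
    # Likelihood model p(x | c): per-feature mean and variance.
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_likelihood(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def classify(x, models, priors):
    # c* = argmax_c p(c) p(x | c), computed in log space for stability.
    scores = {c: np.log(priors[c]) + log_likelihood(x, *models[c])
              for c in models}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
models = {"cat": fit_gaussian(rng.normal(0, 1, (50, 2))),
          "dog": fit_gaussian(rng.normal(3, 1, (50, 2)))}
priors = {"cat": 0.5, "dog": 0.5}
print(classify(np.array([2.8, 3.1]), models, priors))  # likely "dog"
```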

Principal Components Analysis

Principal components are the directions in feature space along which the points have the greatest variance.
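
A minimal NumPy sketch of finding those directions via the SVD of the centered data (toy data, not from the slides):

```python
import numpy as np

def pca(X, k):
    # Center the data; the top-k right singular vectors are the
    # directions of greatest variance in feature space.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]  # principal components, shape (k, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
components = pca(X, k=2)
reduced = (X - X.mean(axis=0)) @ components.T  # project to 2 dimensions
print(reduced.shape)  # (100, 2)
```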

Discriminative Models

Train

  • build an object model, a representation
  • learn/train a classifier

Test

  • generate candidates in new image
  • score the candidates

Discriminative Classifiers

  • nearest neighbor: Shakhnarovich 2003, Berg 2005
  • neural networks: LeCun 1998, Rowley 1998
  • SVMs: Guyon 2001
  • Boosting: Viola 2001, Torralba 2004, Opelt 2006
  • Random Forests: Breiman 1984, Shotton 2008

Viola-Jones Face Detection

"Rapid Object Detection using a Boosted Cascade of Simple Features" - Viola-Jones, 2001

Bag of Visual Words

Goal: summarize entire image based on its distribution of "word" occurrences
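
A NumPy sketch of the histogram step, assuming a vocabulary of cluster centers already learned (e.g. with k-means) from training descriptors; all sizes are toy values:

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    # Assign each local descriptor to its nearest visual "word",
    # then summarize the image as a normalized histogram of words.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
vocab = rng.normal(size=(50, 128))      # 50 words, SIFT-sized (toy)
desc = rng.normal(size=(300, 128))      # descriptors from one image
print(bag_of_words(desc, vocab).shape)  # (50,)
```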

Stuff I didn't cover

  • color images
  • video analysis
  • scene understanding
  • OpenCV