April 28, 2017

Roadmap

  • What is computer vision?
  • Image interpretation
  • Feature detection
  • Motion & tracking
  • Machine learning

Computer Vision

"Computer vision is the science of endowing computers or other machines with vision, or the ability to see."

  • Erik Learned-Miller, UMass

You've used CV before

  • Snapchat filters
  • Facebook tagging
  • Google Street View face blurring

Applications of CV

Computer Vision in Sports

What even is an image?

An image is a function \(f(x,y): \mathbb{R}^2 \rightarrow \mathbb{R}\), giving an intensity value. In practice the domain and range are bounded: \(f: [a,b] \times [c,d] \rightarrow [min, max]\), where \(a, b, c, d \in \mathbb{R}\).

Color images: \(f: \mathbb{R}^2 \rightarrow \mathbb{R}^3\), \(f(x,y) = [r(x,y),g(x,y),b(x,y)]\)

Images are made up of pixels, or picture elements.

Images as functions

"Computer vision is about taking these functions and computing something from them." (statistics!)

Smoothing = Blurring

Smoothing an image function is equivalent to blurring the image
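
A minimal sketch of this with OpenCV's Gaussian blur (the file names are hypothetical; assumes the opencv-python package):

```python
import cv2

# Load a grayscale image (hypothetical path).
img = cv2.imread("lena.png", cv2.IMREAD_GRAYSCALE)

# Convolve with a 2D Gaussian kernel; a bigger kernel/sigma smooths
# the image function more aggressively, i.e. blurs more.
blurred = cv2.GaussianBlur(img, (9, 9), 2.0)

cv2.imwrite("lena_blurred.png", blurred)
```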

Edge detection

  • An edge is a place of rapid change in the image intensity function.
  • Sometimes edges are all we need to convey what's going on in the image.

Edges are intense

How do we measure change in function values? Derivatives!

Intensity gradient

We have a 2D function, so we get a 2D gradient vector. The gradient points in the direction of most rapid increase in intensity.
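
Written out (standard notation), the gradient, its magnitude (edge strength), and its orientation are:

\[ \nabla f = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right], \qquad \|\nabla f\| = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2}, \qquad \theta = \tan^{-1}\!\left(\frac{\partial f/\partial y}{\partial f/\partial x}\right) \]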

Tiger example

Lena's edges

How do we take an edge image?

Step 1. Smooth derivatives to suppress noise and compute gradient.

Step 2. Threshold to find regions of "significant" gradient.

Step 3. "Thin" to get localized edge pixels.

Step 4. Link/connect edge pixels.
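
These four steps are essentially the Canny detector, which OpenCV exposes directly. A minimal sketch (file names hypothetical):

```python
import cv2

img = cv2.imread("lena.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path

# Canny bundles the steps above: Gaussian-smoothed gradients,
# non-maximum suppression ("thinning"), and hysteresis thresholding,
# which links weak edge pixels to strong ones.
edges = cv2.Canny(img, 100, 200)  # low/high hysteresis thresholds

cv2.imwrite("lena_edges.png", edges)
```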

Human Vision

Human Vision Takeaway

  • structure and depth are ambiguous from single views
  • we need some way to figure out how "far away" things are in the world
  • what if we took multiple pictures of the scene from different viewpoints?

What is an image (now)?

  • until now: a function, 2D pattern of intensity values
  • now: a 2D projection of 3D points
  • a camera is a device that records a projection of light onto some medium (film, a sensor, etc.)

Stereo Vision

Goal: use relationship between multiple views to recover depth

We need to consider

  1. corresponding image points
  2. information from camera pose

Disparity Map
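
For a rectified stereo pair there is a textbook relation between disparity and depth (\(f\) is the focal length, \(B\) the baseline between the cameras, \(d\) the disparity):

\[ Z = \frac{fB}{d} \]

so nearby points have large disparity and far-away points have small disparity.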

Correspondence Problem

We want to know which points correspond between two images. Let's assume that

  • most points in the scene are visible in both images
  • image regions for the match are similar in appearance

Epipolar Constraint

The geometry constrains where the corresponding pixel for a point in the first view can occur in the second view.

(Dense) Correspondence Search

Correspondence considerations

  • angle between image planes
  • uniqueness of correspondence
  • size of window used
  • occlusions
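
A minimal dense-correspondence sketch using OpenCV's block matcher, which searches along the (horizontal) epipolar line with a fixed-size window; file names are hypothetical and the images are assumed rectified:

```python
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified pair

# Window size trades robustness for detail (see considerations above).
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)  # fixed-point, scaled by 16

# Rescale for viewing and save the disparity map.
view = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("disparity.png", view.astype("uint8"))
```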

Camera calibration

We would like to know the relationship between coordinates in the image and coordinates in the real world.

This involves 2 types of transformations:

  • Extrinsic: from 3D world coordinates to camera coordinates
    • parameters: translation and rotation
  • Intrinsic: from camera coordinates to image coordinates
    • parameters: focal length, skew, aspect ratio, offsets (x and y)
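
Chaining the two gives the standard pinhole projection (textbook notation, not from the slides): a pixel \((u,v)\) comes from a world point \((X,Y,Z)\) via

\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \underbrace{\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{K \text{ (intrinsic)}} \underbrace{\begin{bmatrix} R & t \end{bmatrix}}_{\text{extrinsic}} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

where \(f_x, f_y\) encode focal length and aspect ratio, \(s\) is the skew, and \((c_x, c_y)\) are the offsets.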

Plane Transformations

Different transformations need different numbers of point correspondences to solve for.

Back to correspondence

Our goal is to find points in an image that can be

  • found in other images
  • found precisely - well localized
  • found reliably - well matched

Matching features
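
A minimal sketch with ORB features and brute-force matching in OpenCV (image paths hypothetical); the detector aims for exactly the "well localized" and "well matched" goals above:

```python
import cv2

img1 = cv2.imread("a.png", cv2.IMREAD_GRAYSCALE)  # hypothetical paths
img2 = cv2.imread("b.png", cv2.IMREAD_GRAYSCALE)

# Detect corner-like keypoints and compute binary descriptors.
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance; crossCheck keeps only
# mutual best matches, a cheap reliability filter.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "matches")
```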

Panoramas

Motion

A video is a sequence of frames captured over time (quickly).

Our image data is now a function of space (\(x,y\)) and time (\(t\)).

Motion in Computer Vision

  • background subtraction
  • shot boundary detection
  • motion segmentation
  • recognizing events and activities
  • improving video quality (motion stabilization)

Motion Estimation

Feature-based methods

  • extract visual features (corners) and track them over multiple frames
  • sparse motion fields, but more robust tracking

Direct, dense methods

  • directly recover image motion at each pixel from spatio-temporal image brightness variations
  • dense motion fields, but sensitive to appearance variations

Optical Flow

A motion field is the actual, physical motion of an object or surface. Optical flow is its apparent motion.

Lucas-Kanade Method

Basic idea: impose local constraints to get more equations for a pixel

  • pretend the pixel's neighbors have the same motion
  • closely related to the eigenvalue ratio used in Harris corner classification
  • can estimate motion really well in "corners"

Issue: motion is usually large (larger than a pixel)

Example

Gaussian Pyramids
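
Coarse-to-fine estimation over a Gaussian pyramid handles motions larger than a pixel: solve for flow at a coarse level, then refine at finer levels. OpenCV's pyramidal Lucas-Kanade does this; a minimal tracking sketch (frame paths hypothetical):

```python
import cv2

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)  # hypothetical
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Pick "corner" points, where Lucas-Kanade is best conditioned.
pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                              qualityLevel=0.01, minDistance=7)

# maxLevel=3: estimate on a coarse pyramid level first, then refine,
# so motions larger than the 15x15 window can still be recovered.
nxt, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, pts, None, winSize=(15, 15), maxLevel=3)

good = nxt[status.ravel() == 1]
print("tracked", len(good), "points")
```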

Tracking

Detection is locating an object independently in each frame.

Tracking uses dynamics to predict the new location of the object in the next frame, then updates that prediction based on measurements.

Tracking Challenges

  • in many places, it's hard to compute optical flow
  • there can be large displacements since objects could move rapidly
  • errors can compound, or drift
  • occlusions, disocclusions

Tracking as Inference

  • hidden state \(x\): the true parameters we care about
  • measurement \(Y\): a noisy observation of the underlying state
  • at each time step \(t\), the state changes from \(x_{t-1}\) to \(x_t\) and we get a new observation \(Y_t\)

Goal: recover distribution of state \(x_t\) given

  • all observations seen so far
  • knowledge about dynamics of state transitions

Tracking steps

  • prediction: what is the next state of the object given past measurements? \(P(x_t | y_0,...,y_{t-1})\)
  • correction: compute an updated estimate of the state from the prediction and measurements \(P(x_t | y_0,...,y_t)\)
  • tracking: the process of propagating this posterior distribution of state given measurements across time

We use probability distributions to allow for noise and missing frames

Kalman Filter

  • a method of tracking linear dynamical systems with Gaussian noise
  • both the predicted states and the corrected states are Gaussian
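
In one standard textbook form (the symbols \(A\), \(C\), \(Q\), \(R\) are not from the slides): with linear dynamics \(x_t = A x_{t-1} + w\), \(w \sim N(0, Q)\), and linear measurements \(y_t = C x_t + v\), \(v \sim N(0, R)\), the filter alternates

\[ \text{predict: } \mu_t^- = A \mu_{t-1}, \qquad \Sigma_t^- = A \Sigma_{t-1} A^\top + Q \]

\[ \text{correct: } K_t = \Sigma_t^- C^\top (C \Sigma_t^- C^\top + R)^{-1}, \qquad \mu_t = \mu_t^- + K_t (y_t - C \mu_t^-), \qquad \Sigma_t = (I - K_t C)\, \Sigma_t^- \]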

Bayes Filters

Given

  1. prior probability of the system state \(p(x)\)
  2. action (dynamical system) model: \(p(x_t|u_{t-1},x_{t-1})\)
  3. sensor model (likelihood) \(p(z|x)\)
    • if object was really at a place, what is the likelihood of my measurement?
  4. stream of observations \(z\) and action data \(u\): \(\{u_1, z_2, \ldots, u_{t-1}, z_t\}\)

We want to estimate the state \(x\) at time \(t\)
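
Combining these gives the standard recursive Bayes filter update:

\[ Bel(x_t) = \eta \, p(z_t \mid x_t) \int p(x_t \mid u_{t-1}, x_{t-1}) \, Bel(x_{t-1}) \, dx_{t-1} \]

where \(\eta\) is a normalizing constant: predict with the action model, then reweight by the sensor likelihood.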

Particle Filters

  • maintains a probability distribution over the state
  • represented as a set of weighted samples (particles)
  • higher weight placed where object is more likely to be
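
A toy 1D particle filter sketch (all dynamics and noise parameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
particles = rng.uniform(0, 100, N)  # prior over a 1D position
weights = np.ones(N) / N

def step(particles, weights, z, motion=1.0, sigma_m=0.5, sigma_z=2.0):
    # Predict: push every particle through the (noisy) dynamics model.
    particles = particles + motion + rng.normal(0, sigma_m, len(particles))
    # Correct: reweight by the Gaussian measurement likelihood p(z | x).
    weights = weights * np.exp(-0.5 * ((z - particles) / sigma_z) ** 2)
    weights = weights / weights.sum()
    # Resample: concentrate particles where the object is likely to be.
    idx = rng.choice(len(particles), len(particles), p=weights)
    return particles[idx], np.ones(len(particles)) / len(particles)

particles, weights = step(particles, weights, z=42.0)
print("state estimate:", particles.mean())
```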

Person tracking

Semantics, Shemantics

Everything up until now has been semantic-free: no knowledge of what was in the image

Recognition means labeling objects in a way that humans would understand.

  • verification: is that thing a lamp?
  • detection: are there any people here?
  • identification: is that SAS Hall?
  • object categorization: labels like tree, banner, mountain
  • scene and context categorization: outdoor, city, market

Object Categorization

  • generic categorization: cars
  • instance-level recognition: Marschall's car
  • task: given a (small) # of training images of a category, recognize instances of that category and assign the correct category label
  • problem: a single object doesn't have a unique label

Categorization is tricky

  • realistic scenes are crowded, cluttered, and have overlapping objects
  • robustness: illumination, object pose, occlusions, viewpoint
  • complexity: volume of pixels, images, categories
  • importance of context

Supervised Classification

Given labeled examples, predict labels for new examples

Two general strategies:

  • Generative: model each class
  • Discriminative: model boundaries between classes

Generative Models

For a given measurement \(x\) and set of classes \(c_i\), choose \(c^*\) by \(c^* = \arg\max_c p(c|x) = \arg\max_c p(c)\,p(x|c)\)

If \(x\) is continuous, we need a likelihood density model of \(p(x|c)\)

This model is typically parametric and doesn't take much data to fit; think a Gaussian or a mixture of Gaussians.
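
A toy sketch of this generative recipe with Gaussian class-conditional densities (the classes and data are invented):

```python
import numpy as np

def fit_gaussian(X):
    # Likelihood model p(x | c): per-feature mean and variance.
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_likelihood(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def classify(x, models, priors):
    # c* = argmax_c p(c) p(x | c), computed in log space for stability.
    scores = {c: np.log(priors[c]) + log_likelihood(x, *models[c])
              for c in models}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
models = {"cat": fit_gaussian(rng.normal(0, 1, (50, 2))),
          "dog": fit_gaussian(rng.normal(3, 1, (50, 2)))}
priors = {"cat": 0.5, "dog": 0.5}
print(classify(np.array([2.8, 3.1]), models, priors))  # likely "dog"
```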

Principal Components Analysis

Principal components are the directions in feature space along which the points have the greatest variance.
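
A minimal NumPy sketch of finding those directions via the SVD of the centered data (toy data, not from the slides):

```python
import numpy as np

def pca(X, k):
    # Center the data; the top-k right singular vectors are the
    # directions of greatest variance in feature space.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]  # principal components, shape (k, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
components = pca(X, k=2)
reduced = (X - X.mean(axis=0)) @ components.T  # project to 2 dimensions
print(reduced.shape)  # (100, 2)
```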

Discriminative Models

Train

  • build an object model, a representation
  • learn/train a classifier

Test

  • generate candidates in new image
  • score the candidates

Discriminative Classifiers

  • nearest neighbor: Shakhnarovich 2003, Berg 2005
  • neural networks: LeCun 1998, Rowley 1998
  • SVMs: Guyon 2001
  • Boosting: Viola 2001, Torralba 2004, Opelt 2006
  • Random Forests: Breiman 1984, Shotton 2008

Viola-Jones Face Detection

"Rapid Object Detection using a Boosted Cascade of Simple Features" - Viola-Jones, 2001

Bag of Visual Words

Goal: summarize entire image based on its distribution of "word" occurrences
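
A NumPy sketch of the histogram step, assuming a vocabulary of cluster centers already learned (e.g. with k-means) from training descriptors; all sizes are toy values:

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    # Assign each local descriptor to its nearest visual "word",
    # then summarize the image as a normalized histogram of words.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
vocab = rng.normal(size=(50, 128))      # 50 words, SIFT-sized (toy)
desc = rng.normal(size=(300, 128))      # descriptors from one image
print(bag_of_words(desc, vocab).shape)  # (50,)
```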

Stuff I didn't cover

  • color images
  • video analysis
  • scene understanding
  • OpenCV