Monday, April 16, 2012

Planning out rest of the project

We've completed our first experiment, and established the framework for our algorithm, which boils down to computing the following probability:

P(p is in cluster A) = Σ over labels i of P(p belongs to label i | cluster A is label i) * P(cluster A is label i)
So far we have used only a unigram model based on color histograms to determine these probabilities. We want to add a bigram model for spatial relationships between segments. In addition, we also want to incorporate a parse grammar of objects that models contextual relationships between segments.

We will need to assign weights to each model so that the total probability adds up to 1. We are still researching exactly how to learn these ideal weights; we are hoping something like a CRF model will do the trick, or we can just experiment with different values. Here's one potential example:

Feature     Unigram    Bigram
Color          0.20      0.00
Centroid       0.00      0.20
Area           0.05      0.20
Context        0.15      0.20
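A rough sketch of how these weighted models could be combined; the dict layout and the helper are illustrative, not a final design.

```python
# Rough sketch of combining the weighted models; the keys and helper
# function here are our own placeholders, not finalized code.

WEIGHTS = {
    ("color", "unigram"): 0.20,
    ("centroid", "bigram"): 0.20,
    ("area", "unigram"): 0.05,
    ("area", "bigram"): 0.20,
    ("context", "unigram"): 0.15,
    ("context", "bigram"): 0.20,
}

def combined_probability(model_probs):
    """Weighted sum of per-model probabilities. The weights sum to 1, so if
    each model outputs a value in [0, 1], the result stays in [0, 1]."""
    return sum(w * model_probs.get(key, 0.0) for key, w in WEIGHTS.items())
```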

For contextual relationships, we intend to use a parse grammar similar to the one described here.

For ours, we would use segments instead of lines as the elements of the grammar. Ideally, we would like our grammar to represent the 3D geometry of the scene (as in the link above), but to start we will probably just identify some basic labels:

Foreground
├── table
│   ├── legs
│   └── top
└── chair
    ├── legs
    └── top

Background
├── wall
│   ├── door
│   └── window
├── ceiling
└── floor
    └── carpet
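One possible way to encode this hierarchy in code, as a minimal sketch using plain nested dicts (leaf labels map to empty dicts):

```python
# One possible encoding of the label hierarchy: nested dicts, with leaf
# labels mapping to empty dicts. Purely illustrative.

PARSE_TREE = {
    "Foreground": {
        "table": {"legs": {}, "top": {}},
        "chair": {"legs": {}, "top": {}},
    },
    "Background": {
        "wall": {"door": {}, "window": {}},
        "ceiling": {},
        "floor": {"carpet": {}},
    },
}

def leaves(tree):
    """Collect leaf labels depth-first."""
    out = []
    for name, sub in tree.items():
        out.extend(leaves(sub) if sub else [name])
    return out
```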


Training data - need to figure this out

1) We can create our own training data from simple indoor scenes. For each image, we will have a manually segmented image for each level in the parse tree.

If we construct our own data, how many images will we need for our training data?

2) Find something online? We're not sure if anything like what we've described exists.

How to split and merge clusters? (some ideas)

Ideally, we want our algorithm to "find" new objects in the scene, based on its current understanding of the scene. This involves increasing or decreasing the number of segments at each iteration, eventually converging to the proper number.

For example, suppose we have clusters A-G and the labels 'table' and 'chair'. Say we know with high probability that cluster A has label 'table', and further we know from contextual relationships that if we see label 'table', then we should see label 'chair' nearby.

If none of the nearby clusters corresponds to a chair (that is, P(B or C or D or E or F or G = 'chair' | A = 'table') is very low), then we split one of the clusters (increasing k) to try to find a cluster with label 'chair'. Ideally, we want to split the cluster whose labels we are least confident in.

We split the cluster where we expect the part/object to be (using computer vision techniques, we know generally where it should be).

We merge clusters when they are empty or very small.
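A rough sketch of this split/merge policy; the cluster representation and the confidence measure (max label probability) are placeholders for whatever we end up using.

```python
# Rough sketch of the split/merge policy; the cluster dicts and the
# confidence measure below are placeholders, not a final design.

MIN_PIXELS = 300  # e.g. clusters E (220 px) and G (264 px) fall below this

def pick_cluster_to_split(clusters):
    """Split the cluster whose labels we are least confident in, measured
    here by its highest label probability (lower = less confident)."""
    return min(clusters, key=lambda c: max(c["label_probs"]))

def clusters_to_merge(clusters):
    """Merge away clusters that are empty or very small."""
    return [c for c in clusters if c["pixel_count"] < MIN_PIXELS]
```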

Friday, April 13, 2012

Experiment 1 Results

We completed our initial experiment to segment an image with a modified version of k-means, using the histograms for each label from the manual segmentation. Here are the results for one run; we have colored the segments according to the most probable label.

The following table shows the final probabilities assigned to each cluster.

                               Label
Cluster  Pixel Count       1        2        3        4        5        6        7
A             4359     0.9854*  0.0023   0.0003   0.0063   0.0034   0.0002   0.002
B             2589     0.0002   0.0001   0.0002   0        0        0.9994   0
C             2758     0.0039   0.9592   0.0077   0.0003   0.0284   0.0001   0.0003
D             1465     0.3841*  0.2337   0.2917   0.0045   0.0522   0.0282   0.0056
E              220     0.1431*  0.1429   0.1429   0.1428   0.1429   0.143    0.1425
F             1114     0.0178   0.001    0.0019   0.5741   0.0181   0        0.387
G              264     0.0004   0.0125   0.8674   0.0002   0.1183   0.0006   0.0007





Some interesting observations:
* Clusters A, D, and E were assigned the same label.
* Clusters E and G have very low pixel counts.
* Labels 5 and 7 were not assigned to any cluster.

We need some mechanism to deal with clusters that have very low pixel counts.


Here are some more runs comparing our modified k-means with normal k-means. On average, our modified version segments the image closer to the target segmentation, but there is plenty of room for improvement.

At this point we will start designing the parse tree for slightly more complicated indoor scenes, as well as a bigram model to exploit spatial and contextual relationships between objects. We will also design a mechanism to iteratively modify the value of k by merging and dividing clusters.

Wednesday, April 11, 2012

Experiment 1 Progress

For our progress on the first experiment, we are able to compute the following probability for each pixel p in our test image, given our manual Photoshop segmentation with labels i (1-7):

P(pixel p belongs to label i)
The following images represent the probability that each pixel belongs to labels 1-7. The last image in each chart segments the image by assigning each pixel to the label with the highest probability. (Note that this does not incorporate the k-means algorithm yet.)

sigma = 0.05
bin count = 15
A small sigma causes high contrast between segments.

sigma = 0.2
bin count = 15
A larger sigma causes less contrast between segments.

sigma = 0.2
bin count = 40
A larger bin count causes more noise.


In the following examples, we computed the histogram of a window around each pixel, which smoothed out the segmentation.



sigma = 0.3
bin count = 15
window size = 3

sigma = 0.3
bin count = 15
window size = 11
A larger window size increased smoothing, but also caused regions to spill over more.
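The windowed-histogram computation above can be sketched like this (grayscale and pure Python for brevity; we actually compute this per RGB channel, and the function name is just a placeholder):

```python
# Sketch of the windowed histogram around a single pixel (grayscale for
# brevity; in practice this runs per RGB channel).

def window_histogram(image, row, col, window_size=3, bin_count=15):
    """Normalized histogram of the window_size x window_size window
    centered on (row, col), clipped at the image borders.
    image is a 2D list of values in [0, 1)."""
    half = window_size // 2
    hist = [0] * bin_count
    n = 0
    for r in range(max(0, row - half), min(len(image), row + half + 1)):
        for c in range(max(0, col - half), min(len(image[0]), col + half + 1)):
            b = min(int(image[r][c] * bin_count), bin_count - 1)
            hist[b] += 1
            n += 1
    return [count / n for count in hist]
```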


------

We are also planning to follow up on what Professor Belongie said in class about conditional random fields. 

We intend to read the following links. 




Monday, April 9, 2012

Design for getting started

To get started, we are performing the following experiment. We have a simple image of an apple on a table, and we've done the segmentation manually using Photoshop.

 

We've divided the segments into the following labels:

1. Background
2. Table
3. Main apple body
4. Shaded area
5. Stem area
6. Specular highlight
7. Cast shadow

Using the mask on the right, we can compute the histogram of each label 1-7. Using this data, we will modify the k-means clustering algorithm (setting k = 7), so that it outputs a segmentation as close to the mask on the right as possible.
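A minimal sketch of how the per-label histograms could be computed from the mask (grayscale values in [0, 1) for brevity, since we actually do this per RGB channel; the function name and layout are illustrative):

```python
# Minimal sketch: build a normalized histogram per label from the manual
# mask. Grayscale for brevity; in practice this runs per RGB channel.

def label_histograms(image, mask, labels=range(1, 8), bin_count=15):
    """image and mask are same-shaped 2D lists; mask holds label ids 1-7."""
    hists = {i: [0] * bin_count for i in labels}
    counts = {i: 0 for i in labels}
    for image_row, mask_row in zip(image, mask):
        for value, label in zip(image_row, mask_row):
            b = min(int(value * bin_count), bin_count - 1)
            hists[label][b] += 1
            counts[label] += 1
    # normalize so each label's histogram sums to 1 (empty labels stay zero)
    return {i: [h / max(counts[i], 1) for h in hists[i]] for i in labels}
```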

 

Method

To do this, we will use the following equation to compute the distance D between a pixel p and a cluster A:

D = dA * (1 - λc)
where dA is the distance between pixel p and cluster A's mean in RGB feature-vector space (as in normal k-means), and c is just some parameter. We define λ as follows:

λ = P(p is in cluster A)

which we compute by marginalizing over our set of labels:

λ = Σ over labels i of P(p belongs to label i | cluster A is label i) * P(cluster A is label i)
λc represents our confidence that we have both labeled cluster A correctly and that pixel p belongs to that label. If this value is high, it drives the distance D lower, increasing the likelihood of assigning pixel p to cluster A. If our confidence is low, λ goes to 0 and we default to plain k-means. We anticipate that λ will start low but increase over several iterations.
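As a sketch, assuming the combined distance takes a multiplicative form (the exact functional form, and the default value of c, are assumptions here, not a finalized equation):

```python
# Sketch of the combined distance; the multiplicative form and the default
# c are assumptions on our part.

def modified_distance(d_A, lam, c=0.5):
    """lam = 0 reduces to the plain k-means distance d_A; a higher
    confidence lam shrinks D, favoring assignment of p to cluster A."""
    return d_A * (1.0 - c * lam)
```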

To compute P(cluster A is label i) for each label i, we will compute the chi-squared distance between the histograms of label i and cluster A, then use the soft-max formula to convert those distances to probabilities that add up to 1. We will do the same for P(p belongs to label i); however, since p is a single pixel, its histogram is just an impulse in each RGB channel. We will also try using the histogram of an nxn window around the pixel and compare results.
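A small sketch of these two steps, chi-squared distance followed by soft-max (the temperature parameter beta is an assumption on our part):

```python
import math

# Sketch: chi-squared distance between normalized histograms, then
# soft-max to turn distances into probabilities that sum to 1.

def chi_squared(h1, h2):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def softmax_probs(distances, beta=1.0):
    """Smaller distance -> higher probability; results sum to 1.
    beta is an assumed temperature parameter."""
    weights = [math.exp(-beta * d) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]
```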

Eventually we want to compute these probabilities using more than just histogram comparisons (using the parse graph described in the proposal), but this will be a good start for now.

Wednesday, April 4, 2012

Current Progress (Wednesday April 4, 2012)

Right now we are exploring alternatives to k-means for segmentation. We are primarily looking at the mean-shift algorithm and a graph based approach. Both are explained well in these slides:
Using k-means with our algorithm
 
We currently have an idea how to assist k-means segmentation with basic object recognition, but we're not sure how we will do that with the graph based approach or mean shift. With k-means, at each iteration we loop through all the pixels and decide whether to put them in a certain cluster. Normally this is done simply by finding the closest cluster to the pixel in feature-vector space. But if we make estimates at each iteration on what object the cluster corresponds to, then we can also weigh in the likelihood that the pixel belongs to that object.
 
For example, let's say we want to find the probability that pixel p belongs in cluster A, given that we estimate the following probabilities for what object cluster A corresponds to:

Cluster A:
Label          Probability
Apple          30%
Strawberry     20%
Raspberry      10%

We will use marginalization to find the likelihood that pixel p belongs to cluster A:
 
P (p is in cluster A) =      P(p is part of an apple | cluster A is an apple) * P (cluster A is an apple)
                                        + P(p is part of a strawberry | cluster A is a strawberry) * P (cluster A is a strawberry)
                                        + P(p is part of a raspberry | cluster A is a raspberry) * P (cluster A is a raspberry)

If this probability is high, then we are more likely to put p in cluster A, even if the feature vector distance to the current mean of A is not the minimum.
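A worked version of this marginalization; the conditional probabilities P(p is part of X | cluster A is X) below are made up for illustration.

```python
# Worked version of the sum above; the conditional probabilities are
# made up for illustration, while the priors match the table.

priors = {"apple": 0.30, "strawberry": 0.20, "raspberry": 0.10}
conditionals = {"apple": 0.8, "strawberry": 0.5, "raspberry": 0.3}

p_in_A = sum(conditionals[k] * priors[k] for k in priors)
# 0.8*0.30 + 0.5*0.20 + 0.3*0.10 = 0.37
```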


Using other approaches for segmentation with our algorithm

With the mean-shift and graph based approaches, it isn't exactly clear yet how we will incorporate labeling guesses in assisting the segmentation. However, we do want to explore these other methods before we commit to using k-means.
 
Graph based approach (in more detail)

Abstract and Link to Full Proposal

Indoor Scene Understanding via Parse Graphs

Abstract:

     The aim of this project is to explore a new approach to scene understanding from a single image. Rather than attempt to extract the precise 3D structure of the scene, the goal is to identify contextual and spatial relationships between different objects in the scene. From this we can estimate an approximate geometric layout that is sufficient for most applications, particularly navigation.
     Many techniques for this purpose use a pipeline of image segmentation, followed by classification of individual segments, and then derive some understanding of the scene. Instead of a pipeline approach where each step is independent of the other(s), we will use an iterative algorithm whereby the segmentation, classification, and scene understanding processes all assist one another.
     To accomplish this, we will be using a variation of the k-means clustering algorithm, where clusters are labeled at each iteration, and these labels influence how pixels are assigned to clusters on subsequent iterations. In addition, we will construct a parse graph whose nodes are all possible labels, and whose edges are all possible relationships between labels. At each iteration, we improve our understanding of the scene by setting the likelihoods of each label and relationship, eventually simplifying the graph to one that best fits the scene. Using this, we can approximate the depths and positions of objects relative to one another.
 

Monday, April 2, 2012

First Post

We are currently looking for potential test images to start with. We used this one in fall quarter for k-means clustering, so we might use it again.