Learning Graph Cut Energy Functions for Image Segmentation

In this paper we address the task of learning how to segment a particular class of objects, by means of a training set of images and their segmentations. In particular we propose a method to overcome the extremely high training time of a previously proposed solution to this problem, Kernelized Structural Support Vector Machines. We employ a one-class SVM working with joint kernels to robustly learn significant support vectors (representative image-mask pairs) and accordingly weight them to build a suitable energy function for the graph cut framework. We report results obtained on two public datasets and a comparison of training times on different training set sizes.


I. INTRODUCTION
Many computer vision applications require the precise identification of objects within a scene, and often the segmentation and selection of a particular target class [1], [2]. Unlike other segmentation challenges, this binary class-specific segmentation problem is well posed, and its performance can be measured accurately by counting the number of mislabeled pixels.
Large repositories of annotated images are now available, so an attractive approach is to take a generic segmentation algorithm and train it specifically for the object class of interest. For such an approach to be flexible in practice, computational complexity must also be taken into account, which favors solutions that are both fast and accurate.
Currently, s/t graph cuts [3] are considered one of the most effective techniques in image segmentation, due to the wide range of energy functions that can be minimized using efficient max-flow algorithms [4], [5]. Recently, several works have addressed the problem of introducing high-level information into graph cut energy functions [6], [7] in order to obtain flexible solutions, which have also been applied to 3D images [8], [9]. An important incentive for this field came from the medical image segmentation community [10], [11], [12], where gray-level images and low contrast make low-level approaches infeasible. The ability to learn a suitable energy function would make the system more flexible and easily adaptable to different settings.
Bertelli et al. [13] tackled the supervised class-specific segmentation problem using Kernelized Structural Support Vector Machines (KSSVMs), achieving good results on several datasets. Unfortunately, KSSVMs cannot be applied to large-scale datasets because of their complexity [14], and usually only linear kernels are employed in conjunction with Structural SVMs [15] to shorten the training phase.

Fig. 1. Sample images from the two datasets, horses and flowers (first row), the provided ground truth (second row) and the segmentation produced by the proposed approach (third row). The grey areas in the ground truth of the flower dataset are pixels labeled neither as foreground nor as background.
In this paper, we focus on a generative learning technique for structured prediction, here applied to binary segmentation. Our proposal dramatically reduces the training time when compared to discriminative approaches. We exploit joint kernels on image-mask pairs, used in a one-class SVM, to learn the energy function minimized in a graph cut framework. We discuss its application to two publicly available datasets, where we show that the performance of our proposal is similar to that of the more complex and time-consuming KSSVM. Fig. 1 shows some samples taken from the two datasets and the provided ground truth data.
The paper is organized as follows. Section II introduces the problem of structural segmentation and highlights some limitations of one common approach. Sections III and IV describe the proposed method, while Sections V and VI give further implementation details. Section VII presents experimental evaluations on different datasets, and Section VIII summarizes the contributions of the paper.

II. STRUCTURAL SEGMENTATION
Structural prediction through SSVMs [16] has proved effective in many computer vision tasks, such as scene recognition [17], object detection [18], tracking [19] and, more recently, image segmentation [20], [21], [13]. Structured segmentation describes the problem of learning a function
$$f : X \rightarrow Y,$$
where $X$ is the space of samples (images) and $Y$ is the space of structured labels (binary masks). To learn $f$ we assume that a training set of image-mask pairs $(x_1, y_1), \ldots, (x_n, y_n)$ is available. An SSVM learns a scoring function $F(x, y)$ that matches a sample $x$ with a label $y$, such that maximizing $F$ over the label space gives the correct output label for sample $x$.
A common approach is to take $F$ to be a linear function, $F(x, y, w) = w^\top \phi(x, y)$, where $w \in \mathbb{R}^n$ is a parameter vector and $\phi(x, y)$ is a joint feature vector. Defining an explicit feature vector $\phi(x, y)$ can be very difficult, so we work in the dual formulation using positive definite joint kernels
$$K((x, y), (x', y')) = \langle \phi(x, y), \phi(x', y') \rangle.$$
As defined in [13], the scoring function $F(x, y)$ can be written as
$$F(x, y) = \sum_{(i, \bar{y}) \in W} \alpha_{i\bar{y}} \left[ K((x_i, y_i), (x, y)) - K((x_i, \bar{y}), (x, y)) \right], \qquad (4)$$
where $W$ is the set of the most violated constraints, and $\alpha$ are the weights of the support vectors, found by solving the dual problem. Given an input image $x$, we can find the output label by maximizing $F(x, y)$:
$$\hat{y} = \arg\max_{y \in Y} F(x, y).$$
This maximization can be done using graph cuts, as demonstrated in [13]. Unfortunately, this formulation has two relevant performance issues:
• during training we have to construct the set of the most violated constraints $W$: for each training sample, find $k$ constraints ($k$ depends on the desired accuracy), each with the size of the training set, and solve an inference step for each element;
• during testing we have to compare a sample $x$ with each support vector, composed of every training sample and its most violated constraint, as in (4).

III. ENERGY FUNCTIONS MODELING
The main idea behind the proposed model, summarized in Fig. 2, is to exploit one-class SVMs in a kernel space to learn a set of support vectors and their relative weights, and to discard outliers from the training set, thus reducing the complexity at testing time. This idea was first introduced by Lampert et al. [14] under the name Joint Kernel Support Estimation, and applied to object localization and sequence labeling. Given a training set of sample-label pairs $(x_1, y_1), \ldots, (x_n, y_n)$, we want to model the probability density function $p(x, y)$ and use $f(x) = \arg\max_{y \in Y} p(x, y)$ for prediction. Assuming that $p(x, y)$ is high only if $y$ is a correct label for $x$, we only have to find the support of $p(x, y)$. This can be done effectively with a one-class support vector machine (OC-SVM). We can express $p(x, y)$ as $p(x, y) = \exp(w^\top \phi(x, y))$.
As mentioned in Section II, it is difficult to find an explicit formulation of $\phi(x, y)$, while it is easier to find a suitable joint kernel $K$ that matches two sample-label pairs. The joint kernel can be an arbitrary Mercer kernel [22]. The output of the OC-SVM learning process is a linear combination of kernel evaluations with the training samples, so the prediction function can be formulated as
$$f(x) = \arg\max_{y \in Y} \sum_{i=1}^{n} \alpha_i K((x, y), (x_i, y_i)). \qquad (7)$$
The selected support vectors are the training samples with nonzero $\alpha_i$. Learning can be carried out with standard existing implementations of OC-SVM, replacing the kernel matrix between samples with the joint kernel matrix between sample-label pairs.
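The prediction rule can be illustrated with a short Python sketch. It is not the implementation used in the experiments: `toy_kernel`, the candidate list and the weights in `alphas` are hypothetical, and the arg max is taken by exhaustive enumeration over a handful of candidate masks rather than by graph cuts.

```python
from typing import Callable, Sequence, Tuple

# A sample-label pair: (image descriptor, binary mask), both as plain tuples.
Pair = Tuple[tuple, tuple]


def score(x, y, support: Sequence[Pair], alphas: Sequence[float],
          joint_kernel: Callable[[Pair, Pair], float]) -> float:
    """Weighted sum of joint-kernel evaluations against the support vectors."""
    return sum(a * joint_kernel((x, y), sv) for a, sv in zip(alphas, support))


def predict(x, candidates, support, alphas, joint_kernel):
    """Return the candidate mask with the highest score (exhaustive arg max)."""
    return max(candidates,
               key=lambda y: score(x, y, support, alphas, joint_kernel))


def toy_kernel(a: Pair, b: Pair) -> float:
    """Hypothetical joint kernel: image agreement times mask agreement."""
    (xa, ya), (xb, yb) = a, b
    image_sim = 1.0 if xa == xb else 0.1
    mask_sim = sum(p == q for p, q in zip(ya, yb)) / len(ya)
    return image_sim * mask_sim


# Two made-up support vectors with made-up weights.
support = [(("img_a",), (1, 1, 0, 0)), (("img_b",), (0, 0, 1, 1))]
alphas = [0.7, 0.3]
candidates = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]

best = predict(("img_a",), candidates, support, alphas, toy_kernel)
print(best)
```

In this toy setting the candidate that agrees with the mask of the most similar, most heavily weighted support vector obtains the highest score.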
It is important to point out the difference between our approach and KSSVMs: in the training phase we only have to construct the joint kernel matrix between training samples and then train a standard non-linear OC-SVM; no inference steps are required during training. As a consequence, the training time does not depend on the structure of the output space, but only on the size of the training set.

IV. DEFINING THE KERNELS
Joint kernels between image-mask pairs allow us to model complex relationships between samples. As proposed in [13], we formulate the similarity kernel as the product of an image kernel and a mask kernel:
$$K((x_i, y_i), (x_j, y_j)) = \theta(x_i, x_j) \cdot \Omega(x_i, x_j, y_i, y_j), \qquad (8)$$
where $\theta(x_i, x_j)$ measures the similarity of the objects depicted in $x_i$ and $x_j$, and acts as a weight for the mask similarity kernel $\Omega(x_i, x_j, y_i, y_j)$. Consequently, if the two images are very different, the final similarity measure will be low, even if the masks are similar.

A. Image Similarity Kernel
The purpose of the image similarity kernel is to return high similarity values for images that contain very similar objects. We adopt a general-purpose similarity measure between images, the comparison of HOG descriptors [23], although many other descriptors could be used without changing the model [24], [25]. HOGs can be compared using standard similarity measures such as the Bhattacharyya distance. Since we are working with images of the same category (e.g. flowers), however, distances within an entire dataset do not vary much; as a consequence, "good" and "bad" samples are weighted similarly, leading to errors at classification time. A better choice is a Gaussian kernel, which distinguishes better between different images thanks to the parameter $\sigma$, optimized for a specific dataset. The image similarity kernel between image $x_i$ and image $x_j$ becomes
$$\theta(x_i, x_j) = \exp\left(-\frac{\|\rho(x_i) - \rho(x_j)\|^2}{2\sigma^2}\right), \qquad (9)$$
where $\rho(x_i)$ is the feature vector extracted from image $x_i$. Fig. 3 shows the different results obtained with the Bhattacharyya distance and with the Gaussian kernel, which helps to understand the consequences at classification time. For the computation of the HOG descriptors we adopted rectangular HOG (R-HOG) [23]: we compute gradients on the R, G and B color channels and take the maximum, divide the image with a 5×5 grid of cells (25 cells), and group the cells into 4 partially overlapping blocks of 3×3 cells each. Trilinear interpolation between histogram bins and cells is applied. The HOG feature is computed using 9 bins to quantize the orientation, leading to 9 (bins) × 9 (cells per block) × 4 (blocks) = 324 features.
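The role of σ and the descriptor length can be checked with a few lines of Python. This is a sketch under assumptions: the kernel is written in the common exp(−‖u − v‖²/(2σ²)) parametrization, and the two-dimensional vectors `query`, `similar` and `different` are made-up stand-ins for real 324-dimensional HOG descriptors.

```python
import math
from typing import Sequence


def gaussian_kernel(u: Sequence[float], v: Sequence[float], sigma: float) -> float:
    """Gaussian similarity between two descriptors (assumed 2*sigma^2 scaling)."""
    if len(u) != len(v):
        raise ValueError("descriptors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def hog_length(bins: int = 9, cells_per_block: int = 9, blocks: int = 4) -> int:
    """Descriptor length for the R-HOG layout described in the text."""
    return bins * cells_per_block * blocks


# Hypothetical descriptors: one close to the query, one farther away.
query, similar, different = (0.0, 0.0), (0.1, 0.0), (0.3, 0.0)


def gap(sigma: float) -> float:
    """Difference in similarity between the close and the far descriptor."""
    return (gaussian_kernel(query, similar, sigma)
            - gaussian_kernel(query, different, sigma))


wide_gap = gap(0.1)    # sigma matched to the scale of the distances
narrow_gap = gap(1.0)  # sigma much larger than the distances
print(hog_length(), wide_gap > narrow_gap)
```

When σ is much larger than the typical distance, both similarities are close to 1 and the two images are barely distinguished, which is why σ has to be chosen for the dataset at hand.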

B. Mask Similarity Kernel
The mask similarity kernel takes both images and masks into consideration to extract knowledge about how comparable two segmentations are. The kernel is a linear combination of three parts:
$$\Omega = \beta_1 \Omega_1 + \beta_2 \Omega_2 + \beta_3 \Omega_3.$$
The first kernel $\Omega_1(y_i, y_j)$ depends only on the binary masks and directly measures their similarity by counting the number of corresponding pixels:
$$\Omega_1(y_i, y_j) = \frac{1}{P} \sum_{p=1}^{P} \delta(y_{ip}, y_{jp}),$$
where $P$ is the total number of pixels in the image, $y_{ip}$ is the $p$-th pixel of mask $y_i$, and $\delta(\cdot, \cdot)$ is an indicator function defined as
$$\delta(a, b) = \begin{cases} 1 & \text{if } a = b, \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$
The second and third kernels exploit 3D color histograms computed in RGB space. Let us define $F_i^j$ and $B_i^j$ as the foreground and background histograms extracted from image $x_i$ using mask $y_j$, and $Pr(x_p \mid H)$ as the likelihood that pixel $x_p$ matches histogram $H$. We use negative log-likelihoods to express the penalties for assigning a pixel to the foreground or to the background, as first introduced in [3]. Negative log-likelihoods are defined as: [...] We can also define: [...]
To highlight the mutual agreement of two masks, the second kernel extracts a histogram from image $x_i$ using mask $y_j$ and evaluates it using $y_i$. Having $F_i^j$ and $B_i^j$: [...]
The third kernel exploits global features extracted from the entire training set to model the expected color distribution of foreground and background pixels. We define $F_G$ and $B_G$ as the global histograms extracted from the training samples using their respective masks. [...] where [...]
The histograms are quantized uniformly over the 3D color space using a fixed number of bins per channel, set to 16 by experimental evaluation (no smoothing is applied).
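The building blocks of the mask kernel can be sketched in Python as follows. The code covers only the mask agreement used by the first kernel, the 16-bins-per-channel RGB quantization and a negative log-likelihood lookup; it does not reproduce the second and third kernels. The helper names, the flat list layout of pixels and masks, and the small constant `eps` (added only to avoid log(0), since the histograms are not smoothed) are assumptions made for illustration.

```python
import math
from collections import Counter


def mask_agreement(mask_i, mask_j):
    """Fraction of pixels on which two binary masks carry the same label."""
    if len(mask_i) != len(mask_j):
        raise ValueError("masks must have the same number of pixels")
    return sum(a == b for a, b in zip(mask_i, mask_j)) / len(mask_i)


def bin_index(rgb, bins=16):
    """Map an 8-bit RGB triple to a single index in a bins^3 histogram."""
    r, g, b = (channel * bins // 256 for channel in rgb)
    return (r * bins + g) * bins + b


def color_histogram(pixels, mask, label, bins=16):
    """Normalized color histogram of the pixels whose mask value equals label."""
    counts = Counter(bin_index(px, bins)
                     for px, m in zip(pixels, mask) if m == label)
    total = sum(counts.values())
    return {index: count / total for index, count in counts.items()}


def neg_log_likelihood(rgb, histogram, bins=16, eps=1e-6):
    """Penalty for a pixel under a histogram; eps is an assumption, see above."""
    return -math.log(histogram.get(bin_index(rgb, bins), 0.0) + eps)


# Four made-up pixels: two reddish (foreground), two greenish (background).
pixels = [(250, 10, 10), (240, 20, 5), (10, 200, 10), (5, 210, 20)]
mask = [1, 1, 0, 0]
foreground = color_histogram(pixels, mask, 1)
background = color_histogram(pixels, mask, 0)

red = (250, 10, 10)
cost_fg = neg_log_likelihood(red, foreground)
cost_bg = neg_log_likelihood(red, background)
print(mask_agreement([1, 1, 0, 0], [1, 0, 0, 0]), cost_fg < cost_bg)
```

A reddish pixel receives a low penalty under the foreground histogram and a high one under the background histogram, which is the behavior the histogram-based kernels rely on.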

V. GRAPH CONSTRUCTION
It is worth noting that the previously defined kernels compare two image-mask pairs, while at testing time the mask of the test image is of course unknown. The kernels must therefore be reformulated so as to return pixel-wise potentials, in order to perform the maximization in (7). This maximization is carried out using s/t graph cuts [3].
The problem can be formulated as the maximum a posteriori estimation of a Markov Random Field, minimizing the energy function
$$E(y) = R(y) + \lambda B(y),$$
where $y$ is a binary vector of pixel labels and $R(y)$ is the unary term expressing the cost of assigning each pixel to the foreground or to the background. $B(y)$ is the smoothness term, formulated as proposed by Rother et al. [26]:
$$B(y) = \sum_{(p, q) \in N} \left[ 1 - \delta(y_p, y_q) \right] \frac{1}{dist(p, q)} \exp\left(-\frac{\|x_p - x_q\|^2}{2\sigma}\right),$$
where $N$ is the set of pairs of neighboring pixels (8-connected), $\delta(\cdot, \cdot)$ is the indicator function defined in (12), $dist(p, q)$ is the distance between pixels and $\sigma$ is the expectation of the squared Euclidean distance in color space, $\|x_p - x_q\|^2$.
At classification time we have to compute the foreground and background potentials $P_f$ and $P_b$ corresponding to the unary term in the graph cut framework. They result from a linear combination of the potentials $P_{f_i}$ and $P_{b_i}$ obtained by comparing the test image $x_j$ with each support vector $(x_i, y_i)$, weighted by the corresponding $\alpha_i$. The potentials at position $p$ are: [...] where $\theta(x_i, x_j)$ is the image similarity kernel defined in (9). The first kernel is strictly related to the mask $y_i$: [...] The second kernel expresses the cost of assigning a pixel to the foreground or to the background according to the histograms $F_j^i$ and $B_j^i$, defined in Section IV-B: [...] The third kernel expresses the cost of assigning a pixel to the foreground or to the background given the global histograms $F_G$, $B_G$ calculated on the training set: [...] where [...]

VI. PARAMETER OPTIMIZATION
Some parameters of the proposed approach have been optimized on a validation set, maximizing the segmentation accuracy. These are the weights of the mask kernels $\beta_1$, $\beta_2$ and $\beta_3$, the $\nu$ of the OC-SVM, and the $\lambda$ of the graph cut framework. The parameter $\nu$ is used in the OC-SVM to specify an upper bound on the fraction of outliers: the higher $\nu$, the larger the fraction of training samples that can be ignored by the OC-SVM, which introduces robustness against outliers. The parameter $\lambda$ of the graph cut weights the importance of the smoothness term in the final energy function and must be optimized because the unary potential $R(y)$ changes when the kernel weights change.
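The role of λ can be seen on a toy energy made of a unary cost plus λ times a smoothness penalty. The Python sketch below is an illustration only: it uses a one-dimensional chain of five pixels with made-up costs, a uniform penalty for every label change (the contrast-sensitive weights are dropped), and exhaustive enumeration in place of the max-flow solver, which is feasible only for a handful of pixels.

```python
from itertools import product


def energy(labels, fg_cost, bg_cost, lam):
    """Unary cost plus lam times the number of label changes along the chain."""
    unary = sum(fg_cost[p] if label == 1 else bg_cost[p]
                for p, label in enumerate(labels))
    changes = sum(labels[p] != labels[p + 1] for p in range(len(labels) - 1))
    return unary + lam * changes


def minimize(fg_cost, bg_cost, lam):
    """Brute-force minimizer over all 2^n labelings (toy sizes only)."""
    n = len(fg_cost)
    return min(product((0, 1), repeat=n),
               key=lambda labels: energy(labels, fg_cost, bg_cost, lam))


# Hypothetical per-pixel costs: the middle pixel weakly prefers background.
fg_cost = [0.2, 0.3, 0.6, 0.2, 0.1]
bg_cost = [0.9, 0.8, 0.4, 0.9, 0.8]

no_smoothing = minimize(fg_cost, bg_cost, lam=0.0)
with_smoothing = minimize(fg_cost, bg_cost, lam=0.5)
print(no_smoothing, with_smoothing)
```

With λ = 0 every pixel follows its own unary preference, so the weakly supported middle pixel is labeled background; a moderate λ makes the two label changes more expensive than the unary gain and the isolated pixel is absorbed into the foreground. This also suggests why λ has to be re-tuned whenever the scale of the unary term changes.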
The optimization is done iteratively, changing one parameter at a time. Each parameter has an optimization range and a step size. The best value within the range is searched for, and the range is recentered on it. If the current center does not change, the step size is halved and the optimization process moves to the next parameter. This operation is repeated for a fixed number of iterations (set to 4 by experimental evaluation). For all the parameters that affect the training phase ($\beta_1$, $\beta_2$, $\beta_3$ and $\nu$), the OC-SVM is retrained to test each new value. A detailed discussion of the parameter values on the different datasets is given in Section VII-A.
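One possible reading of this procedure is sketched below in Python. It is an interpretation, not the code used in the experiments: the function and argument names are hypothetical, `max_moves` is a safety cap that the text does not mention, the search range keeps a fixed half-width while the step is halved, and `toy_accuracy` stands in for the validation accuracy (which in practice would require retraining the OC-SVM for the parameters that affect training).

```python
from typing import Callable, Dict

Params = Dict[str, float]


def coordinate_search(objective: Callable[[Params], float], start: Params,
                      half_range: Params, step: Params,
                      iterations: int = 4, max_moves: int = 50) -> Params:
    """Optimize one parameter at a time on a grid recentered on the best value."""
    params = dict(start)
    step = dict(step)
    for _ in range(iterations):
        for name in params:
            for _move in range(max_moves):
                center = params[name]
                k = int(round(half_range[name] / step[name]))
                grid = [center + i * step[name] for i in range(-k, k + 1)]
                best = max(grid, key=lambda v: objective({**params, name: v}))
                if objective({**params, name: best}) <= objective(params):
                    break  # the center did not change
                params[name] = best  # recenter the range on the best value
            step[name] /= 2.0  # refine the step before the next parameter
    return params


def toy_accuracy(p: Params) -> float:
    """Made-up smooth objective with its maximum at nu = 0.3, lam = 2.0."""
    return -(p["nu"] - 0.3) ** 2 - (p["lam"] - 2.0) ** 2


result = coordinate_search(
    toy_accuracy,
    start={"nu": 0.5, "lam": 1.0},
    half_range={"nu": 0.4, "lam": 2.0},
    step={"nu": 0.2, "lam": 1.0},
)
print(result)
```

On this toy objective the search reaches the maximizer in the first pass; later passes only refine the grid around it.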

VII. EXPERIMENTAL EVALUATION
We tested the proposed method on two publicly available datasets, containing images of flowers and horses. The first is the Weizmann horse dataset [27], which contains 328 images of horses with strong differences in background, contrast and pose. The second is the Oxford flower dataset [28], composed of 849 images of flowers belonging to different species.
All images are resized to the same dimensions to allow the kernel computation; the chosen size is 256×256 for both datasets. We split each dataset into three parts, trained our method on the first, optimized the parameters on the second, and tested the system on the third; we then exchanged the parts and averaged the results (three tests are conducted for each experiment).
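The evaluation protocol can be sketched as follows. The text does not state which three assignments of the parts were used, so the cyclic rotation below is an assumption, as are the helper names and the choice of giving any leftover images to the last part.

```python
def split_three(items):
    """Split a list into three contiguous parts; the last takes the remainder."""
    third = len(items) // 3
    return [items[:third], items[third:2 * third], items[2 * third:]]


def rotations(parts):
    """Yield (train, validation, test) for each cyclic rotation of the parts."""
    n = len(parts)
    for shift in range(n):
        yield tuple(parts[(shift + k) % n] for k in range(n))


# Image identifiers stand in for the 328 images of the horse dataset.
parts = split_three(list(range(328)))
runs = list(rotations(parts))
sizes = [len(part) for part in parts]
print(sizes, len(runs))
```

Each part serves once for training, once for validation and once for testing, and the reported figures are the average over the three runs.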
We used two metrics to evaluate the segmentation performance: the pixel-wise accuracy $S_a$, which measures the percentage of correctly labeled pixels, and the intersection-over-union metric $S_o$, defined as the intersection of the output mask and the ground truth mask divided by the union of the two masks:
$$S_a = \frac{1}{P} \sum_{p=1}^{P} \delta(M_p, M_p^{GT}), \qquad S_o = \frac{\sum_{p=1}^{P} \left[ M_p = \text{"obj"} \wedge M_p^{GT} = \text{"obj"} \right]}{\sum_{p=1}^{P} \left[ M_p = \text{"obj"} \vee M_p^{GT} = \text{"obj"} \right]}, \qquad (25)$$
where $M$ is the output mask of the system, $M^{GT}$ is the ground truth mask, $\delta(\cdot, \cdot)$ is the indicator function defined in (12), and a bracketed condition counts 1 when it holds and 0 otherwise. We first compared our solution with the one proposed in [13] on the flower dataset (Table I). As a comparison we also report the results obtained by [28]. Here it is important to note that the method proposed in [28] is tailored to the flower domain, because it exploits a flower shape model made of center and petals. Moreover, our method is able to obtain the same results as the KSSVM while dramatically reducing the computational complexity; this quickly becomes a key feature when dealing with larger datasets.
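Both metrics are straightforward to compute on flattened masks, as the following Python sketch shows. The function names and the example masks are made up for illustration; returning 1.0 when neither mask contains object pixels is a convention chosen here, and pixels labeled neither foreground nor background (as in the flower ground truth) are not handled.

```python
def pixel_accuracy(pred, gt):
    """S_a: fraction of pixels whose predicted label equals the ground truth."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of pixels")
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)


def intersection_over_union(pred, gt, obj=1):
    """S_o: object pixels in both masks divided by object pixels in either."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p == obj and g == obj for p, g in zip(pred, gt))
    union = sum(p == obj or g == obj for p, g in zip(pred, gt))
    return intersection / union if union else 1.0


# Made-up flattened masks of six pixels.
pred = [1, 1, 0, 0, 1, 0]
gt = [1, 0, 0, 0, 1, 1]
s_a = pixel_accuracy(pred, gt)
s_o = intersection_over_union(pred, gt)
print(s_a, s_o)
```

The example illustrates that the two scores can differ noticeably: background pixels labeled correctly raise the accuracy but do not enter the intersection-over-union.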
On the horses dataset we compared our method with the KSSVM framework and with the GrabCut framework [26] as a baseline. We tested different automatic initialization strategies for GrabCut: the first uses the bounding boxes produced by the part-based detector of Felzenszwalb et al. [29], [30], while the others employ the average of the masks of the k nearest images found with Eq. (9). KSSVMs perform slightly better, probably due to their discriminative nature, which allows them to rule out wrong (but feasible) segmentations by exploiting the most violated constraints.

A. Parameter optimization
To understand the meaning of the parameters involved in the method, we summarize their optimized values in Table II. The optimization leads to different values for the two datasets, and some observations can be made:
• the first mask kernel is more important for the horses, probably due to the pose homogeneity across the dataset;
• the second mask kernel, which exploits information from both images and masks, is the most important and receives the highest weight in both datasets;
• the third mask kernel, which employs global color information learned from training, is more important for the flowers, and this is certainly related to the homogeneous background of flower images, which often depict grass or soil;
• parameter ν (see Section VI for details) is higher in the flowers dataset, which means that it is convenient to ignore a certain percentage of training samples (about 45%) that do not provide important information.

B. Training Time Comparison
To highlight the difference in training time between our approach and KSSVMs, we chose the largest dataset at our disposal, the Oxford flowers dataset, and compared training performance while increasing the number of training samples from 20 to 800. Given that the code for KSSVMs is not publicly available, we used our own implementation, based on the LaRank classifier [31]. Fig. 5 reports a comparison between the two methods. Although the training time of our approach increases non-linearly, due to the quadratic number of kernel computations needed to build the joint kernel matrix, the KSSVM training time is one to two orders of magnitude higher and also increases non-linearly.
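A back-of-the-envelope count shows how the number of joint kernel evaluations grows with the training set size. The sketch assumes that the kernel matrix is symmetric, so that only the upper triangle (diagonal included) has to be computed; it counts kernel evaluations only and says nothing about the cost of the OC-SVM solver or about measured running times.

```python
def kernel_evaluations(n: int, symmetric: bool = True) -> int:
    """Number of kernel evaluations needed to fill an n-by-n kernel matrix."""
    return n * (n + 1) // 2 if symmetric else n * n


small = kernel_evaluations(20)
large = kernel_evaluations(800)
print(small, large, large / small)
```

Going from 20 to 800 training samples multiplies the set size by 40 but the number of kernel evaluations by roughly 1500, which is consistent with a training time that grows faster than linearly.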

VIII. CONCLUSIONS
We proposed a novel segmentation approach based on one-class SVMs and joint kernels between image-mask pairs. The method exploits the ability of OC-SVMs to identify and ignore outliers in the training set, while reducing the number of kernel computations needed at classification time. The characteristics of this generative learning algorithm allow faster training and testing phases than discriminative approaches such as KSSVMs, while reaching comparable performance.

Fig. 3. HOG feature distance using (a) the Bhattacharyya measure or (b) a Gaussian kernel. The Gaussian kernel separates flowers that are similar to the query from other flowers with a larger gap.

Fig. 4. Segmentation results on the flowers and the horses datasets.


Fig. 5. Comparison of the time required to train the system with KSSVM and with our proposal.

TABLE I. PERFORMANCE COMPARISON ON THE WEIZMANN HORSES AND OXFORD FLOWERS DATASETS.

TABLE II. OPTIMIZED PARAMETERS ON THE TWO DATASETS.