Auto Pano
Panorama Stitching using Classical and Deep Learning Methods
Phase 1
This section presents the implementation of panorama stitching considering the traditional formulation. The traditional approach involves the following steps: detecting key-points in images, extracting feature descriptors, and using Random Sample Consensus (RANSAC) to robustly estimate the homography matrix between images.
Corner Detection
A corner in an image is a region where the gradient shows significant changes in multiple directions. The Harris Detector, a popular algorithm for corner detection, analyzes intensity variations in a pixel’s local neighborhood to identify such points.
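As a concrete illustration, corner scores can be obtained with OpenCV's cornerHarris; a minimal sketch (the file name and threshold value are illustrative choices, not values from this work):

```python
import cv2
import numpy as np

# Harris expects a single-channel float32 image
img = cv2.imread("image1.jpg")                       # hypothetical input path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)

# blockSize: neighborhood size, ksize: Sobel aperture, k: Harris free parameter
corner_score = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep pixels whose response exceeds a fraction of the maximum score
ys, xs = np.nonzero(corner_score > 0.01 * corner_score.max())
```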
Adaptive Non-Maximal Suppression
While the Harris detector successfully identifies corners in the image, these features are not uniformly distributed. The purpose of Adaptive Non-Maximal Suppression (ANMS) is to improve the spatial distribution of key points. Specifically, it selects the Nbest best-distributed corners by keeping only those that are true local maxima.
Algorithm 1: Adaptive Non-Maximal Suppression
Data: Corner score image C and the maximum number of best corners Nbest
Result: (xi, yi) for i = 1 : Nbest
Initialize ri = ∞ for i = 1 : Nstrong
for i = [1 : Nstrong] do
    for j = [1 : Nstrong] do
        if C[yj, xj] > C[yi, xi] then
            ED = (xj − xi)² + (yj − yi)²
            if ED < ri then
                ri = ED
            end
        end
    end
end
Sort ri in descending order and pick the top Nbest points
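A direct NumPy translation of Algorithm 1 might look like the following (an O(Nstrong²) double loop; the simple thresholding used to pick the Nstrong candidates is our simplification):

```python
import numpy as np

def anms(corner_score, n_best):
    """Pick n_best spatially well-distributed corners from a corner score image."""
    # Candidate corners (Nstrong): here a simple threshold on the score image;
    # a fuller implementation would keep only true local maxima first.
    ys, xs = np.nonzero(corner_score > 0.01 * corner_score.max())
    scores = corner_score[ys, xs]
    n_strong = len(xs)

    r = np.full(n_strong, np.inf)            # suppression radius r_i
    for i in range(n_strong):
        for j in range(n_strong):
            if scores[j] > scores[i]:
                ed = (xs[j] - xs[i]) ** 2 + (ys[j] - ys[i]) ** 2
                if ed < r[i]:
                    r[i] = ed

    # The largest radii correspond to the most isolated strong corners
    keep = np.argsort(-r)[:n_best]
    return xs[keep], ys[keep]
```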
Feature Descriptors
The next step is to generate descriptors that facilitate comparisons between these key points. Each key point is represented by a feature vector or descriptor, derived from a patch centered on the key point that captures its distinguishing features. Various methods can be used to generate feature descriptors, including Gaussian functions, gradient orientations, or texture-based approaches. In this work, we adopted a straightforward method by applying a Gaussian function to the patch and subsequently sub-sampling the resulting blurred output.
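A sketch of this descriptor, assuming a 41x41 patch blurred and sub-sampled to 8x8 and then standardized (the patch, kernel, and output sizes are our illustrative choices):

```python
import cv2
import numpy as np

def describe(gray, x, y, patch_size=41, out_size=8):
    """Build a feature vector from the patch centered on (x, y).

    Assumes (x, y) lies far enough from the image border for a full patch.
    """
    half = patch_size // 2
    patch = gray[y - half : y + half + 1, x - half : x + half + 1]

    # Blur, then sub-sample the blurred patch down to out_size x out_size
    blurred = cv2.GaussianBlur(patch, (5, 5), 0)
    small = cv2.resize(blurred, (out_size, out_size))

    # Standardize to zero mean / unit variance for some illumination invariance
    vec = small.astype(np.float32).ravel()
    return (vec - vec.mean()) / (vec.std() + 1e-8)
```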
Feature Matching
Now, the task is to identify the feature vector correspondences between the two images. To achieve this, for a given point in Image 1, we calculate the differences with all points in Image 2 and compute the squared Euclidean norm for each. Using this metric, we determine the ratio between the best match (lowest distance) and the second-best match (second lowest distance). If this ratio is below a predefined threshold, the match is retained; otherwise, it is discarded. We repeat this process for every feature point in Image 1.
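A sketch of this ratio-test matching (the 0.8 threshold is an illustrative value):

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.8):
    """Ratio-test matching between descriptor arrays of shape (N1, D) and (N2, D)."""
    matches = []
    for i, d in enumerate(desc1):
        # Squared Euclidean distance from descriptor i to every descriptor in image 2
        dists = np.sum((desc2 - d) ** 2, axis=1)
        best, second = np.argsort(dists)[:2]
        # Keep the match only if the best is clearly better than the second best
        if dists[best] < ratio * dists[second]:
            matches.append((i, best))
    return matches
```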
Random Sample Consensus (RANSAC) for outlier rejection
In the previous section, we demonstrated the ability to match feature vectors between images. However, the results above reveal that we are identifying not only correct correspondences but also incorrect ones. To reject these incorrect matches, this work employs the RANSAC method to compute the homography using only the accurate matches between images.
Algorithm 2: RANSAC
Data: Feature matches (p, p') between images 1 and 2
Result: Homography matrix H between the pair of images
for i = [1 : Nmax] do
    P ← four random feature points from image 1
    P' ← the corresponding feature points from image 2
    H ← ComputeHomography(P, P')
    Compute inliers: matches with ‖p' − Hp‖ < τ
    if inliers > 90% then
        break
    end
end
Keep the largest set of inliers
Re-compute least-squares Ĥ using all the inliers
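A NumPy/OpenCV sketch of Algorithm 2 (the Nmax and τ defaults are our choices; cv2.getPerspectiveTransform and cv2.findHomography are standard OpenCV calls):

```python
import cv2
import numpy as np

def ransac_homography(pts1, pts2, n_max=2000, tau=5.0):
    """Robustly estimate H from matched points, given as (N, 2) float arrays."""
    n = len(pts1)
    best_inliers = np.zeros(n, dtype=bool)
    pts1_h = np.hstack([pts1, np.ones((n, 1))])       # homogeneous coordinates

    for _ in range(n_max):
        idx = np.random.choice(n, 4, replace=False)   # four random matches
        try:
            H = cv2.getPerspectiveTransform(
                pts1[idx].astype(np.float32), pts2[idx].astype(np.float32))
        except cv2.error:
            continue                                  # degenerate (collinear) sample

        proj = pts1_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]             # back to Cartesian
        inliers = np.linalg.norm(proj - pts2, axis=1) < tau

        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
        if inliers.sum() > 0.9 * n:                   # early exit from Algorithm 2
            break

    # Least-squares re-fit of H on the largest inlier set
    H_hat, _ = cv2.findHomography(pts1[best_inliers], pts2[best_inliers], 0)
    return H_hat
```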
Image Stitching
Once the homography matrix between the pair of images is obtained, it becomes possible to combine multiple views into a single image, resulting in a seamless panoramic view. However, in this technique it is important to blend the overlapping region between images without affecting the regions that do not overlap.
Poisson Blending: This technique is used to seamlessly blend one image into another when stitching multiple images. Poisson blending achieves this by transferring gradient information from the source image to the destination image while maintaining the overall smoothness of the stitched result. This approach is based on solving a Poisson equation for the image to ensure that the gradients from the second image are maintained while minimizing discontinuities at the boundary of the stitch.
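OpenCV exposes Poisson blending as cv2.seamlessClone; a minimal usage sketch (the file names and the full-image mask are placeholders):

```python
import cv2
import numpy as np

src = cv2.imread("warped_view.jpg")        # hypothetical warped source image
dst = cv2.imread("panorama_canvas.jpg")    # hypothetical destination canvas

# Mask selecting the region of src to transfer; here the whole source image
mask = 255 * np.ones(src.shape[:2], dtype=np.uint8)
center = (dst.shape[1] // 2, dst.shape[0] // 2)   # placement of src within dst

# NORMAL_CLONE solves the Poisson equation: src gradients are preserved
# while intensities are adjusted to match dst at the boundary
blended = cv2.seamlessClone(src, dst, mask, center, cv2.NORMAL_CLONE)
```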
Alpha Blending: This technique combines multiple images using the value alpha, which represents the transparency of each image. By enabling smooth transitions between the foreground and background, it is particularly well-suited for tasks such as image compositing.
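A minimal alpha-blending sketch; alpha may be a scalar or a per-pixel weight map such as a linear ramp across the overlap:

```python
import numpy as np

def alpha_blend(fg, bg, alpha):
    """Convex combination of two images of the same shape.

    alpha: scalar in [0, 1] or a per-pixel map; for a (H, W) map over color
    images, expand it to (H, W, 1) first so it broadcasts over channels.
    """
    out = alpha * fg.astype(np.float32) + (1.0 - alpha) * bg.astype(np.float32)
    return out.astype(np.uint8)
```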
Phase 2
This section presents the implementation of panorama stitching considering the deep learning network approaches. The deep learning approach can be performed in two different ways:
- Supervised Learning
- Unsupervised Learning
Dataset Generation
To train both the supervised and unsupervised approaches for estimating the homography between pairs of images, we need datasets comprising pairs with known homographies. Obtaining such datasets is generally challenging, as determining the homography requires knowledge of the transformations between image pairs. Consequently, synthetic pairs of images are generated to train the networks, using a small subset of images from the MSCOCO dataset, which encompasses a diverse array of objects in natural scenes. The steps to generate the image pairs are as follows:
- Since all pairs should be of the same size, the first step is to resize the images to a standard dimension; in our implementation, 320x320.
- We then apply a random crop in a feature-rich region, identified using a corner detection method. This gives our original patch PA with corners CA.
- To this patch PA we apply a random transformation HAB to get a perturbed patch with corners CB.
- We calculate the inverse homography HBA that maps the perturbed patch back to PA, warp the image with it, and extract the patch PB, which has the same coordinates in the transformed image as PA has in the original image.
The dataset comprises 5000 images, from which 30 patches are generated per image using the aforementioned procedure, resulting in a total of 150,000 image pairs.
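A sketch of this generation procedure for a single pair (the perturbation range rho and the purely random, rather than corner-guided, crop location are our simplifications):

```python
import cv2
import numpy as np

def make_pair(img, patch=128, rho=32):
    """Generate one (PA, PB, H4pt) example from a 320x320 gray-scale image."""
    h, w = img.shape
    # Random top-left corner, leaving room for the patch and the perturbation
    x = np.random.randint(rho, w - patch - rho)
    y = np.random.randint(rho, h - patch - rho)
    ca = np.float32([[x, y], [x + patch, y],
                     [x + patch, y + patch], [x, y + patch]])

    # Perturb each corner by up to +/- rho pixels to obtain CB
    cb = ca + np.random.uniform(-rho, rho, ca.shape).astype(np.float32)

    # Warp the full image with the inverse homography HBA, then crop at CA:
    # the crop PB shows the content that CB covers in the original image
    h_ab = cv2.getPerspectiveTransform(ca, cb)
    warped = cv2.warpPerspective(img, np.linalg.inv(h_ab), (w, h))

    pa = img[y : y + patch, x : x + patch]
    pb = warped[y : y + patch, x : x + patch]
    return pa, pb, (cb - ca).ravel()       # H4pt label: 8 corner offsets
```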
Supervised Approach
A supervised learning approach is a type of machine learning in which a model is trained on labeled data. This means that the algorithm learns from a dataset that already contains input-output pairs, where each input is associated with a correct output. In our implementation, the input to the model is a pair of gray-scale image patches, and the ground-truth label is the homography between them. The objective of this optimization problem is to minimize the error between the estimated output of the neural network and the ground-truth homography. In this work we use the structure presented in HomographyNet.
The patch size for each image is 128x128, and the patches PA and PB are normalized and stacked on top of each other to obtain a single input.
HomographyNet: We use HomographyNet, a deep CNN with 8 convolutional layers and 2 fully connected layers. Each convolutional layer uses 3x3 kernels followed by BatchNorm and ReLU. The network takes as input a two-channel gray-scale image, i.e., the stacked patches PA and PB, sized 128x128x2. We use 8 convolutional layers with a max pooling layer (2x2, stride 2) after every two convolutions. The 8 convolutional layers have the following number of filters per layer: 64, 64, 64, 64, 128, 128, 128, 128. The convolutional layers are followed by two fully connected layers, the first of which has 1024 units. Dropout with a probability of 0.5 is applied after the final convolutional layer and the first fully connected layer. The output of the network directly produces 8 real-valued numbers, the predicted corner perturbations from CA to CB.
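A PyTorch sketch matching this description (following the text literally, a pooling layer is placed after every conv pair, so the 128x128 input is reduced to 8x8 before the fully connected head):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    """3x3 convolution + BatchNorm + ReLU, as used throughout the network."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class HomographyNet(nn.Module):
    def __init__(self):
        super().__init__()
        filters = [2, 64, 64, 64, 64, 128, 128, 128, 128]
        layers = []
        for i in range(8):
            layers.append(conv_block(filters[i], filters[i + 1]))
            if i % 2 == 1:                      # max pool after every two convs
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Dropout(0.5),                    # dropout after final conv layer
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),                    # dropout after first FC layer
            nn.Linear(1024, 8),                 # H4pt: 8 corner offsets
        )

    def forward(self, x):                       # x: (B, 2, 128, 128) stacked patches
        return self.head(self.features(x))
```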
Hyperparameters: The hyperparameters used for the model are as follows (see the sketch after the list):
- optimizer: SGD
- learning rate: 0.006
- loss metric: L1 norm
- epochs: 100
- batch size: 64
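In PyTorch terms this configuration corresponds roughly to the following (train_loader is a hypothetical DataLoader yielding stacked patch pairs and H4pt labels; device handling is omitted):

```python
import torch

model = HomographyNet()                    # the network sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=0.006)
criterion = torch.nn.L1Loss()              # L1 norm between predicted and true H4pt

for epoch in range(100):
    for patches, h4pt_gt in train_loader:  # hypothetical loader, batch size 64
        optimizer.zero_grad()
        loss = criterion(model(patches), h4pt_gt)
        loss.backward()
        optimizer.step()
```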
Experiments and Results: The model is trained on the dataset of 150,000 image pairs for 100 epochs, taking about 8 hours on a single NVIDIA RTX GPU. Each image is resized to 320x320 and subsequently converted to gray-scale.
Unsupervised Approach
The unsupervised learning approach is a type of machine learning in which the model is trained using data that are not labeled, meaning that the input data do not have ground truth for comparison. In unsupervised learning, the goal is for the model to find patterns, relationships, or structures in the data without the need for explicit supervision.
HomographyNet: The model pipeline is an extension of the previous model: the same HomographyNet model is used, and two more blocks are added, namely the TensorDLT and Spatial Transformer blocks. HomographyNet works the same way in the forward pass, producing the same H4pt output, which is then processed as follows (see the sketch after this list):
- The TensorDLT layer takes this output and the corners CA and calculates the estimated homography H̃BA.
- The spatial transformer takes the original image IA and warps it using estimated homography to give an estimated image I˜B .
- Using the coordinates CA we extract the estimated patch P̃B from the estimated image I˜B .
- The patches PB and P̃B are compared using L1 loss function and the weights of the model are updated to reduce this loss.
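A sketch of one training step of this pipeline using kornia's differentiable warping utilities (get_perspective_transform here stands in for the TensorDLT layer; extract_patches is a hypothetical crop helper):

```python
import torch
import kornia.geometry.transform as KT

def unsupervised_step(model, img_a, patch_a, patch_b, ca, criterion):
    """One forward/backward pass of the unsupervised pipeline.

    img_a:   (B, 1, 320, 320) full gray images IA
    patch_a: (B, 1, 128, 128) patches PA;  patch_b: target patches PB
    ca:      (B, 4, 2) corner coordinates CA of each patch
    """
    h4pt = model(torch.cat([patch_a, patch_b], dim=1))

    # TensorDLT step: CA plus the predicted offsets give CB, and a
    # differentiable DLT recovers the homography mapping B back to A
    cb = ca + h4pt.view(-1, 4, 2)
    h_ba = KT.get_perspective_transform(cb, ca)

    # Spatial transformer: warp IA into the estimated image I~B ...
    img_b_est = KT.warp_perspective(img_a, h_ba, dsize=img_a.shape[-2:])
    # ... and extract the estimated patch P~B at the coordinates CA
    patch_b_est = extract_patches(img_b_est, ca)   # hypothetical crop helper

    loss = criterion(patch_b_est, patch_b)         # photometric L1 loss
    loss.backward()
    return loss
```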
Hyperparameters: The hyperparameters used for the model are as follows:
- optimizer: SGD
- learning rate: 0.0001
- loss metric: L1 norm
- epochs: 50
Experiments and Results: The model is trained on the dataset of 150,000 image pairs for 50 epochs, taking about 12 hours on an NVIDIA A30 GPU. All images are resized to 320x320 and converted to grayscale, as done for the supervised training. The inputs are provided to the model and the H4pt values are obtained, which are given to the TensorDLT layer along with CA to obtain the estimated homography matrix. We use the kornia perspective-warp transform to warp the image IA and obtain the estimated image I˜B. Applying the coordinates CA to I˜B gives us the estimated patch P̃B. The L1 loss is calculated between PB and P̃B, and the model weights are updated based on the loss value.