# Qiaojun Feng

### CSE 291I: Machine Learning for 3D Data - Multidimensional Scaling

This was from an assignment of CSE 291I: ML on 3D Data in Winter 2018.

Wikipedia describes Multidimensional Scaling (MDS) as a means of visualizing the level of similarity of individual cases of a dataset. That is the application-level view. Essentially, Multidimensional Scaling can be regarded as a "distance-preserving" embedding of the data into a new Euclidean space. When the new Euclidean space has a low dimension such as 2D or 3D, we can use it for visualization.

To make distance preservation concrete, we can define an optimization problem. Suppose there are $$n$$ points in $$N$$-dimensional space, i.e. $$x_1,x_2,\dots,x_n \in \mathbb{R}^N$$, and we want to embed them into $$M$$-dimensional space, with corresponding embeddings $$x'_1,x'_2,\dots,x'_n \in \mathbb{R}^M$$. Define the Euclidean distances between two points as $$d_{ij} = \|x_i-x_j\|_2$$ and $$d'_{ij} = \|x'_i-x'_j\|_2$$. The goal is to preserve the pairwise distances as much as possible, i.e. to find the $$x'_1,x'_2,\dots,x'_n$$ that minimize

$\text{Stress}(x'_1,x'_2,\dots,x'_n) = \left(\sum_{i<j}\left(d'_{ij}-d_{ij}\right)^2\right)^{\frac{1}{2}}$

A more general definition of the optimization objective is

$\text{Stress}(x'_1,x'_2,\dots,x'_n) = \left(\sum_{i<j} w_{ij}\left(d'_{ij}-f(d_{ij})\right)^2\right)^{\frac{1}{2}}$

where $$f$$ transforms the input distances and $$w_{ij}$$ weights each pair. For example, Sammon mapping keeps $$f$$ as the identity and weights each pair by the inverse of its original distance, emphasizing small distances:

$\text{Sammon Stress}(x'_1,x'_2,\dots,x'_n) = \frac{1}{\sum_{i<j}d_{ij}}\sum_{i<j}\frac{\left(d'_{ij}-d_{ij}\right)^2}{d_{ij}}$

When $$n$$ is small, the first Stress above admits an analytical solution (classical MDS). But since that route needs the eigenvalue decomposition of an $$n \times n$$ matrix, it does not scale to large $$n$$, which is usually the case. Instead, we can solve the optimization problem by gradient descent. I implemented it in TensorFlow, and there were two points I found interesting during the implementation.
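For reference, the analytical route can be sketched as classical MDS: double-center the squared distance matrix and take the top eigenvectors. A minimal NumPy sketch, not the assignment's code (the function name is mine):

```python
import numpy as np

def classical_mds(D, m=2):
    """Analytical MDS: embed points given their pairwise Euclidean distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:m]         # keep the top-m eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))  # (n, m) embedding
```

If the points truly live in an $$m$$-dimensional Euclidean space, this recovers them up to rotation and translation, so all pairwise distances are preserved exactly; the `eigh` call is the $$O(n^3)$$ step that makes it impractical for large $$n$$.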

**1. Matrix operations for distance calculation**

A naive implementation of the pairwise distance matrix loops over all $$O(n^2)$$ pairs. Instead of two nested for-loops, we can note that (taking $$x_i$$ as a row vector of $$X$$)

\begin{aligned} d_{ij} &= \sqrt{(x_i-x_j)(x_i-x_j)^T} = \sqrt{x_ix_i^T-2x_ix_j^T+x_jx_j^T} \\ D &= \sqrt{r\mathbf{1}^T-2XX^T+\mathbf{1}r^T} \\ r &= [\|x_1\|_2^2,\|x_2\|_2^2,\dots,\|x_n\|_2^2]^T \end{aligned}

where $$\mathbf{1}$$ is the all-ones column vector (plain broadcasting in NumPy/TensorFlow) and the square root is taken element-wise.
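In NumPy the identity above turns into a single matrix product plus broadcasting; a sketch (the function name is mine):

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X, without explicit loops."""
    r = (X ** 2).sum(axis=1)                        # r_i = ||x_i||^2
    sq = r[:, None] - 2.0 * (X @ X.T) + r[None, :]  # squared distances by broadcasting
    return np.sqrt(np.maximum(sq, 0.0))             # clamp tiny round-off negatives
```

The `maximum` guards against small negative values that floating-point cancellation can produce on (or near) the diagonal.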

**2. Gradient explosion from the square root**

I struggled for quite a long time while implementing this, getting NaN results after only a few iterations. After trying a few different hyperparameters, I went back and wrote down the calculation by hand. It turned out that the square-root operation in the distance matrix computation was the problem.

$y = \sqrt{x} \quad \frac{\partial y}{\partial x} = \frac{1}{2\sqrt{x}} \quad \nabla x = \nabla y \frac{\partial y}{\partial x}$

Here $$y$$ is the Euclidean distance between two points and $$x$$ is its square, the squared distance. During backpropagation, if $$\sqrt{x}$$ is very small, then even a modest $$\nabla y$$ produces a very large $$\nabla x$$, causing the gradient to explode. The solution is to constrain $$\nabla x$$ by clipping it to a maximum value; adding a small constant inside the square root is another common fix.
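A quick numeric check of how $$\frac{1}{2\sqrt{x}}$$ blows up, with the two countermeasures just mentioned (the clipping threshold and $$\epsilon$$ below are arbitrary illustrative values):

```python
import numpy as np

def sqrt_grad(x):
    # derivative of y = sqrt(x); diverges as x -> 0
    return 0.5 / np.sqrt(x)

g = sqrt_grad(1e-8)                  # squared distance near zero: g is about 5000
clipped = np.clip(g, -1e3, 1e3)      # fix 1: cap the gradient magnitude
eps = 1e-6
safe = 0.5 / np.sqrt(1e-8 + eps)     # fix 2: epsilon inside the square root
```

In TensorFlow the first fix corresponds to clipping the gradients before applying them; the second changes the forward pass so the problematic derivative never appears.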

We use the MNIST dataset (originally in $$28^2=784$$-dimensional space) and embed it into 2D space for visualization. To map $$X \in \mathbb{R}^{n \times 784}$$ to $$X' \in \mathbb{R}^{n \times 2}$$ we use a simple matrix multiplication.

$W \in \mathbb{R}^{784 \times 2} \quad X' = XW$
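Putting the pieces together, the whole pipeline can be sketched in NumPy with a hand-derived gradient of the first Stress objective. This is an illustrative sketch, not the original TensorFlow code: the data is random (standing in for MNIST), the sizes are shrunk, and the learning rate and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 30, 5, 2                        # toy sizes standing in for (n, 784, 2)
X = rng.standard_normal((n, N))           # random data standing in for MNIST

def dist(Z, eps=1e-9):
    """Pairwise distance matrix; eps keeps the sqrt gradient finite."""
    r = (Z ** 2).sum(axis=1)
    sq = np.maximum(r[:, None] - 2.0 * Z @ Z.T + r[None, :], 0.0)
    return np.sqrt(sq + eps)

def stress(Dp, D):
    iu = np.triu_indices_from(D, k=1)
    return np.sqrt(((Dp - D) ** 2)[iu].sum())

D = dist(X)
W = 0.1 * rng.standard_normal((N, M))     # the linear embedding, X' = X W
s0 = stress(dist(X @ W), D)

for _ in range(500):
    Xp = X @ W
    Dp = dist(Xp)
    # gradient of sum_{i<j} (d'_ij - d_ij)^2 with respect to the embedded points
    R = 2.0 * (Dp - D) / Dp
    np.fill_diagonal(R, 0.0)
    grad_Xp = R.sum(axis=1, keepdims=True) * Xp - R @ Xp
    W -= 2e-6 * X.T @ grad_Xp             # chain rule through X' = X W

s1 = stress(dist(X @ W), D)               # stress should have decreased
```

Minimizing the sum of squares instead of its square root changes nothing, since the square root is monotonic; an autodiff framework such as TensorFlow derives the same gradient automatically, which is where the square-root explosion discussed above shows up.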

The images below show the results.

*Random initial state*

*Final converged state (the gray part is the digit "1")*