@DarwinSenior
Created October 19, 2016 03:58
Problem 1 - VC Dimension

Triangle

The triangle in 2 dimensions has VC dimension 7. To prove this we consider the following:

  1. The triangle cannot shatter a point set when some point is not a vertex of the convex hull. Separate the points into two groups: the vertices of the convex hull, $p_{out} ∈ P$, and the points strictly inside the convex hull, $p_{in} ∈ P$. Then no triangle can realize the labeling in which all $p_{out}$ are positive and some $p_{in}$ is negative: if every $p_{out}$ lies inside the triangle, the whole convex hull lies inside the triangle, and since every $p_{in}$ lies inside the convex hull, every $p_{in}$ must also be labeled positive. So we only need to consider points in convex position.
  2. We are unable to shatter a convex hull of 8 vertices. Consider the convex hull of 8 vertices as illustrated in fig1. No triangle can realize the following configuration. Positive: A, C, E, G; Negative: B, D, F, H. Separating the four alternating negative vertices from the four positive ones would require four separating edges, but a triangle has only three sides.
  3. We are able to shatter a convex hull of 7 vertices. If our convex hull has 7 vertices and is symmetric, as illustrated in fig1, we have the following configurations (up to symmetry):
    | Positive | Negative | Triangle |
    | --- | --- | --- |
    | A | B,C,D,E,F,G | oga |
    | A,B | C,D,E,F,G | ABo |
    | A,C | B,D,E,F,G | ACo |
    | A,D | B,C,E,F,G | ADp |
    | A,B,C | D,E,F,G | ABC |
    | A,B,D | C,E,F,G | ABD |
    | A,B,E | C,D,F,G | ABE |
    | A,C,E | B,D,F,G | ACE |
    | A,B,C,D | E,F,G | ADY |
    | A,B,C,E | D,F,G | ADB |
    | A,B,D,F | C,E,G | aFD |
    | A,B,C,D,E | F,G | AEc |
    | A,B,C,D,F | E,G | AFc |
    | A,B,C,D,E,F | G | eac |

Thus, we see that the triangle has VC dimension 7.
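The primitive behind all of these checks is a point-in-triangle test: a labeling is realizable iff some triangle contains exactly the positive points. A minimal sketch in Python (the `in_triangle` helper and the heptagon coordinates are illustrative, not from the original write-up):

```python
import numpy as np

def in_triangle(p, a, b, c):
    """True if point p lies inside (or on) triangle abc, by checking
    that the three edge cross products all have a consistent sign."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = min(d1, d2, d3) < 0
    has_pos = max(d1, d2, d3) > 0
    return not (has_neg and has_pos)

# 7 points in symmetric convex position (a regular heptagon, as in fig1)
pts = [(np.cos(2 * np.pi * i / 7), np.sin(2 * np.pi * i / 7)) for i in range(7)]

# a triangle containing the whole heptagon labels every point +
big = ((-10, -10), (10, -10), (0, 10))
labels = [in_triangle(p, *big) for p in pts]
```

Enumerating candidate triangles against all $2^7$ labelings this way would verify the shattering claim exhaustively.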

Hyper-Rectangle

The VC dimension of an axis-aligned hyper-rectangle in $d$ dimensions is $2d$, by the following reasoning.

  1. There exists a configuration of $2d$ points that is shattered by the hyper-rectangle. In the $d$-dimensional case we can use the following configuration: each point lies on its own coordinate axis at distance one from the origin. So, for points $p_1, \dots, p_d$, the $j$-th component of the $i$-th point is

$$(\vec{p}_i)_j = δ(i,j)$$

And for points $p_{d+1}, \dots, p_{2d}$ the $j$-th component of the $i$-th point is

$$(\vec{p}_i)_j = -δ(i-d, j)$$

Thus, for a labeling $L = (l_1, \dots, l_{2d})$, $l_i ∈ \{+, -\}$, the hyper-rectangle with boundary $$∀ i ∈ [1..d]: -f(l_{i+d}) ≤ x_i ≤ f(l_i)$$ realizes it, where we define

\[ f(l) ≡ \begin{cases} 1 & l=+ \\
0 & l=- \end{cases} \]

Thus, the hyper-rectangle has VC dimension at least $2d$.
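This lower-bound construction is easy to verify mechanically. A sketch, assuming the labels are given as a sequence of `+`/`-` over the $2d$ axis points (`rect_from_labels` is a hypothetical helper implementing the box above):

```python
import itertools
import numpy as np

d = 3
# points p_1..p_d on the positive axes, p_{d+1}..p_{2d} on the negative axes
points = np.vstack([np.eye(d), -np.eye(d)])

def rect_from_labels(labels):
    """Build the box -f(l_{i+d}) <= x_i <= f(l_i) with f(+)=1, f(-)=0."""
    f = lambda l: 1.0 if l == '+' else 0.0
    lo = np.array([-f(labels[i + d]) for i in range(d)])
    hi = np.array([f(labels[i]) for i in range(d)])
    return lo, hi

def contains(lo, hi, x):
    return bool(np.all(lo <= x) and np.all(x <= hi))

# every labeling of the 2d points is realized by its box -> shattered
shattered = all(
    all(contains(*rect_from_labels(lab), p) == (lab[i] == '+')
        for i, p in enumerate(points))
    for lab in itertools.product('+-', repeat=2 * d)
)
```

All $2^{2d}$ labelings check out, since the box bound on axis $i$ is controlled independently of the other points, which sit at 0 in that coordinate.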

  2. The hyper-rectangle in $d$ dimensions cannot shatter any configuration of $2d+1$ points.

    For any such configuration, no hyper-rectangle can realize the following labeling:

    • label + every point that achieves the maximum in some coordinate,

      $$∃ j: ∀ p' ∈ P,\ \vec{p}_j ≥ \vec{p'}_j$$

    • label + every point that achieves the minimum in some coordinate,

      $$∃ j: ∀ p' ∈ P,\ \vec{p}_j ≤ \vec{p'}_j$$

    • label the rest of the points -

    Since at most $d$ points achieve a coordinate maximum and at most $d$ points achieve a coordinate minimum, at least one point is labeled -. But any hyper-rectangle that contains all the positive points spans, in every coordinate, the full range of the point set, and hence contains every point. So it labels at least one point wrong, and no configuration of $2d+1$ points can be shattered.

    Thus, the VC dimension of the hyper-rectangle in $d$ dimensions is less than $2d+1$, i.e. exactly $2d$.
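The upper-bound argument can be checked numerically: with the coordinate-wise extremes labeled positive, the smallest box containing the positive points already covers every remaining point. A sketch with random points (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
pts = rng.standard_normal((2 * d + 1, d))  # 2d+1 points in d dimensions

# label the coordinate-wise extremes '+', everything else '-'
extreme = set(np.argmax(pts, axis=0)) | set(np.argmin(pts, axis=0))
negatives = [i for i in range(len(pts)) if i not in extreme]

# the smallest box containing the '+' points spans the full range in
# every coordinate, so it covers every '-' point as well
pos = pts[sorted(extreme)]
lo, hi = pos.min(axis=0), pos.max(axis=0)
covered = all(np.all(lo <= pts[i]) and np.all(pts[i] <= hi) for i in negatives)
```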

Problem 2 - Kernels

Dual representation

The dual representation of the perceptron is

$$\vec{w} = \sum_j α_j y_j \vec{x}_j$$

so that the prediction depends on the inputs only through inner products, $\vec{w}^T\vec{x} = \sum_j α_j y_j (\vec{x}_j^T\vec{x})$, where $α_j$ is the number of mistakes the perceptron algorithm made on the $j$-th input.
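This dual form is what makes the kernel trick work: training and prediction touch the inputs only through inner products. A minimal kernel-perceptron sketch (the function name and toy data are illustrative):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Dual perceptron: alpha[j] counts the mistakes on example j, so the
    implicit weight vector is w = sum_j alpha[j] * y[j] * phi(x_j)."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(epochs):
        for i in range(n):
            # the prediction uses only inner products, never w itself
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1
    return alpha

# toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, y, kernel=np.dot)
preds = np.sign([np.sum(alpha * y * np.array([np.dot(xj, x) for xj in X]))
                 for x in X])
```

Swapping `np.dot` for any valid kernel turns this into a nonlinear classifier without changing the training loop.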

Prove Kernel Function

$K_1(x, z) = x^Tz$ is a valid linear kernel.

Since any polynomial with non-negative coefficients of a valid kernel is also a valid kernel, $K(x, z) = P(K_1(x, z))$ with $P(f) ≡ f^3+400f^2+100f$ is also a valid kernel.
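One can spot-check this numerically: the Gram matrix of $P(K_1)$ on random data should be positive semi-definite, since the elementwise powers of a Gram matrix are themselves Gram matrices (Schur product theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
K1 = X @ X.T                       # Gram matrix of the linear kernel x^T z

# apply P elementwise: P(f) = f^3 + 400 f^2 + 100 f
K = K1**3 + 400 * K1**2 + 100 * K1

# all eigenvalues should be non-negative up to numerical tolerance
eigs = np.linalg.eigvalsh(K)
psd = eigs.min() > -1e-6 * max(1.0, abs(eigs).max())
```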

Show kernel function for DNF

To represent monotone DNF with terms of at most $k$ variables, we need the following feature space:

$$\mathcal{H} = \Big\{ \bigwedge_{i ∈ S} x_i \;\Big|\; S ⊆ \{1,\dots,n\},\ |S| ≤ k \Big\}$$

And thus, the function can be expressed as a linear combination of

$$\Big\{ ∏_{i ∈ S} x_i \;\Big|\; S ⊆ \{1,\dots,n\},\ |S| ≤ k \Big\}$$

And thus, we can look for a kernel that maps to the space $φ(x) = \langle ξ_1 \cdots ξ_k \mid ξ_1,\dots,ξ_k ∈ \{x_1,\dots,x_n,1\} \rangle$.

Thus, we propose the kernel $K(\vec{x}, \vec{z}) = (\vec{x}^T\vec{z}+c)^k$;

this maps (up to constant factors on the coordinates) to the space $φ(x) = \langle ξ_1 \cdots ξ_k \mid ξ_1,\dots,ξ_k ∈ \{x_1,\dots,x_n,\sqrt{c}\} \rangle$.
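For the degree-2 case the expansion is small enough to write out and check. Assuming $x ∈ \mathbb{R}^2$, an explicit feature map for $(\vec{x}^T\vec{z}+c)^2$ is $(x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c)$:

```python
import numpy as np

c = 1.0

def phi(x):
    # explicit feature map whose inner product equals (x.z + c)^2, x in R^2
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
lhs = (x @ z + c) ** 2      # kernel evaluation: one inner product in R^2
rhs = phi(x) @ phi(z)       # explicit evaluation in the 6-dim feature space
```

The kernel computes the same value with $O(n)$ work instead of enumerating all $O(n^k)$ monomials.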

Problem 3 - Neural Network

Neural Network

Representation

Following the book (Machine Learning, Tom Mitchell), I use the following notation for the neural network:

  • $x_{ji}$ = the $i$-th input to unit $j$
  • $w_{ji}$ = the weight associated with the $i$-th input to unit $j$
  • $net_j = \sum_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit $j$)
  • $o_j$ = the output computed by unit $j$
  • $t_j$ = the target output for unit $j$
  • $R(x)$ = the ReLU function
  • $outputs$ = the set of units in the final layer of the network
  • $Downstream(j)$ = the set of units whose immediate inputs include the output of unit $j$

Training Rules for output unit weights

\begin{eqnarray*}
\frac{∂ E_d}{∂ o_j} & = & \frac{∂}{∂ o_j}\frac{1}{2}\sum_{k ∈ outputs}(t_k-o_k)^2 \\
& = & -(t_j-o_j) \\
\frac{∂ o_j}{∂ net_j} & = & \frac{∂ R(net_j)}{∂ net_j} \\
& = & \frac{\max(o_j, 0)}{o_j} \\
\frac{∂ E_d}{∂ net_j} & = & \frac{∂ E_d}{∂ o_j}\frac{∂ o_j}{∂ net_j} \\
& = & -(t_j-o_j)\frac{\max(o_j, 0)}{o_j} \\
Δ w_{ji} & = & -η \frac{∂ E_d}{∂ w_{ji}} \\
& = & η\,(t_j-o_j)\frac{\max(o_j, 0)}{o_j}\,x_{ji}
\end{eqnarray*}
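Numerically, the update amounts to multiplying the output error by a 0/1 gate on the ReLU's pre-activation. A small sketch (the helper name `output_delta_w` is illustrative):

```python
import numpy as np

def output_delta_w(t, x, w, eta):
    """One gradient step for a ReLU output unit:
    delta w_ji = eta * (t_j - o_j) * R'(net_j) * x_ji."""
    net = float(np.dot(w, x))
    o = max(net, 0.0)                  # o_j = R(net_j)
    dR = 1.0 if net > 0 else 0.0       # the max(o_j, 0)/o_j indicator
    return eta * (t - o) * dR * np.asarray(x)

x = np.array([1.0, 2.0])
w = np.array([0.5, 0.5])
dw = output_delta_w(t=2.0, x=x, w=w, eta=0.1)
# net = 1.5, o = 1.5, error = 0.5, so dw = 0.1 * 0.5 * x
```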

Training Rule for hidden unit weights

The same steps apply to the hidden units; thus, we have the following equations
\begin{eqnarray*}
\frac{∂ E_d}{∂ net_j} & = & -\sum_{k ∈ Downstream(j)} δ_k w_{kj} \frac{\max(o_j, 0)}{o_j} \\
δ_j & = & \frac{\max(o_j, 0)}{o_j} \sum_{k ∈ Downstream(j)} δ_k w_{kj} \\
Δ w_{ji} & = & η\, δ_j\, x_{ji}
\end{eqnarray*}
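The hidden-unit rule can be sketched the same way, with $δ_j$ gated by the ReLU and weighted by the downstream deltas (names and numbers are illustrative):

```python
import numpy as np

def hidden_delta(o_j, downstream_deltas, downstream_weights):
    """delta_j = R'(net_j) * sum_k delta_k * w_kj for a ReLU hidden unit;
    for ReLU, o_j > 0 iff net_j > 0, so o_j suffices to evaluate the gate."""
    dR = 1.0 if o_j > 0 else 0.0
    return dR * float(np.dot(downstream_deltas, downstream_weights))

# hypothetical hidden unit with output 0.8 feeding two output units
delta_j = hidden_delta(0.8, np.array([0.5, -0.2]), np.array([1.0, 2.0]))
dw = 0.1 * delta_j * np.array([1.0, 3.0])   # eta * delta_j * x_ji
```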

Test on the data

The first experiment runs the neural network on the data with different hyperparameters. The following are the resulting accuracies (%):

| Learning rate | Hidden nodes | tanh, batch 10 | tanh, batch 50 | tanh, batch 100 | ReLU, batch 10 | ReLU, batch 50 | ReLU, batch 100 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.1 | 10 | 100 | 87 | 69 | 100 | 100 | 98 |
| 0.1 | 50 | 100 | 98 | 85 | 89 | 95 | 100 |
| 0.01 | 10 | 69 | 53 | 51 | 87 | 60 | 53 |
| 0.01 | 50 | 80 | 58 | 56 | 100 | 95 | 61 |

As we can see, multiple settings reach 100% performance on the simple data sets. We make the following observations:
