The triangle in 2 dimensions has VC dimension 7. To prove this we consider the following:
- The triangle cannot shatter a point set when some point is not a vertex of the convex hull. Separate the points into two groups: the vertices of the convex hull, $p_{out} ∈ P$, and the points inside the convex hull, $p_{in} ∈ P$. The triangle cannot realize the labeling in which all $p_{out}$ are positive and all $p_{in}$ are negative: if every $p_{out}$ is labeled positive, then every $p_{out}$ lies inside the triangle, so the whole convex hull lies inside the triangle. Since every $p_{in}$ lies inside the convex hull, every $p_{in}$ would then also be labeled positive, contradicting the labeling. Hence only point sets in convex position can be shattered.
- We are unable to shatter a convex hull of 8 vertices. Consider the convex hull of 8 vertices illustrated in fig1. No triangle can separate the configuration Positive: A, C, E, G; Negative: B, D, F, H. Intuitively, for this symmetric configuration each edge of the triangle can cut off at most one negative vertex, since the chord between any two negative vertices passes through the convex hull of the positive ones; three edges therefore cannot exclude all four negatives.
- We are able to shatter a convex hull of 7 vertices. If our convex hull has 7 vertices and is symmetric, as illustrated in fig1, each labeling (up to symmetry) is realized by the following configurations:
| Positive | Negative | Triangle |
|---|---|---|
| A | B,C,D,E,F,G | oga |
| A,B | C,D,E,F,G | ABo |
| A,C | B,D,E,F,G | ACo |
| A,D | B,C,E,F,G | ADp |
| A,B,C | D,E,F,G | ABC |
| A,B,D | C,E,F,G | ABD |
| A,B,E | C,D,F,G | ABE |
| A,C,E | B,D,F,G | ACE |
| A,B,C,D | E,F,G | ADY |
| A,B,C,E | D,F,G | ADB |
| A,B,D,F | C,E,G | aFD |
| A,B,C,D,E | F,G | AEc |
| A,B,C,D,F | E,G | AFc |
| A,B,C,D,E,F | G | eac |
Thus, we see that the triangle has VC dimension 7.
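The 8-vertex impossibility argument above can be sanity-checked numerically. The sketch below is my own code and assumes the fig1 configuration is a regular octagon (the point names and the sampling range are choices of mine, not taken from the assignment): it samples random triangles and checks that every triangle covering A, C, E, G also covers at least one of B, D, F, H.

```python
import math
import random

random.seed(1)
octagon = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]
pos = octagon[0::2]   # A, C, E, G
neg = octagon[1::2]   # B, D, F, H

def in_tri(p, a, b, c):
    # p lies inside triangle abc iff it is on the same side of all three edges
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 >= 0 and d2 >= 0 and d3 >= 0) or (d1 <= 0 and d2 <= 0 and d3 <= 0)

covering, violations = 0, 0
for _ in range(20000):
    tri = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(3)]
    if all(in_tri(p, *tri) for p in pos):
        covering += 1
        if not any(in_tri(p, *tri) for p in neg):
            violations += 1   # would mean a triangle realized the labeling
print("triangles covering all positives:", covering, "violations:", violations)
```

`violations` stays 0: the segment between any two of B, D, F, H passes through the convex hull of A, C, E, G, so no single triangle edge can exclude two negatives at once.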
The VC dimension of an axis-aligned hyper-rectangle in $n$ dimensions is $2n$, which follows from the reasoning below.
- There exists a configuration of $2n$ points that can be shattered by the hyper-rectangle.
For the $n$-dimensional case, we can use the following point configuration: each point lies on its own coordinate axis at distance one from the origin. So, for points $p_1, \dots, p_n$, the $j$-th component of the $i$-th point is
\[
(p_i)_j = \begin{cases}
1 & i = j \\
0 & i \neq j
\end{cases}
\]
and for points $p_{n+1}, \dots, p_{2n}$, the $j$-th component of the $i$-th point is
\[
(p_{n+i})_j = \begin{cases}
-1 & i = j \\
0 & i \neq j
\end{cases}
\]
Thus, for a labeling $l ∈ \{+,-\}^{2n}$, define
\[
f(l) ≡ \begin{cases}
1 & l=+ \\
0 & l=-
\end{cases}
\]
and choose the hyper-rectangle $\prod_{j=1}^{n} [-f(l_{n+j}),\, f(l_j)]$, which contains exactly the points labeled $+$.
Thus, the hyper-rectangle has VC dimension at least $2n$.
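This construction can be verified by brute force for small $n$. The sketch below is my own code (with $n=3$ chosen for illustration): for each labeling it builds the rectangle whose extent along axis $j$ is $[-f(l_{n+j}), f(l_j)]$ and checks that it contains exactly the positive points.

```python
import itertools

n = 3
# p_1..p_n = +e_i and p_{n+1}..p_{2n} = -e_i: one point per axis, distance 1
points = [[1 if j == i else 0 for j in range(n)] for i in range(n)] \
       + [[-1 if j == i else 0 for j in range(n)] for i in range(n)]

def in_box(p, lo, hi):
    return all(lo[j] <= p[j] <= hi[j] for j in range(n))

shattered = True
for labels in itertools.product([0, 1], repeat=2 * n):   # 1 means '+', 0 means '-'
    hi = [labels[j] for j in range(n)]        # upper bound along axis j:  f(l_j)
    lo = [-labels[n + j] for j in range(n)]   # lower bound along axis j: -f(l_{n+j})
    if any(in_box(p, lo, hi) != bool(lab) for p, lab in zip(points, labels)):
        shattered = False
print("all", 2 ** (2 * n), "labelings realized:", shattered)  # True
```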
- The hyper-rectangle in $n$ dimensions cannot shatter any configuration of $2n+1$ points. For any such configuration, there is no hyper-rectangle realizing the following labeling:
- label all the points that dominate on at least one axis $+$:
$$∃ j,\; \vec{p}_j ≥ \vec{p'}_j\; ∀ p' ∈ P$$
- label all the points that are dominated on at least one axis $+$:
$$∃ j,\; \vec{p}_j ≤ \vec{p'}_j\; ∀ p' ∈ P$$
- label the rest of the points $-$
Since at most $n$ points dominate and at most $n$ points are dominated, at least one point is labeled $-$. If the hyper-rectangle correctly labels all the positive points, it contains the coordinate-wise maximum and minimum of the set along every axis, and therefore contains every point; so it would label every point $+$, including the point labeled $-$. Hence it labels at least one point wrong, and no configuration of $2n+1$ points can be shattered. Thus, the VC dimension of the hyper-rectangle in $n$ dimensions is less than $2n+1$.
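The upper-bound argument can also be checked empirically. This sketch (my own code; random points in $n=3$ dimensions) labels the coordinate-wise extremal points $+$ and confirms that the smallest box containing them always swallows some leftover $-$ point, so the adversarial labeling is never realizable.

```python
import random

n, trials = 3, 100
random.seed(0)
failures = 0
for _ in range(trials):
    pts = [[random.random() for _ in range(n)] for _ in range(2 * n + 1)]
    # the coordinate-wise extremal points get labeled '+' (at most 2n of them)
    pos = set()
    for j in range(n):
        pos.add(max(range(len(pts)), key=lambda i: pts[i][j]))
        pos.add(min(range(len(pts)), key=lambda i: pts[i][j]))
    neg = [i for i in range(len(pts)) if i not in pos]
    # smallest box containing the positives
    lo = [min(pts[i][j] for i in pos) for j in range(n)]
    hi = [max(pts[i][j] for i in pos) for j in range(n)]
    # a leftover '-' point must exist and must lie inside that box
    if not neg or not all(lo[j] <= pts[i][j] <= hi[j] for i in neg for j in range(n)):
        failures += 1
print("trials where the labeling was realizable:", failures)  # 0
```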
The dual representation of the perceptron is
\[
f(x) = \mathrm{sign}\left(∑_i α_i y_i K(x_i, x)\right),
\]
where $α_i$ is the number of mistakes the perceptron has made on training example $x_i$ and $K$ is the kernel.
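As a concrete illustration of this dual form, here is a minimal kernel perceptron sketch. The code, the degree-2 polynomial kernel, and the XOR-style data are my own choices for illustration, not taken from the assignment.

```python
def kernel_perceptron_train(xs, ys, kernel, epochs=10):
    """Dual perceptron: keep a mistake count alpha_i for each training example."""
    alpha = [0] * len(xs)
    for _ in range(epochs):
        for t in range(len(xs)):
            s = sum(a * y * kernel(xi, xs[t]) for a, xi, y in zip(alpha, xs, ys))
            if ys[t] * s <= 0:      # mistake: remember this example
                alpha[t] += 1
    return alpha

def predict(alpha, xs, ys, kernel, x):
    s = sum(a * y * kernel(xi, x) for a, xi, y in zip(alpha, xs, ys))
    return 1 if s > 0 else -1

# a polynomial of a valid kernel (non-negative coefficients) is again valid
poly = lambda u, v: (1 + sum(ui * vi for ui, vi in zip(u, v))) ** 2

# XOR-like data: not linearly separable in input space, separable under poly
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [-1, 1, 1, -1]
alpha = kernel_perceptron_train(xs, ys, poly, epochs=10)
preds = [predict(alpha, xs, ys, poly, x) for x in xs]
print(preds)  # [-1, 1, 1, -1]
```

The mistake counts $α_i$ play the role of the dual variables: the weight vector is never formed explicitly, and only kernel evaluations are needed.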
Since any polynomial of a valid kernel with non-negative coefficients is also a valid kernel, the kernels constructed below are valid.
To represent monotone DNF formulas bounded by $K$ terms, we need the following feature space.
The target function can then be expressed as a linear combination of these features.
Thus, we can find a kernel that maps to this space.
The kernel we propose can be expressed as follows, and it maps to the space above.
Following the book (Machine Learning, Tom Mitchell), I use the following notation for the neural network:
- $x_{ji}$ = the $i$th input to unit $j$
- $w_{ji}$ = the weight associated with the $i$th input to unit $j$
- $net_j = ∑_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit $j$)
- $o_j$ = the output computed by unit $j$
- $t_j$ = the target output for unit $j$
- $R(x)$ = the ReLU function
- $outputs$ = the set of units in the final layer of the network
- $Downstream(j)$ = the set of units whose immediate inputs include the output of unit $j$
\begin{eqnarray*}
\frac{∂ E_d}{∂ o_j} & = & \frac{∂}{∂ o_j}\frac{1}{2}∑_{k ∈ outputs}(t_k-o_k)^2 \\
& = & -(t_j-o_j) \\
\frac{∂ o_j}{∂ net_j} & = & \frac{∂ R(net_j)}{∂ net_j} \\
& = & \frac{\max(o_j, 0)}{o_j} \\
\frac{∂ E_d}{∂ net_j} & = & \frac{∂ E_d}{∂ o_j}\frac{∂ o_j}{∂ net_j} \\
& = & -\frac{(t_j-o_j)\max(o_j, 0)}{o_j} \\
Δ w_{ji} & = & -η \frac{∂ E_d}{∂ w_{ji}} \\
& = & η \frac{(t_j-o_j)\max(o_j, 0)}{o_j}x_{ji}
\end{eqnarray*}
Training Rule for hidden unit weights
The same steps apply to the hidden units; thus, we have the following equations:
\begin{eqnarray*}
\frac{∂ E_d}{∂ net_j} & = & ∑\limits_{k ∈ Downstream(j)} -δ_k w_{kj} \frac{\max(o_j, 0)}{o_j} \\
δ_j & = & \frac{\max(o_j,0)}{o_j} ∑\limits_{k ∈ Downstream(j)} δ_k w_{kj} \\
Δ w_{ji} & = & η δ_j x_{ji}
\end{eqnarray*}
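These update rules can be checked numerically. The sketch below is my own code on a made-up two-unit network (one hidden ReLU unit feeding one ReLU output unit); it computes the deltas as in the equations above, implementing the factor $\max(o_j,0)/o_j$ as the indicator $net_j > 0$, and compares $∂E_d/∂w$ against a central finite-difference estimate.

```python
def relu(z):
    return max(z, 0.0)

def relu_grad(z):
    # max(o_j, 0)/o_j equals 1 when the unit is active and 0 otherwise
    return 1.0 if z > 0 else 0.0

# minimal network: inputs x -> hidden ReLU unit h -> ReLU output unit o
def forward(w_h, w_o, x):
    net_h = sum(wi * xi for wi, xi in zip(w_h, x))
    o_h = relu(net_h)
    net_o = w_o * o_h
    return net_h, o_h, net_o, relu(net_o)

def loss(w_h, w_o, x, t):
    return 0.5 * (t - forward(w_h, w_o, x)[3]) ** 2

def deltas(w_h, w_o, x, t):
    net_h, o_h, net_o, o = forward(w_h, w_o, x)
    d_o = (t - o) * relu_grad(net_o)       # output unit: (t_j - o_j) R'(net_j)
    d_h = relu_grad(net_h) * d_o * w_o     # hidden unit: R'(net_j) sum_k delta_k w_kj
    return d_o, d_h, o_h

# made-up weights/input chosen so both units are active
w_h, w_o, x, t, eps = [0.3, 0.2], 0.5, [1.0, 2.0], 1.0, 1e-6
d_o, d_h, o_h = deltas(w_h, w_o, x, t)

# dE/dw_o should equal -delta_o * o_h
num_o = (loss(w_h, w_o + eps, x, t) - loss(w_h, w_o - eps, x, t)) / (2 * eps)
print(abs(-d_o * o_h - num_o) < 1e-6)  # True

# dE/dw_h[0] should equal -delta_h * x[0]
num_h = (loss([w_h[0] + eps, w_h[1]], w_o, x, t)
         - loss([w_h[0] - eps, w_h[1]], w_o, x, t)) / (2 * eps)
print(abs(-d_h * x[0] - num_h) < 1e-6)  # True
```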
The first experiment evaluates the neural network under different hyper-parameter settings. The table below reports the resulting accuracies (%) on the simple data set.
| Learning rate | Hidden nodes | tanh, batch 10 | tanh, batch 50 | tanh, batch 100 | ReLU, batch 10 | ReLU, batch 50 | ReLU, batch 100 |
|---|---|---|---|---|---|---|---|
| 0.1 | 10 | 100 | 87 | 69 | 100 | 100 | 98 |
| 0.1 | 50 | 100 | 98 | 85 | 89 | 95 | 100 |
| 0.01 | 10 | 69 | 53 | 51 | 87 | 60 | 53 |
| 0.01 | 50 | 80 | 58 | 56 | 100 | 95 | 61 |
As we can see, multiple settings achieve 100% accuracy on the simple data set, and we make the following observations: