The triangle in 2 dimensions has VC dimension 7. To prove this we consider the following:
- The triangle cannot shatter a point set when some point is not a vertex of the convex hull. Separate the points into two groups: the vertices of the convex hull, $p_{out} ∈ P$, and the points inside the convex hull, $p_{in} ∈ P$. The triangle cannot realize the labeling in which all $p_{out}$ are positive and all $p_{in}$ are negative: if every $p_{out}$ is labeled positive, then every $p_{out}$ lies inside the triangle, so the whole convex hull lies inside the triangle. Since every $p_{in}$ lies inside the convex hull, every $p_{in}$ would then also be labeled positive, contradicting the labeling. Hence only point sets in convex position can be shattered.
- We are unable to shatter a convex hull of 8 vertices. Consider the convex hull of 8 vertices illustrated in fig1. No triangle can separate the configuration Positive: A, C, E, G; Negative: B, D, F, H. Intuitively, for this symmetric configuration each edge of the triangle can cut off at most one negative vertex, since the chord between any two negative vertices passes through the convex hull of the positive ones; three edges therefore cannot exclude all four negatives.
- We are able to shatter a convex hull of 7 vertices. If our convex hull has 7 vertices and is symmetric, as illustrated in fig1, each labeling (up to symmetry) is realized by the following configurations:
| Positive | Negative | Triangle |
|---|---|---|
| A | B,C,D,E,F,G | oga |
| A,B | C,D,E,F,G | ABo |
| A,C | B,D,E,F,G | ACo |
| A,D | B,C,E,F,G | ADp |
| A,B,C | D,E,F,G | ABC |
| A,B,D | C,E,F,G | ABD |
| A,B,E | C,D,F,G | ABE |
| A,C,E | B,D,F,G | ACE |
| A,B,C,D | E,F,G | ADY |
| A,B,C,E | D,F,G | ADB |
| A,B,D,F | C,E,G | aFD |
| A,B,C,D,E | F,G | AEc |
| A,B,C,D,F | E,G | AFc |
| A,B,C,D,E,F | G | eac |
Thus, we see that the triangle has VC dimension 7.
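The 8-vertex impossibility argument above can be sanity-checked numerically. The sketch below is my own code and assumes the fig1 configuration is a regular octagon (the point names and the sampling range are choices of mine, not taken from the assignment): it samples random triangles and checks that every triangle covering A, C, E, G also covers at least one of B, D, F, H.

```python
import math
import random

random.seed(1)
octagon = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]
pos = octagon[0::2]   # A, C, E, G
neg = octagon[1::2]   # B, D, F, H

def in_tri(p, a, b, c):
    # p lies inside triangle abc iff it is on the same side of all three edges
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 >= 0 and d2 >= 0 and d3 >= 0) or (d1 <= 0 and d2 <= 0 and d3 <= 0)

covering, violations = 0, 0
for _ in range(20000):
    tri = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(3)]
    if all(in_tri(p, *tri) for p in pos):
        covering += 1
        if not any(in_tri(p, *tri) for p in neg):
            violations += 1   # would mean a triangle realized the labeling
print("triangles covering all positives:", covering, "violations:", violations)
```

`violations` stays 0: the segment between any two of B, D, F, H passes through the convex hull of A, C, E, G, so no single triangle edge can exclude two negatives at once.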
The VC dimension of an axis-aligned hyper-rectangle in $n$ dimensions is $2n$, which follows from the reasoning below.
- There exists a configuration of $2n$ points that can be shattered by the hyper-rectangle.
For the $n$-dimensional case, we can use the following point configuration: each point lies on its own coordinate axis at distance one from the origin. So, for points $p_1, \dots, p_n$, the $j$-th component of the $i$-th point is
\[
(p_i)_j = \begin{cases}
1 & i = j \\
0 & i \neq j
\end{cases}
\]
and for points $p_{n+1}, \dots, p_{2n}$, the $j$-th component of the $i$-th point is
\[
(p_{n+i})_j = \begin{cases}
-1 & i = j \\
0 & i \neq j
\end{cases}
\]
Thus, for a labeling $l ∈ \{+,-\}^{2n}$, define
\[
f(l) ≡ \begin{cases}
1 & l=+ \\
0 & l=-
\end{cases}
\]
and choose the hyper-rectangle $\prod_{j=1}^{n} [-f(l_{n+j}),\, f(l_j)]$, which contains exactly the points labeled $+$.
Thus, the hyper-rectangle has VC dimension at least $2n$.
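This construction can be verified by brute force for small $n$. The sketch below is my own code (with $n=3$ chosen for illustration): for each labeling it builds the rectangle whose extent along axis $j$ is $[-f(l_{n+j}), f(l_j)]$ and checks that it contains exactly the positive points.

```python
import itertools

n = 3
# p_1..p_n = +e_i and p_{n+1}..p_{2n} = -e_i: one point per axis, distance 1
points = [[1 if j == i else 0 for j in range(n)] for i in range(n)] \
       + [[-1 if j == i else 0 for j in range(n)] for i in range(n)]

def in_box(p, lo, hi):
    return all(lo[j] <= p[j] <= hi[j] for j in range(n))

shattered = True
for labels in itertools.product([0, 1], repeat=2 * n):   # 1 means '+', 0 means '-'
    hi = [labels[j] for j in range(n)]        # upper bound along axis j:  f(l_j)
    lo = [-labels[n + j] for j in range(n)]   # lower bound along axis j: -f(l_{n+j})
    if any(in_box(p, lo, hi) != bool(lab) for p, lab in zip(points, labels)):
        shattered = False
print("all", 2 ** (2 * n), "labelings realized:", shattered)  # True
```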
- The hyper-rectangle in $n$ dimensions cannot shatter any configuration of $2n+1$ points. For any such configuration, there is no hyper-rectangle realizing the following labeling:
- label all the points that dominate on at least one axis $+$:
$$∃ j,\; \vec{p}_j ≥ \vec{p'}_j\; ∀ p' ∈ P$$
- label all the points that are dominated on at least one axis $+$:
$$∃ j,\; \vec{p}_j ≤ \vec{p'}_j\; ∀ p' ∈ P$$
- label the rest of the points $-$
Since at most $n$ points dominate and at most $n$ points are dominated, at least one point is labeled $-$. If the hyper-rectangle correctly labels all the positive points, it contains the coordinate-wise maximum and minimum of the set along every axis, and therefore contains every point; so it would label every point $+$, including the point labeled $-$. Hence it labels at least one point wrong, and no configuration of $2n+1$ points can be shattered. Thus, the VC dimension of the hyper-rectangle in $n$ dimensions is less than $2n+1$.
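The upper-bound argument can also be checked empirically. This sketch (my own code; random points in $n=3$ dimensions) labels the coordinate-wise extremal points $+$ and confirms that the smallest box containing them always swallows some leftover $-$ point, so the adversarial labeling is never realizable.

```python
import random

n, trials = 3, 100
random.seed(0)
failures = 0
for _ in range(trials):
    pts = [[random.random() for _ in range(n)] for _ in range(2 * n + 1)]
    # the coordinate-wise extremal points get labeled '+' (at most 2n of them)
    pos = set()
    for j in range(n):
        pos.add(max(range(len(pts)), key=lambda i: pts[i][j]))
        pos.add(min(range(len(pts)), key=lambda i: pts[i][j]))
    neg = [i for i in range(len(pts)) if i not in pos]
    # smallest box containing the positives
    lo = [min(pts[i][j] for i in pos) for j in range(n)]
    hi = [max(pts[i][j] for i in pos) for j in range(n)]
    # a leftover '-' point must exist and must lie inside that box
    if not neg or not all(lo[j] <= pts[i][j] <= hi[j] for i in neg for j in range(n)):
        failures += 1
print("trials where the labeling was realizable:", failures)  # 0
```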
The dual representation of the perceptron is
\[
f(x) = \mathrm{sign}\left(∑_i α_i y_i K(x_i, x)\right),
\]
where $α_i$ is the number of mistakes the perceptron has made on training example $x_i$ and $K$ is the kernel.
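As a concrete illustration of this dual form, here is a minimal kernel perceptron sketch. The code, the degree-2 polynomial kernel, and the XOR-style data are my own choices for illustration, not taken from the assignment.

```python
def kernel_perceptron_train(xs, ys, kernel, epochs=10):
    """Dual perceptron: keep a mistake count alpha_i for each training example."""
    alpha = [0] * len(xs)
    for _ in range(epochs):
        for t in range(len(xs)):
            s = sum(a * y * kernel(xi, xs[t]) for a, xi, y in zip(alpha, xs, ys))
            if ys[t] * s <= 0:      # mistake: remember this example
                alpha[t] += 1
    return alpha

def predict(alpha, xs, ys, kernel, x):
    s = sum(a * y * kernel(xi, x) for a, xi, y in zip(alpha, xs, ys))
    return 1 if s > 0 else -1

# a polynomial of a valid kernel (non-negative coefficients) is again valid
poly = lambda u, v: (1 + sum(ui * vi for ui, vi in zip(u, v))) ** 2

# XOR-like data: not linearly separable in input space, separable under poly
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [-1, 1, 1, -1]
alpha = kernel_perceptron_train(xs, ys, poly, epochs=10)
preds = [predict(alpha, xs, ys, poly, x) for x in xs]
print(preds)  # [-1, 1, 1, -1]
```

The mistake counts $α_i$ play the role of the dual variables: the weight vector is never formed explicitly, and only kernel evaluations are needed.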
Since any polynomial of a valid kernel with non-negative coefficients is also a valid kernel, the kernels constructed below are valid.
To represent monotone DNF formulas bounded by $K$ terms, we need the following feature space.
The target function can then be expressed as a linear combination of these features.
Thus, we can find a kernel that maps to this space.
The kernel we propose can be expressed as follows, and it maps to the space above.
Following the book (Machine Learning, Tom Mitchell), I use the following notation for the neural network:
- $x_{ji}$ = the $i$th input to unit $j$
- $w_{ji}$ = the weight associated with the $i$th input to unit $j$
- $net_j = ∑_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit $j$)
- $o_j$ = the output computed by unit $j$
- $t_j$ = the target output for unit $j$
- $R(x)$ = the ReLU function
- $outputs$ = the set of units in the final layer of the network
- $Downstream(j)$ = the set of units whose immediate inputs include the output of unit $j$
\begin{eqnarray*}
\frac{∂ E_d}{∂ o_j} & = & \frac{∂}{∂ o_j}\frac{1}{2}∑_{k ∈ outputs}(t_k-o_k)^2 \\
& = & -(t_j-o_j) \\
\frac{∂ o_j}{∂ net_j} & = & \frac{∂ R(net_j)}{∂ net_j} \\
& = & \frac{\max(o_j, 0)}{o_j} \\
\frac{∂ E_d}{∂ net_j} & = & \frac{∂ E_d}{∂ o_j}\frac{∂ o_j}{∂ net_j} \\
& = & -\frac{(t_j-o_j)\max(o_j, 0)}{o_j} \\
Δ w_{ji} & = & -η \frac{∂ E_d}{∂ w_{ji}} \\
& = & η \frac{(t_j-o_j)\max(o_j, 0)}{o_j}x_{ji}
\end{eqnarray*}
Training Rule for hidden unit weights
The same steps apply to the hidden units; thus, we have the following equations:
\begin{eqnarray*}
\frac{∂ E_d}{∂ net_j} & = & ∑\limits_{k ∈ Downstream(j)} -δ_k w_{kj} \frac{\max(o_j, 0)}{o_j} \\
δ_j & = & \frac{\max(o_j,0)}{o_j} ∑\limits_{k ∈ Downstream(j)} δ_k w_{kj} \\
Δ w_{ji} & = & η δ_j x_{ji}
\end{eqnarray*}
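These update rules can be checked numerically. The sketch below is my own code on a made-up two-unit network (one hidden ReLU unit feeding one ReLU output unit); it computes the deltas as in the equations above, implementing the factor $\max(o_j,0)/o_j$ as the indicator $net_j > 0$, and compares $∂E_d/∂w$ against a central finite-difference estimate.

```python
def relu(z):
    return max(z, 0.0)

def relu_grad(z):
    # max(o_j, 0)/o_j equals 1 when the unit is active and 0 otherwise
    return 1.0 if z > 0 else 0.0

# minimal network: inputs x -> hidden ReLU unit h -> ReLU output unit o
def forward(w_h, w_o, x):
    net_h = sum(wi * xi for wi, xi in zip(w_h, x))
    o_h = relu(net_h)
    net_o = w_o * o_h
    return net_h, o_h, net_o, relu(net_o)

def loss(w_h, w_o, x, t):
    return 0.5 * (t - forward(w_h, w_o, x)[3]) ** 2

def deltas(w_h, w_o, x, t):
    net_h, o_h, net_o, o = forward(w_h, w_o, x)
    d_o = (t - o) * relu_grad(net_o)       # output unit: (t_j - o_j) R'(net_j)
    d_h = relu_grad(net_h) * d_o * w_o     # hidden unit: R'(net_j) sum_k delta_k w_kj
    return d_o, d_h, o_h

# made-up weights/input chosen so both units are active
w_h, w_o, x, t, eps = [0.3, 0.2], 0.5, [1.0, 2.0], 1.0, 1e-6
d_o, d_h, o_h = deltas(w_h, w_o, x, t)

# dE/dw_o should equal -delta_o * o_h
num_o = (loss(w_h, w_o + eps, x, t) - loss(w_h, w_o - eps, x, t)) / (2 * eps)
print(abs(-d_o * o_h - num_o) < 1e-6)  # True

# dE/dw_h[0] should equal -delta_h * x[0]
num_h = (loss([w_h[0] + eps, w_h[1]], w_o, x, t)
         - loss([w_h[0] - eps, w_h[1]], w_o, x, t)) / (2 * eps)
print(abs(-d_h * x[0] - num_h) < 1e-6)  # True
```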
The first experiment evaluates the neural network under different hyper-parameter settings. The table below reports the resulting accuracies (%) on the simple data set.
| Learning rate | Hidden nodes | tanh, batch 10 | tanh, batch 50 | tanh, batch 100 | ReLU, batch 10 | ReLU, batch 50 | ReLU, batch 100 |
|---|---|---|---|---|---|---|---|
| 0.1 | 10 | 100 | 87 | 69 | 100 | 100 | 98 |
| 0.1 | 50 | 100 | 98 | 85 | 89 | 95 | 100 |
| 0.01 | 10 | 69 | 53 | 51 | 87 | 60 | 53 |
| 0.01 | 50 | 80 | 58 | 56 | 100 | 95 | 61 |
As we can see, multiple settings achieve 100% accuracy on the simple data set, and we make the following observations: