This package aims to implement a series of operations first described in Grohs et al. (2023), Petersen and Voigtlaender (2018), Grohs, Jentzen, and Salimova (2022), Jentzen, Kuckuck, and Wurstemberger (2023), and a broad extension to that framework of operations as seen in Rafi, Padgett, and Nakarmi (2024). Our main definitions will be from Rafi, Padgett, and Nakarmi (2024), but we will also delve deeper into the literature when necessary.
Our definition of neural networks will be ordered tuples of ordered pairs. A neural network is something like \(((W_1,b_1),(W_2,b_2), (W_3,b_3))\), where each \(W_i\) is a weight matrix and each \(b_i\) is a bias vector. You may create a neural network by specifying a list that indicates the number of neurons in each layer.
layer_architecture = c(4,5,6,5)
create_nn(layer_architecture)
#> [[1]]
#> [[1]]$W
#> [,1] [,2] [,3] [,4]
#> [1,] 0.85140551 1.0236541 -0.8268231 0.3083593
#> [2,] 1.50691130 0.2038044 -0.2377545 -2.9352571
#> [3,] -0.73066553 -0.9679651 -0.1427269 -0.1825402
#> [4,] -0.07630063 1.3203687 0.3892640 0.6214547
#> [5,] -0.73016343 -0.2305575 -0.4691877 0.2473949
#>
#> [[1]]$b
#> [,1]
#> [1,] 1.7135053
#> [2,] 0.6213685
#> [3,] 1.0899444
#> [4,] 0.8488450
#> [5,] -0.2936087
#>
#>
#> [[2]]
#> [[2]]$W
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] -1.87545287 1.64854447 -1.2096510 0.4229130 0.2099359
#> [2,] 0.06281382 -0.02134632 0.6324472 0.3117254 -1.3206046
#> [3,] 0.24629255 1.63019104 -0.3413203 0.4736401 -0.3721702
#> [4,] 0.46896559 0.13932554 -1.6105054 0.8181013 0.6419611
#> [5,] -0.80069104 1.93395324 -0.4264567 -2.4306414 -0.2599279
#> [6,] -1.24317782 0.14920674 0.1831347 0.1094383 -0.8187282
#>
#> [[2]]$b
#> [,1]
#> [1,] -0.2112726
#> [2,] 0.4037090
#> [3,] 0.9648242
#> [4,] 2.2083125
#> [5,] 1.1659607
#> [6,] 0.6493405
#>
#>
#> [[3]]
#> [[3]]$W
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] -0.1653070 -0.5835176 -0.011678293 -0.1931576 0.69920273 0.2858378
#> [2,] -0.7725770 -2.0002639 0.317312830 0.7453666 -0.02718446 2.0467754
#> [3,] -0.8695696 0.1255744 -0.008505474 -0.4816421 0.87495184 -1.3960953
#> [4,] 0.5675979 -1.0791620 -0.259826015 -0.8891931 -1.12603454 0.3489476
#> [5,] -0.3657891 -0.2361163 2.196284752 -0.4710113 0.72754301 0.8477316
#>
#> [[3]]$b
#> [,1]
#> [1,] 1.0028620
#> [2,] 0.4516713
#> [3,] 1.0656108
#> [4,] -1.1100011
#> [5,] 2.2637647
Note that each weight matrix and bias vector will be populated with draws from a standard normal distribution. Associated with each neural network will be a family of functions:
dep
: the depth of the neural network.
param
: the number of parameters, i.e. the sum of the total number of elements in each weight matrix and bias vector.
inn
: the width of the input layer.
out
: the width of the output layer.
hid
: the number of hidden layers, always defined to be one less than the depth.
lay
: a list specifying the width of each layer, often called the layer architecture.
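For example, here is a quick sketch applying these helpers to the network created above (the calls assume each helper takes the network as its only argument; the counts in the comments follow the definitions just given):
nn <- create_nn(layer_architecture)
inn(nn)    # input width: 4
out(nn)    # output width: 5
lay(nn)    # the layer architecture: 4, 5, 6, 5
hid(nn)    # number of hidden layers: the layers of width 5 and 6
dep(nn)    # depth: one more than the number of hidden layers
param(nn)  # parameter count: (5*4 + 5) + (6*5 + 6) + (5*6 + 5) = 96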
Instantiation refers to applying an activation function between consecutive layers, turning the tuple of weights and biases into a continuous function. Only three activation functions have been implemented: ReLU, Sigmoid, and Tanh.
Because the current theoretical frameworks of Rafi, Padgett, and Nakarmi (2024), Jentzen, Kuckuck, and Wurstemberger (2023), and Grohs et al. (2023) only prove results for ReLU activation, our Xpn, Sne, and Csn, among others, will only yield valid approximations under ReLU instantiation.
Instantiation must always be accompanied by a vector \(x\) of the same length as the input layer of the instantiated neural network. See the example:
create_nn(c(1, 3, 5, 6)) |> inst(ReLU, 8)
#> [,1]
#> [1,] -3.1458857
#> [2,] -3.2120874
#> [3,] -0.4812446
#> [4,] 0.9645233
#> [5,] 0.9013140
#> [6,] -5.9451107
Instantiation will have a special symbol \(\mathfrak{I}\), and instantiation with ReLU will be denoted as \(\mathfrak{I}_{\mathsf{ReLU}}\).
Composition has the wonderful property that it works well with instantiation, i.e. the instantiation of two composed neural networks is the same as the composition of the individually instantiated continuous functions, i.e.
\[ \mathfrak{I}_{\mathsf{ReLU}} \left( \nu_1 \bullet \nu_2\right)(x) = \left(\mathfrak{I}_{\mathsf{ReLU}}\left( \nu_1\right) \circ \mathfrak{I}_{\mathsf{ReLU}}\left( \nu_2\right)\right)(x) \]
Note: When composing, the output layer width of the innermost neural network must match the input layer width of the outer neural network.
c(1,5,6,3,3) |> create_nn() -> nu_1
c(3,4,6,3,2) |> create_nn() -> nu_2
nu_2 |> comp(nu_1)
#> [[1]]
#> [[1]]$W
#> [,1]
#> [1,] -0.9855632
#> [2,] -1.4865799
#> [3,] -1.2355514
#> [4,] 0.3341896
#> [5,] -1.2005095
#>
#> [[1]]$b
#> [,1]
#> [1,] 0.8807130
#> [2,] 0.2512977
#> [3,] -0.5231943
#> [4,] -0.2781542
#> [5,] -0.2836871
#>
#>
#> [[2]]
#> [[2]]$W
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] -0.3055717 2.1506307 -0.52261280 -1.21737288 -0.6695789
#> [2,] -0.8604663 0.1177763 -0.03084026 -0.45080752 -1.6310799
#> [3,] 2.5922746 0.4008101 -0.82438082 -0.04470601 -0.3125987
#> [4,] 0.5597969 -1.0817910 0.18796867 1.01674722 1.2508699
#> [5,] -0.5599006 -1.5737679 -0.11866320 2.30998627 0.2252668
#> [6,] 2.4691639 -0.6180974 3.01875898 -0.26827873 -0.9088813
#>
#> [[2]]$b
#> [,1]
#> [1,] 0.7886517
#> [2,] -0.4919067
#> [3,] 0.4477928
#> [4,] 0.3254710
#> [5,] -1.7181613
#> [6,] 0.1117790
#>
#>
#> [[3]]
#> [[3]]$W
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.2234445 -0.5695317 0.3559395 1.1096469 -0.6771058 0.7035657
#> [2,] 0.1299551 1.5532052 -0.5529983 -0.7975693 -0.0694404 0.3613289
#> [3,] 0.4779458 0.4065353 -0.8117675 -0.9868587 1.2439006 -0.7472673
#>
#> [[3]]$b
#> [,1]
#> [1,] -0.3964576
#> [2,] 1.7072412
#> [3,] 0.2984478
#>
#>
#> [[4]]
#> [[4]]$W
#> [,1] [,2] [,3]
#> [1,] -0.223680372 0.054078315 -1.2147813
#> [2,] -0.960687459 2.137340614 -1.5871052
#> [3,] 0.866691218 0.009822721 0.7965998
#> [4,] 0.005262142 -1.389131151 0.4893646
#>
#> [[4]]$b
#> [,1]
#> [1,] -0.70987165
#> [2,] 2.62558875
#> [3,] 0.09056323
#> [4,] 0.09895170
#>
#>
#> [[5]]
#> [[5]]$W
#> [,1] [,2] [,3] [,4]
#> [1,] -1.4643334 -0.6582374 -1.22302520 0.2992340
#> [2,] -1.3282413 0.2351269 -1.26016591 0.1575979
#> [3,] -0.7227838 -1.0478097 0.72466098 -1.4400574
#> [4,] -0.5859589 -1.2292912 0.06233431 -1.6657332
#> [5,] -1.0163430 -0.7884800 0.26128646 -1.1905279
#> [6,] 0.1819553 -0.1200919 -1.20793087 0.9774257
#>
#> [[5]]$b
#> [,1]
#> [1,] -0.7454942
#> [2,] 1.2403936
#> [3,] 1.6089310
#> [4,] -0.2729380
#> [5,] 1.0807913
#> [6,] 0.4677815
#>
#>
#> [[6]]
#> [[6]]$W
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.6103693 1.996691 0.68532043 2.1172797 -0.02596058 -0.9844214
#> [2,] 0.7507909 -1.030310 -0.07362092 -1.0993368 -1.51604744 -0.6425123
#> [3,] -1.1709994 2.162050 0.20287539 -0.2184209 0.29909535 1.2748073
#>
#> [[6]]$b
#> [,1]
#> [1,] 0.1430509
#> [2,] 0.9202284
#> [3,] -1.0168783
#>
#>
#> [[7]]
#> [[7]]$W
#> [,1] [,2] [,3]
#> [1,] -0.5965176 -0.8644789 -1.586006
#> [2,] 0.2476866 1.2156654 -2.022209
#>
#> [[7]]$b
#> [,1]
#> [1,] 0.2531036
#> [2,] -1.5226577
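As a quick sanity check of the identity above, instantiating the composed network and chaining the two individual instantiations should give the same value (a sketch using the networks defined above; as.vector only flattens the intermediate output so it can be fed back in as an input vector):
x <- 0.5
(nu_2 |> comp(nu_1)) |> inst(ReLU, x)
nu_2 |> inst(ReLU, as.vector(nu_1 |> inst(ReLU, x)))   # should match the previous line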
Given a neural network, you may perform scalar left and right multiplication on it. These interact with instantiation in quite nice ways. If the neural network instantiates as \(\mathfrak{I}_{\mathsf{ReLU}}\left( \nu \right)(x)\), scalar left multiplication instantiates as:
\[ \mathfrak{I}_{\mathsf{ReLU}} \left( \lambda \triangleright \nu\right)(x) = \lambda \cdot \mathfrak{I}_{\mathsf{ReLU}}\left( \nu\right)(x) \]
Scalar right multiplication instantiates as:
\[ \mathfrak{I}_{\mathsf{ReLU}} \left(\nu\triangleleft \lambda\right)(x) = \mathfrak{I}_{\mathsf{ReLU}}\left(\nu\right)(\lambda \cdot x) \]
Here is the R code for scalar left multiplication; compare the two outputs:
c(1,3,4,8,1) |> create_nn() -> nn
nn |> inst(ReLU,5)
#> [,1]
#> [1,] 2.057269
2 |> slm(nn) |> inst(ReLU, 5)
#> [,1]
#> [1,] 4.114537
And now for scalar right multiplication, although the difference in output is not as obvious:
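Here is a sketch of that comparison, assuming the package exposes scalar right multiplication as srm (the name and argument order are assumptions):
# srm is assumed to implement nu ◁ lambda, taking the network first
nn |> srm(2) |> inst(ReLU, 5)   # should equal nn instantiated at the scaled input 2 * 5
nn |> inst(ReLU, 2 * 5)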
Neural networks may also be stacked on top of each other. The instantiation of two stacked neural networks is the concatenation of the outputs of the individually instantiated networks. In other words, mathematically we may say that:
\[ \mathfrak{I}_{\mathsf{ReLU}}\left( \nu_1 \boxminus \nu_2\right)\left( \left[x \quad y\right]^\intercal\right) = \left[ \mathfrak{I}_{\mathsf{ReLU}}\left( \nu_1\right) \left( x\right)\quad \mathfrak{I}_{\mathsf{ReLU}}\left( \nu_2\right)\left( y\right)\right]^\intercal \]
To make this more concrete observe that:
c(3,4,6,3,7,1,3,4) |> create_nn() -> nn_1
c(2,6,4,5) |> create_nn() -> nn_2
(nn_1 |> stk(nn_2)) |> inst(ReLU, c(4,3,2,1,6))
#> [,1]
#> [1,] -0.9189996
#> [2,] 0.1653912
#> [3,] 1.6666090
#> [4,] 0.8108034
#> [5,] 13.9844870
#> [6,] 18.8487898
#> [7,] -27.2170482
#> [8,] 16.9969090
#> [9,] 35.7216422
print("Compare to:")
#> [1] "Compare to:"
nn_1 |> inst(ReLU, c(4,3,2))
#> [,1]
#> [1,] -0.9189996
#> [2,] 0.1653912
#> [3,] 1.6666090
#> [4,] 0.8108034
nn_2 |> inst(ReLU, c(1,6))
#> [,1]
#> [1,] 13.98449
#> [2,] 18.84879
#> [3,] -27.21705
#> [4,] 16.99691
#> [5,] 35.72164
Note: stacking networks of unequal depth is handled automatically by a process called “tunneling”.
Neural networks may be added such that the continuous function created under ReLU instantiation is the sum of the individual instantiated functions, i.e.
\[ \mathfrak{I}_{\mathsf{ReLU}}\left( \nu \oplus \mu\right)\left( x\right) = \mathfrak{I}_{\mathsf{ReLU}}\left( \nu \right)(x) + \mathfrak{I}_{\mathsf{ReLU}}\left(\mu\right)\left( x\right) \]
The code works as follows:
c(1,5,3,2,1) |> create_nn() -> nu
c(1,5,4,9,1) |> create_nn() -> mu
nu |> inst(ReLU,4) -> x_1
mu |> inst(ReLU,4) -> x_2
x_1 + x_2 -> result_1
print("The sum of the instantiated neural network is:")
#> [1] "The sum of the instantiated neural network is:"
print(result_1)
#> [,1]
#> [1,] 29.01021
(nu |> nn_sum(mu)) |> inst(ReLU,4) -> result_2
print("The instation of their neural network sums")
#> [1] "The instation of their neural network sums"
print(result_2)
#> [,1]
#> [1,] 29.01021
Now that we have the basic operations for neural networks, we can move on to more sophisticated functions. We start with some basic approximators.
We have neural networks for approximating the square of any real number. These only work under ReLU instantiation; most results in this field, and indeed most of the literature, focus on ReLU.
These must be supplied with two arguments, \(q\in (2,\infty)\) and \(\varepsilon \in (0,\infty)\). Accuracy increases as \(q\) approaches \(2\) and as \(\varepsilon\) approaches \(0\).
See the examples:
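A minimal sketch, assuming the squaring network is exposed as Sqr with arguments q and \(\varepsilon\) in that order (the name and argument order are assumptions):
Sqr(2.1, 0.1) |> inst(ReLU, 3)   # should be close to 3^2 = 9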
Similarly, we may define the product neural network, which approximates the product of two real numbers, given \(q \in (2,\infty)\) and \(\varepsilon \in (0,\infty)\). Accuracy again increases as \(q\) approaches \(2\) and as \(\varepsilon\) approaches \(0\). These must be instantiated with a list of two real numbers:
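A minimal sketch, assuming the product network is exposed as Prd with arguments q and \(\varepsilon\) in that order (again, the name and argument order are assumptions):
Prd(2.1, 0.1) |> inst(ReLU, c(4, 5))   # should be close to 4 * 5 = 20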
Repeated applications of \(\mathsf{Prd}^{q,\varepsilon}\) give us neural networks for approximating raising to a power.
These are built using the \(\mathsf{Pwr}^{q,\varepsilon}\) neural networks. They need three arguments: \(q\in (2,\infty)\), \(\varepsilon \in (0,\infty)\), and \(n \in \mathbb{N}\), the power being approximated. This may require significant computation time, depending on the power. Here is an example:
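A minimal sketch, assuming the power network is exposed as Pwr with arguments q, \(\varepsilon\), and n in that order (the name and argument order are assumptions):
Pwr(2.1, 0.1, 3) |> inst(ReLU, 2)   # should be close to 2^3 = 8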
We can do neural network sums, scalar left multiplication, and raising to a power. We thus have enough technology to create neural network polynomials, and more specifically finite power series approximations of common functions.
This is the neural network for approximating \(e^x\). It needs three arguments: \(q\in (2,\infty)\), \(\varepsilon \in (0,\infty)\), and \(n \in \mathbb{N}\), the power at which the power series expansion is truncated. This may require significant computation time, depending on the truncation power. Here is an example of the code:
Xpn(5,2.1,0.1) |> inst(ReLU, 2)
#> Power series approximation added for power: 1
#> Power series approximation added for power: 2
#> Power series approximation added for power: 3
#> Power series approximation added for power: 4
#> Power series approximation added for power: 5
#> [,1]
#> [1,] 7.021291
print("Compare to:")
#> [1] "Compare to:"
exp(2)
#> [1] 7.389056
By far the biggest improvement in accuracy will come from increasing the power series truncation limit. This will also contribute the most to computation time.
The length of computation time presents a challenge at this point. Steps have been taken to vectorize the code as much as possible but future work may need to be done in the direction of reimplementing this using Rcpp.
This is the neural network for approximating \(\cos(x)\). It needs three arguments: \(q\in (2,\infty)\), \(\varepsilon \in (0,\infty)\), and \(n \in \mathbb{N}\), the power at which the power series expansion is truncated. This may require significant computation time, depending on the truncation power. Here is an example of the code:
Csn(3,2.1,0.1) |> inst(ReLU, 0.4)
#> Power series approximation added for power: 2
#> Power series approximation added for power: 4
#> Power series approximation added for power: 6
#> [,1]
#> [1,] 0.9452746
print("Compare to:")
#> [1] "Compare to:"
cos(0.4)
#> [1] 0.921061
By far the biggest improvement in accuracy will come from increasing the power series truncation limit. This will also contribute the most to computation time.
The length of computation time presents a challenge at this point. Steps have been taken to vectorize the code as much as possible but future work may need to be done in the direction of reimplementing this using Rcpp.
This is the neural network for approximating \(\sin(x)\). It needs three arguments: \(q\in (2,\infty)\), \(\varepsilon \in (0,\infty)\), and \(n \in \mathbb{N}\), the power at which the power series expansion is truncated. This may require significant computation time, depending on the truncation power. Here is an example of the code:
Sne(3,2.1,0.1) |> inst(ReLU, 0.4)
#> Power series approximation added for power: 2
#> Power series approximation added for power: 4
#> Power series approximation added for power: 6
#> [,1]
#> [1,] 0.3887839
print("Compare to:")
#> [1] "Compare to:"
sin(0.4)
#> [1] 0.3894183
By far the biggest improvement in accuracy will come from increasing the power series truncation limit. This will also contribute the most to computation time.
The length of computation time presents a challenge at this point. Steps have been taken to vectorize the code as much as possible but future work may need to be done in the direction of reimplementing this using Rcpp.
A simple trapezoidal approximation can be done with neural networks, given two sample points along the two legs of a trapezoid. These may be instantiated with any of the three activation functions implemented: ReLU, Sigmoid, or Tanh. Observe:
h = 0.2
mesh = c(0,0+h)
samples = sin(mesh)
Trp(0.1) |> inst(ReLU, samples)
#> [,1]
#> [1,] 0.009933467
Trp(0.1) |> inst(Sigmoid, samples)
#> [,1]
#> [1,] 0.009933467
Trp(0.1) |> inst(Tanh, samples)
#> [,1]
#> [1,] 0.009933467
Now we may extend this to what we call the “extended trapezoid”, which approximates the area under a curve \(f(x)\) using the trapezoidal rule once instantiated with function samples taken over the whole mesh. These require two parameters: \(N \in \mathbb{N}\), the number of trapezoids into which we split the area under \(f(x)\), and \(h\), the mesh width. Note that there is always one fewer trapezoid than the number of mesh points.
These may be instantiated with any of the three activation functions implemented: ReLU, Sigmoid, or Tanh.
Observe:
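A sketch, assuming the extended trapezoid network is exposed as Etr(N, h) with N the number of trapezoids and h the mesh width (the name and argument order are assumptions):
h <- 0.1
mesh <- seq(0, 1, by = h)    # 11 mesh points, hence 10 trapezoids
samples <- sin(mesh)
Etr(length(mesh) - 1, h) |> inst(ReLU, samples)   # should be close to 1 - cos(1), about 0.46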
Suppose you have a function \(f(x):[a,b] \rightarrow \mathbb{R}\) with a Lipschitz constant \(L\). A global Lipschitz constant is not necessary; \(L\) may simply be an upper bound on the absolute value of the slope of the function over \([a,b]\).
Take sample points \(\{x_1,x_2,...,x_n\}\) within the domain. Then let \(f_1,f_2,...,f_n\) be a family of functions defined for all \(i \in \{1,2,...,n\}\) as: \[ f_i(x) = f(x_i) - L|x-x_i|_1 \]
Each \(f_i\) creates a little “hill” whose tip sits on the function at the sample point it was taken from. Taking the pointwise maximum over these hills “saws off” their bases, giving a sawtooth-like function that approximates our function \(f(x)\), i.e.
\[ \hat{f} (x) = \max_{i \in \{1,2,...,n\}} \left\{ f_i \left( x\right)\right\} \]
We will call this the “maximum convolution approximation”. These must be instantiated with ReLU for maximum accuracy.
These are implemented in this package as follows:
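Here is a sketch of how the maximum convolution approximation might be built, assuming a constructor MC(X, y, L) that takes the sample points, the sampled function values, and the Lipschitz bound (the name and signature are assumptions):
X <- seq(0, pi, length.out = 50)   # sample points in the domain
y <- sin(X)                        # f evaluated at the sample points
L <- 1                             # Lipschitz bound for sin on [0, pi]
MC(X, y, L) |> inst(ReLU, 0.5)     # should be close to sin(0.5), about 0.479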