ChatGPT解决这个技术问题 Extra ChatGPT

multi-layer perceptron (MLP) architecture: criteria for choosing number of hidden layers and size of the hidden layer? [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Want to improve this question? Update the question so it can be answered with facts and citations by editing this post. Closed last year. Improve this question

If we have 10 eigenvectors then we can have 10 neural nodes in input layer.If we have 5 output classes then we can have 5 nodes in output layer.But what is the criteria for choosing number of hidden layer in a MLP and how many neural nodes in 1 hidden layer?


s
styvane

how many hidden layers?

a model with zero hidden layers will resolve linearly separable data. So unless you already know your data isn't linearly separable, it doesn't hurt to verify this--why use a more complex model than the task requires? If it is linearly separable then a simpler technique will work, but a Perceptron will do the job as well.

Assuming your data does require separation by a non-linear technique, then always start with one hidden layer. Almost certainly that's all you will need. If your data is separable using a MLP, then that MLP probably only needs a single hidden layer. There is theoretical justification for this, but my reason is purely empirical: Many difficult classification/regression problems are solved using single-hidden-layer MLPs, yet I don't recall encountering any multiple-hidden-layer MLPs used to successfully model data--whether on ML bulletin boards, ML Textbooks, academic papers, etc. They exist, certainly, but the circumstances that justify their use is empirically quite rare.

How many nodes in the hidden layer?

From the MLP academic literature. my own experience, etc., I have gathered and often rely upon several rules of thumb (RoT), and which I have also found to be reliable guides (ie., the guidance was accurate, and even when it wasn't, it was usually clear what to do next):

RoT based on improving convergence:

When you begin the model building, err on the side of more nodes in the hidden layer.

Why? First, a few extra nodes in the hidden layer isn't likely do any any harm--your MLP will still converge. On the other hand, too few nodes in the hidden layer can prevent convergence. Think of it this way, additional nodes provides some excess capacity--additional weights to store/release signal to the network during iteration (training, or model building). Second, if you begin with additional nodes in your hidden layer, then it's easy to prune them later (during iteration progress). This is common and there are diagnostic techniques to assist you (e.g., Hinton Diagram, which is just a visual depiction of the weight matrices, a 'heat map' of the weight values,).

RoTs based on size of input layer and size of output layer:

A rule of thumb is for the size of this [hidden] layer to be somewhere between the input layer size ... and the output layer size.... To calculate the number of hidden nodes we use a general rule of: (Number of inputs + outputs) x 2/3

RoT based on principal components:

Typically, we specify as many hidden nodes as dimensions [principal components] needed to capture 70-90% of the variance of the input data set.

And yet the NN FAQ author calls these Rules "nonsense" (literally) because they: ignore the number of training instances, the noise in the targets (values of the response variables), and the complexity of the feature space.

In his view (and it always seemed to me that he knows what he's talking about), choose the number of neurons in the hidden layer based on whether your MLP includes some form of regularization, or early stopping.

The only valid technique for optimizing the number of neurons in the Hidden Layer:

During your model building, test obsessively; testing will reveal the signatures of "incorrect" network architecture. For instance, if you begin with an MLP having a hidden layer comprised of a small number of nodes (which you will gradually increase as needed, based on test results) your training and generalization error will both be high caused by bias and underfitting.

Then increase the number of nodes in the hidden layer, one at a time, until the generalization error begins to increase, this time due to overfitting and high variance.

In practice, I do it this way:

input layer: the size of my data vactor (the number of features in my model) + 1 for the bias node and not including the response variable, of course

output layer: soley determined by my model: regression (one node) versus classification (number of nodes equivalent to the number of classes, assuming softmax)

hidden layer: to start, one hidden layer with a number of nodes equal to the size of the input layer. The "ideal" size is more likely to be smaller (i.e, some number of nodes between the number in the input layer and the number in the output layer) rather than larger--again, this is just an empirical observation, and the bulk of this observation is my own experience. If the project justified the additional time required, then I start with a single hidden layer comprised of a small number of nodes, then (as i explained just above) I add nodes to the Hidden Layer, one at a time, while calculating the generalization error, training error, bias, and variance. When generalization error has dipped and just before it begins to increase again, the number of nodes at that point is my choice. See figure below.

https://i.stack.imgur.com/LO1Id.png


I would like to add some related results regarding the #1 RoT: In the successful SVMs you actually map your input to a higher dimensional space (more hidden nodes than nodes in the input layer in NN parlance). The output layer's job is to get the decision from this over-complete representation. There might be a connection to Random Projections as well. The brilliant paper of Adam Coates & Andrew Y. Ng (2011) discusses related topics.
Nice explanation. Any idea on how I could plot a figure like the above when using sklearn and MLPClassifier ?
@sera you mean xkcd style?
In principle, could you automate the process of optimizing the number of neurons in the hidden layer? Also, could you optimize the number of hidden layers automatically too?
J
Justas

To automate the selection of the best number of layers and best number of neurons for each of the layers, you can use genetic optimization.

The key pieces would be:

Chromosome: Vector that defines how many units in each hidden layer (e.g. [20,5,1,0,0] meaning 20 units in first hidden layer, 5 in second, ... , with layers 4 and 5 missing). You can set a limit on the maximum number number of layers to try, and the max number of units in each layer. You should also place restrictions of how the chromosomes are generated. E.g. [10, 0, 3, ... ] should not be generated, because any units after a missing layer (the '3,...') would be irrelevant and would waste evaluation cycles. Fitness Function: A function that returns the reciprocal of the lowest training error in the cross-validation set of a network defined by a given chromosome. You could also include the number of total units, or computation time if you want to find the "smallest/fastest yet most accurate network".

You can also consider:

Pruning: Start with a large network, then reduce the layers and hidden units, while keeping track of cross-validation set performance.

Growing: Start with a very small network, then add units and layers, and again keep track of CV set performance.


O
Ove

It is very difficult to choose the number of neurons in a hidden layer, and to choose the number of hidden layers in your neural network.

Usually, for most applications, one hidden layer is enough. Also, the number of neurons in that hidden layer should be between the number of inputs (10 in your example) and the number of outputs (5 in your example).

But the best way to choose the number of neurons and hidden layers is experimentation. Train several neural networks with different numbers of hidden layers and hidden neurons, and measure the performance of those networks using cross-validation. You can stick with the number that yields the best performing network.


m
mlstudent

Recently there is theoretical work on this https://arxiv.org/abs/1809.09953. Assuming you use a RELU MLP, all hidden layers have the same number of nodes and your loss function and true function that you're approximating with a neural network obey some technical properties (in the paper), you can choose your depth to be of order $\log(n)$ and your width of hidden layers to be of order $n^{d/(2(\beta+d))}\log^2(n)$. Here $n$ is your sample size, $d$ is the dimension of your input vector, and $\beta$ is a smoothness parameter for your true function. Since $\beta$ is unknown, you will probably want to treat it as a hyperparameter.

Doing this you can guarantee that with probability that converges to $1$ as function of sample size your approximation error converges to $0$ as a function of sample size. They give the rate. Note that this isn't guaranteed to be the 'best' architecture, but it can at least give you a good place to start with. Further, my own experience suggests that things like dropout can still help in practice.