
Neural network always predicts the same class

I'm trying to implement a neural network that classifies images into one of two discrete categories. The problem, however, is that it currently always predicts 0 for any input, and I'm not really sure why.

Here's my feature extraction method:

def extract(file):
    # Resize and subtract mean pixel
    img = cv2.resize(cv2.imread(file), (224, 224)).astype(np.float32)
    img[:, :, 0] -= 103.939
    img[:, :, 1] -= 116.779
    img[:, :, 2] -= 123.68
    # Normalize features
    img = (img.flatten() - np.mean(img)) / np.std(img)

    return np.array([img])

Here's my gradient descent routine:

def fit(x, y, t1, t2):
    """Training routine"""
    ils = x.shape[1] if len(x.shape) > 1 else 1
    labels = len(set(y))

    if t1 is None or t2 is None:
        t1 = randweights(ils, 10)
        t2 = randweights(10, labels)

    params = np.concatenate([t1.reshape(-1), t2.reshape(-1)])
    res = grad(params, ils, 10, labels, x, y)
    params -= 0.1 * res

    return unpack(params, ils, 10, labels)

Here are my forward and backward (gradient) propagation routines:

def forward(x, theta1, theta2):
    """Forward propagation"""

    m = x.shape[0]

    # Forward prop
    a1 = np.vstack((np.ones([1, m]), x.T))
    z2 = np.dot(theta1, a1)

    a2 = np.vstack((np.ones([1, m]), sigmoid(z2)))
    a3 = sigmoid(np.dot(theta2, a2))

    return (a1, a2, a3, z2, m)

def grad(params, ils, hls, labels, x, Y, lmbda=0.01):
    """Compute gradient for hypothesis Theta"""

    theta1, theta2 = unpack(params, ils, hls, labels)

    a1, a2, a3, z2, m = forward(x, theta1, theta2)
    d3 = a3 - Y.T
    print('Current error: {}'.format(np.mean(np.abs(d3))))

    d2 = np.dot(theta2.T, d3) * (np.vstack([np.ones([1, m]), sigmoid_prime(z2)]))
    d3 = d3.T
    d2 = d2[1:, :].T

    t1_grad = np.dot(d2.T, a1.T)
    t2_grad = np.dot(d3.T, a2.T)

    theta1[0] = np.zeros([1, theta1.shape[1]])
    theta2[0] = np.zeros([1, theta2.shape[1]])

    t1_grad = t1_grad + (lmbda / m) * theta1
    t2_grad = t2_grad + (lmbda / m) * theta2

    return np.concatenate([t1_grad.reshape(-1), t2_grad.reshape(-1)])

And here's my prediction function:

def predict(theta1, theta2, x):
    """Predict output using learned weights"""
    m = x.shape[0]

    h1 = sigmoid(np.hstack((np.ones([m, 1]), x)).dot(theta1.T))
    h2 = sigmoid(np.hstack((np.ones([m, 1]), h1)).dot(theta2.T))

    return h2.argmax(axis=1)

I can see that the error rate is gradually decreasing with each iteration, generally converging somewhere around 1.26e-05.

What I've tried so far:

PCA

Different datasets (Iris from sklearn and handwritten digits from the Coursera ML course, achieving about 95% accuracy on both). However, both of those were processed as a full batch, so I can assume that my general implementation is correct, but something is wrong with either how I extract features or how I train the classifier.

Tried sklearn's SGDClassifier and it didn't perform much better, giving me ~50% accuracy. So something is wrong with the features, then?
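As a rough way to check whether the extracted features carry any signal at all, one can cross-validate a simple linear baseline on them. This is only a sketch: X and y here are stand-in data from make_classification; in practice X would be the stacked outputs of extract() and y the 0/1 labels.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; replace with the stacked extract() outputs and real labels.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
print(cross_val_score(clf, X, y, cv=5))  # scores hovering around 0.5 suggest uninformative features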

Edit: An average output of h2 looks like the following:

[0.5004899   0.45264441]
[0.50048522  0.47439413]
[0.50049019  0.46557124]
[0.50049261  0.45297816]

So, very similar sigmoid outputs for all validation examples.

One thought: are you randomizing your training set? If there are a bunch of 0-class examples in the first batches, it's possible the network becomes focused on them very early.
The data is ordered, i.e.: 10,000 0s, then 10,000 1s.
Just realized you said 'batch'. I think I was confusing it with 'mini-batch', where this is a common problem. I will need to think about this some more.
Just FYI: I tried randomizing the input data and the result is still the same.
Try returning the raw h2 values from your final predict call. Are they all the same as well?

Martin Thoma

My network always predicts the same class. What is the problem?

I had this a couple of times. Although I'm currently too lazy to go through your code, I think I can give some general hints which might also help others who have the same symptom but probably different underlying problems.

Debugging Neural Networks

Fitting one-item datasets

For every class i the network should be able to predict, try the following:

Create a dataset of only one data point of class i. Fit the network to this dataset. Does the network learn to predict "class i"?
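A minimal sketch of that check, assuming a Keras/TensorFlow setup (any framework works the same way): build a tiny model, fit it on a single labelled example, and verify the prediction saturates towards that label.

import numpy as np
from tensorflow import keras

x_one = np.random.rand(1, 32).astype("float32")  # a single data point (toy features)
y_one = np.array([[1.0]])                        # its class-i label

model = keras.Sequential([
    keras.layers.Dense(10, activation="sigmoid", input_shape=(32,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="binary_crossentropy")
model.fit(x_one, y_one, epochs=500, verbose=0)
print(model.predict(x_one))  # should approach 1.0; if it doesn't, the training code or model is broken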

If this doesn't work, there are four possible error sources:

Buggy training algorithm: Try a smaller model, print a lot of the values calculated in between, and see if those match your expectations. Two typical sub-cases: dividing by 0 (add a small number to the denominator) and taking the logarithm of 0 or of a negative number (same idea).

Data: It is possible that your data has the wrong type. For example, it might be necessary for your data to be of type float32 when it is actually integer.

Model: It is also possible that you just created a model which cannot possibly predict what you want. This should be revealed when you try simpler models.

Initialization / Optimization: Depending on the model, your initialization and your optimization algorithm might play a crucial role. For beginners who use standard stochastic gradient descent, I would say it is mainly important to initialize the weights randomly (each weight a different value). - see also: this question / answer
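Two of those points as tiny sketches: clipping before a logarithm so log(0) and division by 0 cannot occur, and one possible randweights helper that gives every weight a different small random value (the shape is an assumption, matching forward() in the question where a bias row is prepended).

import numpy as np

EPS = 1e-8

def safe_log_loss(y_true, y_pred):
    """Clip predictions away from 0 and 1 so the logarithm never blows up."""
    y_pred = np.clip(y_pred, EPS, 1 - EPS)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def randweights(n_in, n_out, scale=0.12):
    """Small random weights, each one different; the +1 column is for the bias term."""
    return np.random.uniform(-scale, scale, size=(n_out, n_in + 1))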

Learning Curve

See sklearn for details.

https://i.stack.imgur.com/mhAUB.png

The idea is to start with a tiny training dataset (probably only one item). The model should then be able to fit the data perfectly. If this works, make a slightly larger dataset. At some point your training error should go up slightly. This reveals your model's capacity to model the data.
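A sketch of the sklearn utility behind that plot; the LogisticRegression estimator and the make_classification data are stand-ins for your own model and features.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
print(sizes)
print(train_scores.mean(axis=1))  # training score should start near perfect and slowly drop
print(val_scores.mean(axis=1))    # validation score should climb as data is added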

Data analysis

Check how often the other class(es) appear. If one class dominates the others (e.g. one class is 99.9% of the data), this is a problem. Look for "outlier detection" techniques.
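A quick check of the class distribution (y here is a stand-in for whatever label array you train on):

import numpy as np

y = np.array([0, 0, 0, 1, 0, 1, 0, 0])              # stand-in labels
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes, counts / counts.sum())))     # e.g. {0: 0.75, 1: 0.25}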

More

Learning rate: If your network doesn't improve and gets only slightly better than random chance, try reducing the learning rate. For computer vision, a learning rate of 0.001 is often used / works. This is also relevant if you use Adam as an optimizer.

Preprocessing: Make sure you use the same preprocessing for training and testing. You might see differences in the confusion matrix (see this question)
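The "always the same class" symptom shows up clearly in a confusion matrix; a sketch with stand-in labels and predictions:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0]      # a collapsed classifier fills a single column
print(confusion_matrix(y_true, y_pred))
# [[3 0]
#  [3 0]]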

Common Mistakes

This is inspired by reddit:

You forgot to apply preprocessing

Dying ReLU

Too small / too big learning rate

Wrong activation function in the final layer:

Your targets do not sum to one? -> Don't use softmax

Single elements of your targets are negative -> Don't use softmax, ReLU, or sigmoid; tanh might be an option

Too deep network: You fail to train. Try a simpler neural network first.

Vastly unbalanced data: You might want to look into imbalanced-learn
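Two common remedies for unbalanced data, sketched with stand-in arrays: resampling with imbalanced-learn, or computing class weights with sklearn and passing them to the training routine.

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.utils.class_weight import compute_class_weight

X = np.random.rand(100, 5)
y = np.array([0] * 90 + [1] * 10)                  # heavily skewed labels

X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))                          # [90 90] -- classes are now balanced

weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))            # larger weight for the rare class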


"Single elements of your targets are negative -> Don't use Softmax, ReLU, Sigmoid. tanh might be an option." Can you please suggest the correct activation function for this case, then?
Did you see that I suggested tanh? What else do you expect? (You can always design your own; sometimes linear is also a good option.)
I misread. I thought tanh was in the list of functions not to use. Maybe it should be "Tanh", as it is the first word in the sentence.
"Your targets are not in sum one? -> Don't use softmax" and "Single elements of your targets are negative" - what exactly do you mean by "targets"?
Thank you sir, I had a learning rate problem. You saved me. As a general debugging or model-building step, to anyone who reads this, I'd always suggest starting with something trivial and building your model while changing only one thing at a time, keeping an eye on your metric. With too many variables, it always gets trickier to pinpoint an issue.
Yurii Dolhikh

After a week and a half of research I think I understand what the issue is. There is nothing wrong with the code itself. The only two issues that prevent my implementation from classifying successfully are time spent learning and proper selection of learning rate / regularization parameters.

I've had the learning routine running for some time now, and it's already pushing 75% accuracy, though there is still plenty of room for improvement.


Could you tell me how long you had it running before you noticed the change? I am having some problems myself, but even after some more time it didn't seem to correct itself and still only predicted the same class over and over.
Same problem here too.
I was having the same problem. I tried using a learning rate scheduler and trained for more epochs, and managed to overfit my data with 100% accuracy after 500 epochs.
For anyone with the same issue: you need to spend more time tweaking the LR -- this is the answer.
Is it typically a learning rate that is too high or too low that causes this problem?
Tommaso Di Noto

The same happened to me. I had an imbalanced dataset (about a 66%-33% sample distribution between classes 0 and 1, respectively), and the net was always outputting 0.0 for all samples after the first iteration.

My problem was simply a too high learning rate. Switching it to 1e-05 solved the issue.

More generally, what I suggest is to print, before the parameter update:

your net output (for one batch)

the corresponding label (for the same batch)

the value of the loss (on the same batch) either sample by sample or aggregated.

Then check the same three items after the parameter update. What you should see in the next batch is a gradual change in the net output. When my learning rate was too high, the net output would shoot to either all 1.0s or all 0.0s for every sample in the batch as early as the second iteration.
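A minimal sketch of that check with Keras (toy model and data; any framework with per-batch training works the same way): print the outputs and the loss for one batch, run a single update, and print them again.

import numpy as np
from tensorflow import keras

model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid", input_shape=(4,))])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5), loss="binary_crossentropy")

x_batch = np.random.rand(8, 4).astype("float32")   # one toy mini-batch
y_batch = np.random.randint(0, 2, size=(8, 1))

print("output before update:", model.predict_on_batch(x_batch).ravel())
print("labels:              ", y_batch.ravel())
loss = model.train_on_batch(x_batch, y_batch)       # a single parameter update
print("batch loss:", loss)
print("output after update: ", model.predict_on_batch(x_batch).ravel())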


Urmay Shah

The same happened to me. Mine was with the deeplearning4j Java library for image classification. It kept giving the final output of the last training folder for every test. I was able to solve it by decreasing the learning rate.

Approaches that can be used:

Lowering the learning rate. (Mine was 0.01 at first; lowering it to 1e-4 worked.)

Increasing the batch size. (Sometimes stochastic gradient descent doesn't work; then you can try a larger batch size: 32, 64, 128, 256, ...)

Shuffling the training data.
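The same three knobs, sketched in Keras with toy data (the answer itself used deeplearning4j): a lowered learning rate, a larger batch size, and shuffling every epoch.

import numpy as np
from tensorflow import keras

x_train = np.random.rand(512, 20).astype("float32")   # stand-in data
y_train = np.random.randint(0, 2, size=(512, 1))

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),   # lowered from 1e-2
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, shuffle=True, epochs=5, verbose=0)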


In my case, I solved it with your solution. I switched the learning rate from 0.001 to 0.0001. Thanks.
Thanks for your suggestion. The batch size was the problem in my model.
Lucky Ning

I came across the problem that the model always predicts the same label. It confused me for a week. At last, I solved it by replacing the ReLU with another activation function. ReLU can cause the "dying ReLU" problem.

Before I solved the problem, I tried:

checking the positive-to-negative sample ratio and changing it from 1:25 to 1:3, but it didn't work;

changing the batch size, learning rate, and loss, but it didn't work.

Finally, I found that decreasing the learning rate from 0.005 to 0.0002 was already enough.
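One way to side-step dying ReLUs, sketched in Keras (toy layer sizes, learning rate as in the answer): swap plain ReLU activations for LeakyReLU layers.

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, input_shape=(20,)),
    keras.layers.LeakyReLU(alpha=0.1),          # small negative slope keeps gradients flowing
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=2e-4),   # 0.0002, as above
              loss="binary_crossentropy", metrics=["accuracy"])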


Atik Faysal

The same happened to me. The model was predicting only one class for a seven-class CNN. I tried changing the activation function and the batch size, but nothing worked. Then changing the learning rate worked for me too.

opt = keras.optimizers.Adam(learning_rate=1e-06)

As you can see, I had to choose a very low learning rate. My number of training samples is 5250 and validation samples 1575.


LiNKeR

Just in case someone else encounters this problem: mine was with a deeplearning4j LeNet (CNN) architecture. It kept giving the final output of the last training folder for every test. I was able to solve it by increasing my batch size and shuffling the training data so that each batch contained samples from more than one folder. My data class had a batch size of 1, which was really dangerous.

Edit: Another thing I observed recently is having a limited set of training samples per class despite having a large dataset, e.g. training a neural network to recognise human faces but having at most, say, 2 different images per person while the dataset covers, say, 10,000 people, giving 20,000 faces in total. A better dataset would be 1,000 different images for each of 10,000 people, giving 10,000,000 faces in total. This matters if you want to avoid overfitting to one class, so that your network can generalise and produce better predictions.


Yinon_90

I also had the same problem. I do binary classification using transfer learning with ResNet50, and I was able to solve it by replacing:

Dense(output_dim=2048, activation='relu')

with

Dense(output_dim=128, activation='relu')

and also by removing the Keras augmentation and retraining the last layers of ResNet50.
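A sketch of that kind of transfer-learning head in current Keras (Dense(units) instead of the older output_dim argument; layer sizes as in the answer, the rest is assumed):

from tensorflow import keras

base = keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                                        # freeze the pretrained backbone

x = keras.layers.Dense(128, activation="relu")(base.output)   # smaller head instead of 2048 units
out = keras.layers.Dense(1, activation="sigmoid")(x)          # binary classification
model = keras.Model(base.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])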


glitch

After trying many solutions, it turned out that the problem for me was with the prediction phase, not with the training or the model architecture. The method I was using for prediction was showing zeros for all cases, even though I had relatively high validation accuracy, because of this line:

predicted_class_indices = np.argmax(scores, axis=1)

If you are dealing with binary classification, try:

predict = model.predict(
    validation_generator, steps=None, callbacks=None, max_queue_size=10, workers=1,
    use_multiprocessing=False, verbose=0
)
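Then threshold the raw sigmoid outputs instead of taking an argmax: with a single output unit, np.argmax(scores, axis=1) is always 0, which is exactly the "always the same class" symptom. A sketch with example scores:

import numpy as np

scores = np.array([[0.07], [0.61], [0.93]])                    # example sigmoid outputs, shape (n, 1)
predicted_class_indices = (scores.ravel() > 0.5).astype(int)   # -> [0, 1, 1]
print(predicted_class_indices)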

K.Steven

The top answer really worked for me. My situation: while training a bert4reco model on a large dataset (4 million+ samples), the accuracy and log loss stayed between 0.5 and 0.8 throughout the whole epoch (it took 8 hours, and I printed the results every 100 steps). Then I used a very small dataset and a smaller model, and finally it worked! The model started to learn something; accuracy and log loss began to improve and reached convergence after 300 epochs!

In conclusion, the top answer is a good checklist for this kind of question. And sometimes, if you cannot see any changes at the beginning of training, it may simply take a long time for your model to really learn something. It is better to use a mini dataset to verify this, and after that you can wait for it to learn or use more capable hardware such as GPUs or TPUs.

