Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

python tensorflow keras

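In TensorFlow/Keras, when running the code from https://github.com/pierluigiferrari/ssd_keras, using the evaluator ssd300_evaluation, I received this error:

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

This is very similar to the unsolved question: Google Colab Error : Failed to get convolution algorithm.This is probably because cuDNN failed to initialize

For the issue, I am running:

Python: 3.6.4.

TensorFlow version: 1.12.0.

Keras version: 2.2.4.

CUDA: V10.0.

cuDNN: V7.4.1.5.

NVIDIA GeForce GTX 1080.

I also ran: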

import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
with tf.Session() as sess:
    print(sess.run(c))

This ran with no errors or issues.

The minimal example is:

 from keras import backend as K
 from keras.models import load_model
 from keras.optimizers import Adam
 from scipy.misc import imread
 import numpy as np
 from matplotlib import pyplot as plt

 from models.keras_ssd300 import ssd_300
 from keras_loss_function.keras_ssd_loss import SSDLoss
 from keras_layers.keras_layer_AnchorBoxes import AnchorBoxes
 from keras_layers.keras_layer_DecodeDetections import DecodeDetections
 from keras_layers.keras_layer_DecodeDetectionsFast import DecodeDetectionsFast
 from keras_layers.keras_layer_L2Normalization import L2Normalization
 from data_generator.object_detection_2d_data_generator import DataGenerator
 from eval_utils.average_precision_evaluator import Evaluator
 import tensorflow as tf
 %matplotlib inline
 import keras
 keras.__version__



 # Set a few configuration parameters.
 img_height = 300
 img_width = 300
 n_classes = 20
 model_mode = 'inference'


 K.clear_session() # Clear previous models from memory.

 model = ssd_300(image_size=(img_height, img_width, 3),
            n_classes=n_classes,
            mode=model_mode,
            l2_regularization=0.0005,
            scales=[0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05], # The scales for MS COCO [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05]
            aspect_ratios_per_layer=[[1.0, 2.0, 0.5],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5],
                                     [1.0, 2.0, 0.5]],
            two_boxes_for_ar1=True,
            steps=[8, 16, 32, 64, 100, 300],
            offsets=[0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
            clip_boxes=False,
            variances=[0.1, 0.1, 0.2, 0.2],
            normalize_coords=True,
            subtract_mean=[123, 117, 104],
            swap_channels=[2, 1, 0],
            confidence_thresh=0.01,
            iou_threshold=0.45,
            top_k=200,
            nms_max_output_size=400)

 # 2: Load the trained weights into the model.

 # TODO: Set the path of the trained weights.
 weights_path = 'C:/Users/USAgData/TF SSD Keras/weights/VGG_VOC0712Plus_SSD_300x300_iter_240000.h5'

 model.load_weights(weights_path, by_name=True)

 # 3: Compile the model so that Keras won't complain the next time you load it.

 adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

 ssd_loss = SSDLoss(neg_pos_ratio=3, alpha=1.0)

 model.compile(optimizer=adam, loss=ssd_loss.compute_loss)


dataset = DataGenerator()

# TODO: Set the paths to the dataset here.
dir= "C:/Users/USAgData/TF SSD Keras/VOC/VOCtest_06-Nov-2007/VOCdevkit/VOC2007/"
Pascal_VOC_dataset_images_dir = dir+ 'JPEGImages'
Pascal_VOC_dataset_annotations_dir = dir + 'Annotations/'
Pascal_VOC_dataset_image_set_filename = dir+'ImageSets/Main/test.txt'

# The XML parser needs to know what object class names to look for and in which order to map them to integers.
classes = ['background',
           'aeroplane', 'bicycle', 'bird', 'boat',
           'bottle', 'bus', 'car', 'cat',
           'chair', 'cow', 'diningtable', 'dog',
           'horse', 'motorbike', 'person', 'pottedplant',
           'sheep', 'sofa', 'train', 'tvmonitor']

dataset.parse_xml(images_dirs=[Pascal_VOC_dataset_images_dir],
                  image_set_filenames=[Pascal_VOC_dataset_image_set_filename],
                  annotations_dirs=[Pascal_VOC_dataset_annotations_dir],
                  classes=classes,
                  include_classes='all',
                  exclude_truncated=False,
                  exclude_difficult=False,
                  ret=False)



evaluator = Evaluator(model=model,
                      n_classes=n_classes,
                      data_generator=dataset,
                      model_mode=model_mode)



results = evaluator(img_height=img_height,
                    img_width=img_width,
                    batch_size=8,
                    data_generator_mode='resize',
                    round_confidences=False,
                    matching_iou_threshold=0.5,
                    border_pixels='include',
                    sorting_algorithm='quicksort',
                    average_precision_mode='sample',
                    num_recall_points=11,
                    ignore_neutral_boxes=True,
                    return_precisions=True,
                    return_recalls=True,
                    return_average_precisions=True,
                    verbose=True)

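If you are using a Conda environment: in my case the problem was solved by installing tensorflow-gpu and not installing CUDAtoolkit or cuDNN, because they are already installed by tensorflow-gpu (see this answer). Note, however, that newer conda tensorflow-gpu versions might not install CUDAtoolkit or cuDNN. The solution then is to install a lower version of tensorflow-gpu and upgrade it with pip (see this answer).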

waterproof

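I have seen this error message for three different reasons, each with a different solution:

1. You have cache issues

I regularly work around this error by shutting down my Python processes, removing the ~/.nv directory (on Linux, rm -rf ~/.nv), and restarting the Python processes. I don't know exactly why this works. It is probably at least partly related to the second option: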
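For anyone who prefers to do the cache cleanup from Python rather than the shell, here is a minimal sketch. The function name clear_nv_cache and its cache_dir parameter are made up for this illustration; the only thing taken from the answer is the ~/.nv location. Run it only after the Python processes using the GPU have been shut down.

```python
import shutil
from pathlib import Path


def clear_nv_cache(cache_dir=None):
    """Delete the NVIDIA cache directory; return True if something was removed."""
    # Default to ~/.nv, the directory the answer removes with `rm -rf ~/.nv`.
    target = Path(cache_dir) if cache_dir is not None else Path.home() / ".nv"
    if not target.is_dir():
        return False  # nothing to clear
    shutil.rmtree(target)
    return True
```

Calling clear_nv_cache() with no argument targets ~/.nv; passing a path is mainly useful for trying the function out on a scratch directory first.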

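2. You're out of memory

The error can also show up if you run out of graphics card RAM. With an NVIDIA GPU you can check graphics card memory usage with nvidia-smi. This gives you a readout of how much GPU RAM is in use (something like 6025MiB / 6086MiB if you are almost at the limit) as well as a list of which processes are using GPU RAM.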
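If you want the same readout inside a script, the sketch below asks nvidia-smi for just the used and total memory in CSV form and parses it. The helper names are invented for this example, and it assumes nvidia-smi is on PATH; when it is missing or fails, the query simply returns an empty list.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of int tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return parse_gpu_memory(out)
    except (subprocess.CalledProcessError, OSError, ValueError):
        return []  # driver not loaded, or a field that is not a plain number


for index, (used, total) in enumerate(query_gpu_memory()):
    print(f"GPU {index}: {used} MiB / {total} MiB in use")
```

A reading close to the total before your model has even started is a strong hint that another process is holding the memory.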

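If you have run out of RAM, you will need to restart the process (which should free up the RAM) and then take a less memory-intensive approach. A few options are:

reducing your batch size

using a simpler model

using less data

limiting the TensorFlow GPU memory fraction: for example, the following will make sure TensorFlow uses <= 90% of your RAM: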

import keras
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # 0.6 sometimes works better for folks
keras.backend.tensorflow_backend.set_session(tf.Session(config=config))

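This can slow down your model evaluation if it is not used together with the items above, presumably because the large data set has to be swapped in and out to fit into the small amount of memory you have allocated.

A second option is to have TensorFlow start out using only a minimal amount of memory and then allocate more as needed (documented here):

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'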

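3. You have incompatible versions of CUDA, TensorFlow, NVIDIA drivers, etc.

If you have never had similar models working, you are not running out of VRAM, and your cache is clean, I would go back and set up CUDA + TensorFlow using the best available installation guide. I have had the most success following the instructions at https://www.tensorflow.org/install/gpu rather than those on the NVIDIA / CUDA site. Lambda Stack is also a good way to go.

I'm upvoting this answer because, in my case, I was simply out of memory.

In my case it was incompatible versions. The instructions at tensorflow.org/install/gpu are accurate if you pay close attention to operators such as = or >=. Initially I assumed "equal or newer", but with TensorFlow 2.2 (which seems to need to be treated like 2.1) you need exactly CUDA 10.1 and a cuDNN >= 7.6 that is compatible with CUDA 10.1 (currently that is only 7.6.5, of which there are two different builds, one for CUDA 10.2 and one for 10.1).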
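To make the "=" versus ">=" distinction concrete, here is a small sketch that checks installed versions against a requirement table. The table only encodes what the comment above reports for TensorFlow 2.2 (CUDA exactly 10.1, cuDNN at least 7.6); it is not an official compatibility list, and the function names are made up for this example.

```python
def parse_version(text):
    """Turn '7.6.5' into (7, 6, 5) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, operator, required):
    """Check an installed version against a requirement such as ('==', '10.1')."""
    have, need = parse_version(installed), parse_version(required)
    if operator == "==":
        # Compare only the components the requirement spells out,
        # so '10.1.243' counts as CUDA 10.1.
        return have[:len(need)] == need
    if operator == ">=":
        return have >= need
    raise ValueError(f"unsupported operator: {operator}")


# Assumed requirement table, taken from the comment above (TensorFlow 2.2).
REQUIREMENTS = {"cuda": ("==", "10.1"), "cudnn": (">=", "7.6")}


def check(installed):
    """Return the names of components whose installed version breaks a requirement."""
    return [name for name, (op, req) in REQUIREMENTS.items()
            if not satisfies(installed[name], op, req)]


print(check({"cuda": "10.1.243", "cudnn": "7.6.5"}))  # []
print(check({"cuda": "10.2", "cudnn": "7.6.5"}))      # ['cuda']
```

The second call shows the trap described above: CUDA 10.2 is newer than 10.1, but an "exactly 10.1" requirement still rejects it.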

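It was memory for me too. Thanks for the in-depth explanation.

In my case, it was out of memory. Your 0.6 code worked for me [per_process_gpu_memory_fraction = 0.6]. Thanks

I was out of memory all along. Background processes were hogging all of my GPU memory. Cross-check the process IDs with htop and nvidia-smi.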

Bensuperpc

I had the same problem, and I solved it with:

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

or

physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
   tf.config.experimental.set_memory_growth(physical_devices[0], True)
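The two variants above can be combined into one start-up snippet. The following is a sketch under a few assumptions: TensorFlow 2.x is installed (the import is guarded so the snippet still loads without it), the code runs before anything else touches the GPU, and the helper name enable_memory_growth is invented here. Unlike the snippet above, it loops over every visible GPU instead of only the first.

```python
import os

# Must be set before TensorFlow initializes the GPU, so do it before the import.
os.environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")

try:
    import tensorflow as tf
except ImportError:  # keeps the sketch loadable on machines without TensorFlow
    tf = None


def enable_memory_growth():
    """Turn on memory growth for every visible GPU; return how many were configured."""
    if tf is None:
        return 0
    configured = 0
    for gpu in tf.config.experimental.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as err:
            # Raised when the GPU has already been initialized in this process.
            print(f"Could not set memory growth on {gpu.name}: {err}")
    return configured


print("GPUs with memory growth enabled:", enable_memory_growth())
```

If the function reports 0 on a machine that has a GPU, TensorFlow is not seeing the device at all, which points to an installation or version problem rather than a memory one.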

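The first solution solved it like magic, so it probably does not address the root of the problem.

This seems to be a very common problem at the moment; I found similar solutions on GitHub and Medium. It worked for me too, so it may be an issue with the current TF or cuDNN version rather than an incorrect installation. It is a problem specific to CNN layers, regardless of their size. Other operations/layers are fine.

The first solution worked well for me too.

Thanks! This solution worked for me as well. I had just used the recipe from the top-voted answer here (except for the reinstall), but it did not work. I think it would be a good idea to build one recipe out of all the measures described in this thread, to consolidate them.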

gatefun

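I hit this error and fixed it by uninstalling all CUDA and cuDNN versions from my system. Then I installed CUDA Toolkit 9.0 (without any patches) and cuDNN v7.4.1 for CUDA 9.0.

You can also downgrade the TensorFlow version.

I had the same error. The cause is a mismatch between your CUDA/cuDNN version and your TensorFlow version. There are two ways to solve it: either downgrade your TensorFlow version with pip install --upgrade tensorflow-gpu==1.8.0, or follow the steps at tensorflow.org/install/gpu. Tip: choose your Ubuntu version and follow the steps. :-)

For me it was a mismatch between CUDA and cuDNN. Replacing the cuDNN libraries with a matching version solved the problem.

This is not an actual solution; it just somehow helps you look at stackoverflow.com/questions/53698035/… for the actual solution.

How do I download CUDA Toolkit 9.0 for Windows 10?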

Rheatey Bash

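I had the same problem with TensorFlow 2.4 and CUDA 11.0 with cuDNN v8.0.4. I wasted almost 2 to 3 days on it. The problem was simply a driver mismatch. I had been installing CUDA 11.0 Update 1; I thought that, being Update 1, it would probably work well, but it was the culprit. I uninstalled CUDA 11.0 Update 1 and installed the version without the update. Here is the list of drivers that worked for TensorFlow 2.4 on an RTX 2060 6GB GPU.

cuDNN v8.0.4 for CUDA 11.0: select your preferred OS and download

CUDA Toolkit 11.0: select your OS

The list of required hardware and software is given here

I also had to do this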

import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU') 
tf.config.experimental.set_memory_growth(physical_devices[0], True)

to avoid this error

2020-12-23 21:54:14.971709: I tensorflow/stream_executor/stream.cc:1404] [stream=000001E69C1DA210,impl=000001E6A9F88E20] did not wait for [stream=000001E69C1DA180,impl=000001E6A9F88730]
2020-12-23 21:54:15.211338: F tensorflow/core/common_runtime/gpu/gpu_util.cc:340] CPU->GPU Memcpy failed
[I 21:54:16.071 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
kernel 8b907ea5-33f1-4b2a-96cc-4a7a4c885d74 restarted
kernel 8b907ea5-33f1-4b2a-96cc-4a7a4c885d74 restarted

These are some samples of the errors I was getting

Type 1

UnpicklingError: invalid load key, 'H'.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-2-f049ceaad66a> in <module>

Type 2


InternalError: Blas GEMM launch failed : a.shape=(15, 768), b.shape=(768, 768), m=15, n=768, k=768 [Op:MatMul]

During handling of the above exception, another exception occurred:

Type 3

failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.534375: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.534683: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.534923: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.539327: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.539523: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.539665: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_fused_impl.h:697 : Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

Works like a charm. Thanks.

Gahan

Keras is included in TensorFlow 2.0 and above. So

remove import keras and

replace from keras.module.module import class statements with --> from tensorflow.keras.module.module import class
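The two steps above are mechanical enough to script. Below is a rough sketch that applies them to a source string; migrate_keras_imports is a name invented for this example, not part of any library. It only touches import lines, so code that still refers to the keras name afterwards (for example keras.__version__) has to be adjusted by hand.

```python
import re


def migrate_keras_imports(source):
    """Rewrite standalone-Keras imports to their tensorflow.keras equivalents."""
    migrated = []
    for line in source.splitlines():
        if re.fullmatch(r"\s*import keras\s*", line):
            continue  # drop the bare 'import keras', as the answer suggests
        # 'from keras.x import y' or 'from keras import y' -> 'from tensorflow.keras...'
        # Project modules such as keras_layers are left alone: '_' is not matched.
        line = re.sub(r"^(\s*)from keras(\.|\s)", r"\1from tensorflow.keras\2", line)
        migrated.append(line)
    return "\n".join(migrated)


example = "import keras\nfrom keras.models import load_model\nfrom keras import backend as K"
print(migrate_keras_imports(example))
```

Mixing standalone keras and tensorflow.keras objects in one program is a common source of confusing errors, which is presumably why the answer recommends switching every import at once.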

Maybe your GPU memory is full. So use allow_growth = True in the GPU options. This is deprecated now, but using the code snippet below after the imports may solve your problem.

import tensorflow as tf
from tensorflow.compat.v1.keras.backend import set_session
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
config.log_device_placement = True  # to log device placement (on which device the operation ran)
sess = tf.compat.v1.Session(config=config)
set_session(sess)

Thanks for the perfect answer! It helped me a lot.

Mainak Dutta

The problem is that newer versions of TensorFlow (1.10.x and above) are incompatible with cuDNN 7.0.5 and CUDA 9.0. The easiest fix is to downgrade TensorFlow to 1.8.0

pip install --upgrade tensorflow-gpu==1.8.0

Ralph Bisschops

This is a follow-up to point 2 of https://stackoverflow.com/a/56511889/2037998.

2. You're out of memory

I used the following code to limit GPU RAM usage:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1*X GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=(1024*4))])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

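This code sample comes from: TensorFlow: Use a GPU: Limiting GPU memory growth. Put this code before any other TF/Keras code you are using.

Note: the application might still use a bit more GPU RAM than the number above.

Note 2: if the system also runs other applications (such as a UI), these programs can also consume some GPU RAM (Xorg, Firefox, ... sometimes up to 1GB of GPU RAM combined).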
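The two notes suggest choosing memory_limit with some slack rather than the card's full size. Here is the arithmetic as a small sketch; the function name, the 512 MB default headroom, and the example figures (an 8 GB card with about 1 GB used by desktop applications) are all assumptions for illustration.

```python
def pick_memory_limit_mb(total_mb, used_by_others_mb, headroom_mb=512):
    """Suggest a memory_limit (in MB) that leaves room for other GPU users."""
    limit = total_mb - used_by_others_mb - headroom_mb
    if limit <= 0:
        raise ValueError("no GPU memory left after reserving space for other processes")
    return limit


# Assumed figures: 8192 MB card, roughly 1024 MB taken by Xorg and a browser.
print(pick_memory_limit_mb(total_mb=8192, used_by_others_mb=1024))  # 6656
```

The result would go in place of the (1024*4) value passed as memory_limit in the snippet above.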

Vidit Varshney

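I got the same error. The cause is a mismatch between your CUDA/cuDNN version and your TensorFlow version. There are two ways to solve it:

either downgrade your TensorFlow version with pip install --upgrade tensorflow-gpu==1.8.0, or follow the steps here. Tip: choose your Ubuntu version and follow the steps. :-)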

Gangadhar S

I had the same problem with an RTX 2080. The following code then worked for me.

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

Haziq Sheikh

I had the same problem, but adding these lines of code at the start solved it for me:

import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

Works for TensorFlow V2.

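With tensorflow-gpu 2.2, CUDA 10.2 and cuDNN 7.4.2 on CentOS 7 this did not work for me, and the error wants me to install cuDNN 7.6.4

@MonaJalal You can either downgrade TensorFlow or upgrade your cuDNN; to check compatibility, see this link: tensorflow.org/install/source#gpu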

Karthikeyan Sise

Just add

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

Add from tensorflow.compat.v1 import ConfigProto

RadV

I ran into this problem after upgrading to TF 2.0. The following started giving the error:

   outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")

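I am using Ubuntu 16.04.6 LTS (Azure Data Science VM) and TensorFlow 2.0. I upgraded as per the instructions on this TensorFlow GPU instructions page. That resolved the issue for me. By the way, it is a bunch of apt-get updates/installs, and I executed all of them.

I did the same for Ubuntu 18.04 and everything works now. But now when I run nvidia-smi in the terminal it shows CUDA 10.2, whereas here it says TensorFlow 2.0 is compatible with CUDA 10.0. I don't understand how it all works. The output of which nvcc in the terminal gives /usr/local/cuda-10.0/bin/nvcc

So I think there are 2 separate CUDAs, one for the NVIDIA driver and another for the base environment.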
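The two numbers do come from different places: the nvidia-smi banner reports the highest CUDA version the installed driver supports, while nvcc --version reports the toolkit that is actually installed. The sketch below pulls both numbers out of sample text shaped like each tool's output; the sample strings are illustrative, not captured from the commenter's machine, and the function names are made up.

```python
import re


def driver_cuda_version(nvidia_smi_output):
    """Extract the 'CUDA Version: X.Y' field from the nvidia-smi banner."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", nvidia_smi_output)
    return match.group(1) if match else None


def toolkit_cuda_version(nvcc_output):
    """Extract the 'release X.Y' field printed by `nvcc --version`."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Sample text shaped like the two tools' output for the situation in the comment.
smi_banner = "| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |"
nvcc_text = "Cuda compilation tools, release 10.0, V10.0.130"

print("driver supports up to CUDA", driver_cuda_version(smi_banner))  # 10.2
print("installed toolkit is CUDA", toolkit_cuda_version(nvcc_text))   # 10.0
```

A driver that supports 10.2 can run a 10.0 toolkit, so this combination is consistent with TensorFlow 2.0 working.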

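I think that must be it. I did not look closely at the CUDA version displayed. My environment has changed since then, so I can no longer check. Interesting information. Thank you.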

Emrullah Çelik

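I had the same problem. I am using a conda environment, so my packages are managed automatically by conda. I solved the issue by constraining the memory allocation of TensorFlow v2, Python 3.x:

import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

This solved my problem. However, it limits the memory a lot. When I simultaneously run

nvidia-smi

I see that it is about 700 MB. So, to see more options, one can inspect the code on TensorFlow's website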

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

In my case, the code snippet above solved the problem perfectly.

Note: I did not try installing TensorFlow with pip; this worked with TensorFlow installed by conda.

Ubuntu: 18.04

Python: 3.8.5

TensorFlow: 2.2.0

cudnn: 7.6.5

cudatoolkit: 10.1.243

Laurin Herbsthofer

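As Anurag Bhalekar observed in another answer, this can be worked around by setting up and running a model in your code before loading an old model with Keras's load_model(). This apparently initializes cuDNN correctly, which can then be used for load_model().

In my case, I use the Spyder IDE to run all my Python scripts. Specifically, I set up, train, and save a CNN in one script. After that, another script loads the saved model for visualization. If I open Spyder and directly run the visualization script to load an old, saved model, I get the same error as mentioned above. I am still able to load the model and modify it, but when I try to create a prediction, I get the error.

However, if I first run my training script in a Spyder instance and then run the visualization script in the same Spyder instance, it works fine without any errors: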

#training a model correctly initializes cuDNN
model=Sequential()
model.add(Conv2D(32,...))
model.add(Dense(num_classes,...))
model.compile(...)
model.fit() #this all works fine

Then the following code, including load_model(), works fine:

#this script relies on cuDNN already being initialized by the script above
from keras.models import load_model
model = load_model(modelPath) #works
model = Model(inputs=model.inputs, outputs=model.layers[1].output) #works
feature_maps = model.predict(img) #produces the error only if the first piece of code is not run

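I don't know why this is the case or how to solve the problem differently, but for me, training a small working Keras model before using load_model() is a quick and dirty fix that does not require reinstalling cuDNN or anything else.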

abdul

I faced the same problem; I think the GPU was not able to load all the data at once. I solved it by reducing the batch size.
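One way to find a batch size that fits is to halve it after each failure. The sketch below is generic Python, not a TensorFlow API: step stands for whatever callable runs your training or evaluation with a given batch size, and retry_on defaults to any Exception, which is an assumption (with TensorFlow you might narrow it to tf.errors.ResourceExhaustedError). Note that after a failed GPU allocation the process may still need a restart before a smaller batch works.

```python
def run_with_smaller_batches(step, batch_size, min_batch_size=1, retry_on=(Exception,)):
    """Call step(batch_size), halving batch_size after each failure until it succeeds."""
    while True:
        try:
            return batch_size, step(batch_size)
        except retry_on:
            if batch_size <= min_batch_size:
                raise  # already at the floor; surface the original error
            batch_size = max(min_batch_size, batch_size // 2)


def fake_step(batch_size):
    """Stand-in for a real training call: pretend anything above 2 does not fit."""
    if batch_size > 2:
        raise MemoryError("pretend the GPU ran out of memory")
    return "ok"


print(run_with_smaller_batches(fake_step, batch_size=8))  # (2, 'ok')
```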

Paktalin

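I struggled with this problem for a week. The reason was very silly: I was using high-resolution photos for training.

Hopefully this will save someone's time :)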
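A quick calculation shows why resolution matters so much. The sketch below computes the size of a single float32 feature map; the batch size of 8, the 64 channels, and the two resolutions are assumed figures for illustration, and a real network holds many such maps at once.

```python
def feature_map_bytes(batch, height, width, channels, bytes_per_value=4):
    """Memory for one float32 feature map of shape (batch, height, width, channels)."""
    return batch * height * width * channels * bytes_per_value


small = feature_map_bytes(batch=8, height=300, width=300, channels=64)
large = feature_map_bytes(batch=8, height=3000, width=3000, channels=64)

print(f"300x300:   {small / 2**20:.1f} MiB")  # 175.8 MiB
print(f"3000x3000: {large / 2**30:.1f} GiB")  # 17.2 GiB
print(f"ratio: {large // small}x")            # 100x
```

Ten times the side length means a hundred times the memory, so resizing images before training is often the cheapest fix.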

kHarshit

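This problem can also occur if you have an incompatible version of cuDNN, which may be the case if you installed TensorFlow with conda, since conda also installs CUDA and cuDNN while installing TensorFlow.

The solution is to install TensorFlow with pip and to install CUDA and cuDNN separately without conda. For example, if you have CUDA 10.0.130 and cuDNN 7.4.1 (tested configurations), then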

pip install tensorflow-gpu==1.13.1

AndrewPt

1) Close all other notebooks that use the GPU

2) TF 2.0 needs cuDNN SDK (>= 7.4.1)

Extract it and add the path to its 'bin' folder to "Environment Variables / System variables / Path": "D:\Programs\x64\Nvidia\cudnn\bin"
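To confirm that the PATH change took effect, you can search the directories on PATH for the cuDNN library. This is a minimal sketch with an invented helper name; the default pattern cudnn*.dll assumes Windows (on Linux the file is a libcudnn shared object, so pass a different pattern there).

```python
import glob
import os


def find_cudnn_libraries(path_value=None, pattern="cudnn*.dll"):
    """Return cuDNN library files found in the directories listed in a PATH string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    found = []
    for directory in path_value.split(os.pathsep):
        if directory:  # skip empty entries such as a trailing separator
            # glob.escape keeps characters like '[' in directory names literal.
            found.extend(glob.glob(os.path.join(glob.escape(directory), pattern)))
    return found


print(find_cudnn_libraries() or "no cuDNN library found on PATH")
```

Remember that a new PATH value is only visible to processes started after the change, so restart the terminal or Jupyter first.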

Vasco Cansado Carvalho

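I had the same problem, but with a simpler solution than the others posted here. I have both CUDA 10.0 and 10.2 installed, but I only had cuDNN for 10.2, and that version [at the time of this post] is not compatible with TensorFlow GPU. I just installed cuDNN for CUDA 10.0 and now everything runs fine!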

Sivakumar D

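Workaround: I did a fresh install of TF 2.0 and ran a simple MNIST tutorial with no problem. Then I opened another notebook, tried to run it, and hit this issue. I exited all notebooks, restarted Jupyter, opened only one notebook, and it ran successfully. The issue seems to be either memory or running more than one notebook on the GPU.

Thanks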

BenedictGrain

I had the same problem as you. My configuration was TensorFlow 1.13.1, CUDA 10.0, cuDNN 7.6.4. I tried changing the cuDNN version to 7.4.2 and, luckily, that solved the problem.

DEEPAK S.V.

Enabling memory growth on the GPU at the start of my code solved the problem:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
print("Num GPUs Available: ", len(physical_devices))
tf.config.experimental.set_memory_growth(physical_devices[0], True)

Num GPUs Available:  1

Reference: https://deeplizard.com/learn/video/OO4HD-1wRN8

高鵬翔

Add the following lines of code at the start of your notebook or code

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')

tf.config.experimental.set_memory_growth(physical_devices[0], True)

Jensun Ravichandran

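I had a similar problem. TensorFlow complained that it expected a certain version of cuDNN but did not find it. So I downloaded the version it expected from https://developer.nvidia.com/rdp/cudnn-archive and installed it. Now it works.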

dpacman

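If you installed tensorflow-gpu using Conda, then uninstall the cudnn and cudatoolkit packages that were installed along with it and re-run the notebook.

NOTE: Trying to uninstall only these two packages in conda would force-uninstall a chain of other packages as well. So use the following commands to uninstall only these packages

(1) To remove CUDA

conda remove --force cudatoolkit

(2) To remove cuDNN

conda remove --force cudnn

Now run TensorFlow, it should work!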

J B

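Not having any rep, I can't add this as a comment to the two existing answers from Anurag and Obnebion, nor can I upvote them, so I am writing a new answer even though it seems to break the guidelines. Anyway, I originally had the problem that the other answers on this page address, and fixed it, but then ran into the same message again later when I started using checkpoint callbacks. At that point, only the Anurag/Obnebion answer was relevant. It turned out I had originally been saving the model as .json and the weights separately as .h5, then using model_from_json along with a separate model.load_weights to get the weights back. That worked (I have CUDA 10.2 and TensorFlow 2.x). It only broke when I tried to switch to the all-in-one save/load_model from the checkpoint callback. This is the small change I made to keras.callbacks.ModelCheckpoint in the _save_model method: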

                            if self.save_weights_only:
                                self.model.save_weights(filepath, overwrite=True)
                            else:
                                model_json = self.model.to_json()
                                with open(filepath+'.json','w') as fb:
                                    fb.write(model_json)
                                    fb.close()
                                self.model.save_weights(filepath+'.h5', overwrite=True)
                                with open(filepath+'-hist.pickle','wb') as fb:
                                    trainhistory = {"history": self.model.history.history,"params": self.model.history.params}
                                    pickle.dump(trainhistory,fb)
                                    fb.close()
                                # self.model.save(filepath, overwrite=True)

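The history pickle dump is just a kludge for yet another Stack Overflow problem, namely what happens to the history object when you exit early from a checkpoint callback. You can see in the _save_model method that there is a line that pulls the loss monitor array out of the logs dict... but never writes it to a file! So I just put in the kludge accordingly. Most people don't recommend using pickles like this. My code is just a hack, so it doesn't matter.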

Anurag Bhalekar

The library seems to need some warm-up. This is not a valid solution for production, but you can at least move on to other bugs...

from keras.models import Sequential
import numpy as np
from keras.layers import Dense
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
model = Sequential()
model.add(Dense(1000,input_dim=(784),activation='relu') )  #input layer
model.add(Dense(222,activation='relu'))                     #hidden layer
model.add(Dense(100,activation='relu'))   
model.add(Dense(50,activation='relu'))   
model.add(Dense(10,activation='sigmoid'))   
model.compile(optimizer="adam",loss='categorical_crossentropy',metrics=["accuracy"])
x_train = np.reshape(x_train,(60000,784))/255
x_test = np.reshape(x_test,(10000,784))/255
from keras.utils import np_utils
y_train = np_utils.to_categorical(y_train) 
y_test = np_utils.to_categorical(y_test)
model.fit(x_train[:1000],y_train[:1000],epochs=1,batch_size=32)

k.akash

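Just install TensorFlow with GPU support using this command: pip install tensorflow; you do not need to install the GPU package separately. If you install the GPU package separately, there is a good chance that the versions will not match.

For releases 1.15 and older, however, the CPU and GPU packages are separate.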

Ivan

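I worked on this for a while on an AWS Ubuntu instance.

Then I found the solution, which was quite simple in this case.

Do not install tensorflow-gpu with pip (pip install tensorflow-gpu); install it with conda (conda install tensorflow-gpu) so that it lives in the conda environment and installs cudatoolkit and cudnn in the right environment.

That worked for me, saved my day, and I hope it helps someone else.

See the original solution from learnermaxRL here: https://github.com/tensorflow/tensorflow/issues/24828#issuecomment-453727142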

future

If you are Chinese, make sure your working path does not contain Chinese characters, and keep making your batch_size smaller. Thanks!
