
How to append data to one specific dataset in an HDF5 file with h5py

I am looking for a way to append data to an existing dataset inside a .h5 file using Python (h5py).

A short intro to my project: I am trying to train a CNN on medical image data. Because of the huge amount of data and the heavy memory usage during the transformation of the data to NumPy arrays, I needed to split the transformation into a few chunks: load and preprocess the first 100 medical images and save the NumPy arrays to an HDF5 file, then load the next 100 images and append them to the existing .h5 file, and so on.

Now, I tried to store the first 100 transformed NumPy arrays as follows:

import h5py
from LoadIPV import LoadIPV

X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()

with h5py.File('./PreprocessedData.h5', 'w') as hf:
    hf.create_dataset("X_train", data=X_train_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("X_test", data=X_test_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("Y_train", data=Y_train_data, maxshape=(None, 512, 512, 1))
    hf.create_dataset("Y_test", data=Y_test_data, maxshape=(None, 512, 512, 1))

As can be seen, the transformed NumPy arrays are split into four different "groups" that are stored in the four HDF5 datasets [X_train, X_test, Y_train, Y_test]. The LoadIPV() function performs the preprocessing of the medical image data.

My problem is that I would like to store the next 100 NumPy arrays in the same .h5 file, in the existing datasets: that is, I would like to append the next 100 NumPy arrays to, for example, the existing X_train dataset of shape [100, 512, 512, 9], so that X_train ends up with shape [200, 512, 512, 9]. The same should work for the other three datasets X_test, Y_train and Y_test.


nbro

I have found a solution that seems to work!

Have a look at this: incremental writes to hdf5 with h5py!

In order to append data to a specific dataset, it is necessary to first resize the dataset along the corresponding axis and then write the new data at the end of the (now larger) dataset.

Thus, the solution looks like this:

with h5py.File('./PreprocessedData.h5', 'a') as hf:
    hf["X_train"].resize(hf["X_train"].shape[0] + X_train_data.shape[0], axis=0)
    hf["X_train"][-X_train_data.shape[0]:] = X_train_data

    hf["X_test"].resize(hf["X_test"].shape[0] + X_test_data.shape[0], axis=0)
    hf["X_test"][-X_test_data.shape[0]:] = X_test_data

    hf["Y_train"].resize(hf["Y_train"].shape[0] + Y_train_data.shape[0], axis=0)
    hf["Y_train"][-Y_train_data.shape[0]:] = Y_train_data

    hf["Y_test"].resize(hf["Y_test"].shape[0] + Y_test_data.shape[0], axis=0)
    hf["Y_test"][-Y_test_data.shape[0]:] = Y_test_data

However, note that the dataset must be created with None in the maxshape entry for every axis you want to extend, for example

h5f.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True, maxshape=(None,))

for a one-dimensional dataset; for the four-dimensional datasets above you would use maxshape=(None, 512, 512, 9). Otherwise the dataset cannot be extended.
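
The same resize-then-write pattern can be factored into a small helper. A minimal sketch (the append_to_dataset name is hypothetical, and it assumes every dataset was created with None in maxshape along axis 0, as above; X_train_data etc. come from LoadIPV() in the question):

import h5py

def append_to_dataset(dset, new_data):
    # Grow the dataset along axis 0, then write new_data into the new slots
    old_size = dset.shape[0]
    dset.resize(old_size + new_data.shape[0], axis=0)
    dset[old_size:] = new_data

with h5py.File('./PreprocessedData.h5', 'a') as hf:
    for name, arr in [("X_train", X_train_data), ("X_test", X_test_data),
                      ("Y_train", Y_train_data), ("Y_test", Y_test_data)]:
        append_to_dataset(hf[name], arr)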


For this to work, you also need to make sure you set the maxshape argument when creating the dataset, or h5py won't let you extend it.
Just to be super clear on how to create the dataset in the first place, here's what it would look like: h5f.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True, maxshape=(None,)). The key part is setting maxshape to a tuple with None along the axis you want to grow.
When you create a dataset with a particular compression and compression level, does newly appended data also get the same compression level?
What is the purpose of using axis=0? For me it returns SyntaxError: invalid syntax.
I would need to measure the difference in performance, but I think the usual way of resizing arrays is to append to them until they are full, then resize to twice the current length, to avoid too many resize calls. It's probably only needed for apps that write new data in real time, though.
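
For what it's worth, here is a rough sketch of that doubling strategy (file and dataset names are illustrative, not from the answers above; it keeps a separate count of valid rows and trims the excess capacity once at the end):

import numpy as np
import h5py

with h5py.File('stream.h5', 'w') as f:
    # Start empty, with an unlimited first axis and explicit chunking
    dset = f.create_dataset('data', shape=(0, 64), maxshape=(None, 64),
                            chunks=(128, 64))
    n_rows = 0  # rows actually written so far (allocated capacity may be larger)

    for _ in range(1000):
        sample = np.random.rand(64)  # stand-in for one real-time record
        if n_rows == dset.shape[0]:
            # Capacity exhausted: double it to amortize the resize calls
            dset.resize(max(1, 2 * dset.shape[0]), axis=0)
        dset[n_rows] = sample
        n_rows += 1

    # Trim the unused capacity once, at the end
    dset.resize(n_rows, axis=0)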
rvimieiro

@Midas.Inc's answer works great. Just to provide a minimal working example for those who are interested:

import numpy as np
import h5py

f = h5py.File('MyDataset.h5', 'a')
for i in range(10):

  # Data to be appended
  new_data = np.ones(shape=(100,64,64)) * i
  new_label = np.ones(shape=(100,1)) * (i+1)

  if i == 0:
    # Create the dataset at first
    f.create_dataset('data', data=new_data, compression="gzip", chunks=True, maxshape=(None,64,64))
    f.create_dataset('label', data=new_label, compression="gzip", chunks=True, maxshape=(None,1)) 
  else:
    # Append new data to it
    f['data'].resize((f['data'].shape[0] + new_data.shape[0]), axis=0)
    f['data'][-new_data.shape[0]:] = new_data

    f['label'].resize((f['label'].shape[0] + new_label.shape[0]), axis=0)
    f['label'][-new_label.shape[0]:] = new_label

  print("I am on iteration {} and 'data' chunk has shape:{}".format(i,f['data'].shape))

f.close()

The code outputs:

#I am on iteration 0 and 'data' dataset has shape: (100, 64, 64)
#I am on iteration 1 and 'data' dataset has shape: (200, 64, 64)
#I am on iteration 2 and 'data' dataset has shape: (300, 64, 64)
#I am on iteration 3 and 'data' dataset has shape: (400, 64, 64)
#I am on iteration 4 and 'data' dataset has shape: (500, 64, 64)
#I am on iteration 5 and 'data' dataset has shape: (600, 64, 64)
#I am on iteration 6 and 'data' dataset has shape: (700, 64, 64)
#I am on iteration 7 and 'data' dataset has shape: (800, 64, 64)
#I am on iteration 8 and 'data' dataset has shape: (900, 64, 64)
#I am on iteration 9 and 'data' dataset has shape: (1000, 64, 64)
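
To double-check the result, the file from the example can be reopened read-only; a minimal sketch:

import h5py

with h5py.File('MyDataset.h5', 'r') as f:
    print(f['data'].shape)        # (1000, 64, 64)
    print(f['label'].shape)       # (1000, 1)
    # Compression is a property of the whole dataset, so appended
    # rows are compressed with the same settings as the original ones
    print(f['data'].compression)  # gzip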