@liyi-1989
Created July 31, 2014 20:00
    {
    "metadata": {
    "name": "",
    "signature": "sha256:65aa7ff8ea053a7d34e0d7496f676d92274ad6a7602765834e8e441d7a1d2f7b"
    },
    "nbformat": 3,
    "nbformat_minor": 0,
    "worksheets": [
    {
    "cells": [
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "# Working with HDF5 files in Python\n",
    "\n",
    "## 1. Introduction\n",
    "\n",
    "Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5), together with an open-source library, designed to store and organize large amounts of numerical data. It was originally developed at NCSA.\n",
    "\n",
    "In Python, you can use the **h5py** package to read and edit HDF5 files. For installation issues, please consult the [h5py build instructions](http://docs.h5py.org/en/2.3/build.html). If you are new to Python, you can install [Anaconda](http://continuum.io/downloads), which contains this package and many other commonly used packages.\n",
    "\n",
    "\n",
    "An HDF5 file works much like a file system that stores data. It has only two kinds of objects: the **group** and the **dataset**. A group is like a folder in a file system, while a dataset stores data such as a NumPy array.\n",
    "\n",
    "Datasets are addressed inside the HDF5 file by paths similar to those of a regular file system: `/Folder/SubFolder/DataName`.\n",
    "\n",
    "## 2. HDF5 in Python\n",
    "\n",
    "Let us assume that h5py is already installed on your computer. We will now see how to work with the h5py module."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "import numpy as np\n",
    "import h5py"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [],
    "prompt_number": 1
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
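    "We can create an HDF5 file object with the `h5py.File()` function. The mode can be \"r\" (read-only), \"w\" (create the file, truncating it if it already exists), or \"a\" (read/write, creating the file if it does not exist). In the h5py 2.x releases this notebook was written for, the default is \"a\"; since h5py 3.0 the default is \"r\", so it is safer to pass the mode explicitly."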
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "myfile = h5py.File(\"ex1.hdf5\", \"a\")"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [],
    "prompt_number": 2
    },
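    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "As a minimal sketch of the modes described above, the file can also be opened in a `with` block, so that it is closed automatically even if an error occurs. The file name `sketch_modes.hdf5` and the temporary directory are placeholders chosen for illustration; they are not part of the original example."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "import os\n",
    "import tempfile\n",
    "import h5py\n",
    "\n",
    "# Hypothetical file name in a temporary directory, used only for this sketch.\n",
    "path = os.path.join(tempfile.mkdtemp(), \"sketch_modes.hdf5\")\n",
    "\n",
    "# \"w\" creates the file (truncating any existing one).\n",
    "with h5py.File(path, \"w\") as f:\n",
    "    f.create_group(\"grp1\")\n",
    "\n",
    "# \"r\" opens it read-only; the file is closed when the block ends.\n",
    "with h5py.File(path, \"r\") as f:\n",
    "    names = list(f.keys())\n",
    "\n",
    "print(names)"
    ],
    "language": "python",
    "metadata": {},
    "outputs": []
    },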
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### 2.1 Creating groups\n",
    "\n",
    "So far we have only created an empty HDF5 file, `myfile`. We need to add some elements to it. For example, we can use the `myfile.create_group()` method to create a new group (or \"folder\")."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "myfile.create_group(\"grp1\")"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 3,
    "text": [
    "<HDF5 group \"/grp1\" (0 members)>"
    ]
    }
    ],
    "prompt_number": 3
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "You can also assign the newly created group to a variable."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "group2=myfile.create_group(\"grp2\")"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [],
    "prompt_number": 4
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "For a group object, you can use the `keys()` method to get the names of the objects it contains. Under Python 3, wrap the call in `list()` to see the names as a plain list."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "myfile.keys()"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 5,
    "text": [
    "[u'grp1', u'grp2']"
    ]
    }
    ],
    "prompt_number": 5
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "Moreover, we can create a subgroup by calling the same method on `group2`."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "s1=group2.create_group(\"subgroup1\")\n",
    "group2.keys()"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 6,
    "text": [
    "[u'subgroup1']"
    ]
    }
    ],
    "prompt_number": 6
    },
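    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "Because objects are addressed like file-system paths, a nested group can also be created and reached through a single path string. The following is a small sketch, assuming a scratch file named `sketch_paths.hdf5` (a placeholder, not part of the original example)."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "import os\n",
    "import tempfile\n",
    "import h5py\n",
    "\n",
    "# Hypothetical scratch file, used only for this sketch.\n",
    "path = os.path.join(tempfile.mkdtemp(), \"sketch_paths.hdf5\")\n",
    "\n",
    "with h5py.File(path, \"w\") as f:\n",
    "    # The missing intermediate group \"grp2\" is created automatically.\n",
    "    sub = f.create_group(\"grp2/subgroup1\")\n",
    "    # The same object can be reached by its full path from the file root.\n",
    "    same = f[\"/grp2/subgroup1\"]\n",
    "    print(sub.name)\n",
    "    print(same.name)\n",
    "    # Membership tests work like they do for a dictionary.\n",
    "    print(\"subgroup1\" in f[\"grp2\"])"
    ],
    "language": "python",
    "metadata": {},
    "outputs": []
    },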
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### 2.2 Creating data\n",
    "\n",
    "Now it is time to put some data in the group. We can create a dataset by assignment, just like adding an entry to a Python dictionary."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "s1[\"data1\"]=np.arange(0,10)\n",
    "s1[\"data1\"]"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 7,
    "text": [
    "<HDF5 dataset \"data1\": shape (10,), type \"<i4\">"
    ]
    }
    ],
    "prompt_number": 7
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
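    "The stored data can be read back as a NumPy array by indexing the dataset with `[()]`. Older h5py releases also offered a `.value` attribute for this, but it was removed in h5py 3.0."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "s1[\"data1\"][()]"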
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 8,
    "text": [
    "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
    ]
    }
    ],
    "prompt_number": 8
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "Note that the dataset object can be used directly in calculations."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "np.sum(s1[\"data1\"])"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 9,
    "text": [
    "45"
    ]
    }
    ],
    "prompt_number": 9
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "s1[\"data1\"][2]==2"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 10,
    "text": [
    "True"
    ]
    }
    ],
    "prompt_number": 10
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "Also, we can use the `create_dataset()` method to create a new dataset."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "s1.create_dataset(\"data2\",(3,5),np.int32)"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 11,
    "text": [
    "<HDF5 dataset \"data2\": shape (3, 5), type \"<i4\">"
    ]
    }
    ],
    "prompt_number": 11
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "s1[\"data2\"][()]"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 12,
    "text": [
    "array([[0, 0, 0, 0, 0],\n",
    " [0, 0, 0, 0, 0],\n",
    " [0, 0, 0, 0, 0]])"
    ]
    }
    ],
    "prompt_number": 12
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "s1.create_dataset(\"data3\",data=np.arange(15))\n",
    "s1[\"data3\"][()]"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 13,
    "text": [
    "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])"
    ]
    }
    ],
    "prompt_number": 13
    },
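    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "Datasets also support NumPy-style slicing, which reads or writes only the selected part of the data. The following is a brief sketch; the file name `sketch_slices.hdf5` is a placeholder chosen for illustration."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "import os\n",
    "import tempfile\n",
    "import numpy as np\n",
    "import h5py\n",
    "\n",
    "# Hypothetical scratch file, used only for this sketch.\n",
    "path = os.path.join(tempfile.mkdtemp(), \"sketch_slices.hdf5\")\n",
    "\n",
    "with h5py.File(path, \"w\") as f:\n",
    "    # A new dataset created from a shape is filled with zeros.\n",
    "    dset = f.create_dataset(\"data2\", (3, 5), dtype=\"i4\")\n",
    "    dset[0, :] = np.arange(5)  # write one row\n",
    "    dset[1:, 2] = 7            # write part of a column\n",
    "    first_row = dset[0, :]     # read back only the first row\n",
    "    whole = dset[()]           # read the whole dataset as an array\n",
    "\n",
    "print(first_row)\n",
    "print(whole)"
    ],
    "language": "python",
    "metadata": {},
    "outputs": []
    },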
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### 2.3 Deleting groups and datasets\n",
    "\n",
    "You can use the `del` keyword to delete a group or a dataset, just like removing an entry from a dictionary."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "s1.keys()"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 14,
    "text": [
    "[u'data1', u'data2', u'data3']"
    ]
    }
    ],
    "prompt_number": 14
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "del s1[\"data3\"]\n",
    "s1.keys()"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 15,
    "text": [
    "[u'data1', u'data2']"
    ]
    }
    ],
    "prompt_number": 15
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 3. Save as CSV file\n",
    "\n",
    "If you want to save a dataset from the HDF5 file as a CSV file, you can use the **csv** module in Python. For example, we create a 5 by 5 matrix under `s1`, and then use `csv.writer()` and `.writerows()` to write the CSV file."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "import csv\n",
    "\n",
    "s1[\"data4\"]=np.random.rand(5,5)\n",
    "\n",
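    "# Python 3: open in text mode with newline='' as the csv module recommends;\n",
    "# the with block closes the file automatically.\n",
    "with open('csv_test.csv', 'w', newline='') as csvfile:\n",
    "    writer = csv.writer(csvfile)\n",
    "    writer.writerows(s1[\"data4\"][()])"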
    ],
    "language": "python",
    "metadata": {},
    "outputs": [],
    "prompt_number": 16
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "myfile.close()"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [],
    "prompt_number": 17
    },
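    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "As an alternative sketch, NumPy's `np.savetxt()` can write a two-dimensional dataset to a CSV file in one call, and `np.loadtxt()` reads it back. The file names `sketch_csv.hdf5` and `sketch_csv.csv` are placeholders, not part of the original example."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "import os\n",
    "import tempfile\n",
    "import numpy as np\n",
    "import h5py\n",
    "\n",
    "# Hypothetical scratch paths, used only for this sketch.\n",
    "tmpdir = tempfile.mkdtemp()\n",
    "h5_path = os.path.join(tmpdir, \"sketch_csv.hdf5\")\n",
    "csv_path = os.path.join(tmpdir, \"sketch_csv.csv\")\n",
    "\n",
    "with h5py.File(h5_path, \"w\") as f:\n",
    "    f[\"data4\"] = np.random.rand(5, 5)\n",
    "    # Read the dataset into memory, then write it as comma-separated text.\n",
    "    np.savetxt(csv_path, f[\"data4\"][()], delimiter=\",\")\n",
    "\n",
    "# Load the CSV back to check that the shape survived the round trip.\n",
    "loaded = np.loadtxt(csv_path, delimiter=\",\")\n",
    "print(loaded.shape)"
    ],
    "language": "python",
    "metadata": {},
    "outputs": []
    },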
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 4. Reference\n",
    "\n",
    "- [**h5py.org**](http://docs.h5py.org/en/2.3/index.html)\n",
    "\n",
    "- [CSV package in Python](https://docs.python.org/2/library/csv.html)"
    ]
    }
    ],
    "metadata": {}
    }
    ]
    }