- You will eventually want to read and write data, too
- Badly implemented I/O can easily take 99% of your run-time!
- plain text is portable, but
- horribly space wasting (unless compressed, which then harms portability)
- horribly slow, as everything needs to be converted between text and binary
- unless your actual calculations are done using letters
- cannot really be parallelised, so in parallel it is even slower
- some file formats will be hard or impossible to open on another computer so
- avoid machine dependent I/O, like
- The unformatted IO in Fortran
- unlikely to be efficient
- likely to be very hard to open on another machine or Fortran library
- C's fwrite(data, size, count, file)
- with the right size and count this can be very efficient for a single-threaded application
- but not very portable: few data types are guaranteed to be equivalent across machines, and byte-order can also change
- some good libraries for portable I/O
- HDF5 is the de facto high-performance data format and very standardised
- cross-language, cross-platform: available in every imaginable system and language!
- will not load the file to memory unless it is sliced or altered: useful for processing files which are too large for your memory if you do not need to alter the data
- netcdf4 is also quite efficient (in fact it’s HDF5 underneath)
- sometimes numpy's save, savez and savez_compressed are ok
- remember to pass allow_pickle=False to maintain portability
- when loading, pass the mmap_mode parameter to use memory-mapped I/O: useful for large files when only part of the file is needed (see the sketch below)
- but have a look at Blaze, too
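- a minimal sketch of this numpy route; the file name example_data.npy and the array contents are made up for the example
import numpy
data = numpy.random.random((1000, 8))            # some example data
# allow_pickle=False keeps the file a plain, portable array dump
numpy.save("example_data.npy", data, allow_pickle=False)
# mmap_mode="r" maps the file instead of reading it all into memory,
# so only the parts you actually touch get loaded
ondisk = numpy.load("example_data.npy", mmap_mode="r")
print(ondisk[:10].mean())                        # touches only the first 10 rows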
- suppose you have a file like this
# This file contains a bunch of point particles of varying locations, speeds,
# masses and charges
X    Y    Z    Vx   Vy   Vz   mass    charge
0    0    0    0    0    0.1  100.0   0.0
1.0  0    4    0    10.0 0    10.0    -1.0
0    2    0    0.1  0    0.1  1000.0  0.0
2    2    -3   0.1  0    0.1  100.0   1.0
# A comment line
- you can import this with numpy easily
<<genfromtxt_example_import_importer>>
print(data)
- please see help(numpy.genfromtxt) for full documentation: the function is capable of ingesting almost any kind of textual data, even strings!
- but it is not fast, no text import ever is
- we’ll later show how to plot this
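- the <<genfromtxt_example_import_importer>> placeholder above stands in for the actual importer in the repo; a minimal sketch of such an import, mirroring the genfromtxt call used in the plotting example later in this section, looks like this
import numpy
# skip the two comment lines and the header row, ignore trailing "#" comments
data = numpy.genfromtxt("files/genfromtxt_example_data.txt",
                        comments="#", delimiter="\t", skip_header=3)
print(data)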
- HDF5 is a hierarchical data format, you can think of it as a file system of a sort, but inside the file:
- data lives in a dataset, of any dimensionality and various types (integer, float, double…)
- metadata lives in an attribute
- there can be any number of both
- a group can be used to group them
- h5py exposes datasets as dicts of {datasetname: datasetvalues} and attributes through the dict-like attrs property; groups are also dicts whose values are the group's members (usually datasets, i.e. a dict within a dict)
- the dataset looks and feels like a numpy array, except
- it can only be resized if declared resizable
- it is effectively a memory-mapped array (more on that in an exercise)
import h5py
import numpy
f=h5py.File(<<h5py_read_example_filename>>,"r")
g = f["my_grid_data"]
v = f["my_vector_data"]
print("Shapes")
print(g.shape, v.shape)
print("Maximum of coordinate values")
print(g[:].max())
print("Vector norms squared")
print(numpy.einsum('i...,i...', v, v))
Shapes
(2, 2, 2) (3, 2, 2)
Maximum of coordinate values
11.0
Vector norms squared
[[ 1.04592874  0.25349788]
 [ 0.33607341  0.86112512]]
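- for reference, a file like the one read above could have been written along these lines; this is only a sketch: the file name, data values, attribute and group are made up, and only the dataset names and shapes mirror the read example
import h5py
import numpy
f = h5py.File("my_data.h5", "w")                     # file name is an assumption
# datasets of any shape and dtype; names and shapes mirror the read example
f.create_dataset("my_grid_data", data=numpy.random.random((2, 2, 2)))
f.create_dataset("my_vector_data", data=numpy.random.random((3, 2, 2)))
# metadata goes into attributes, here attached to the root group (the file itself)
f.attrs["description"] = "example data"
# groups collect related datasets together, like directories in a file system
g = f.create_group("my_group")
g.create_dataset("more_data", data=numpy.arange(10))
f.close()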
- latency is more likely to ruin performance than anything else, so unless you know exactly where the I/O bottleneck is, do big writes into big files, even buffering internally in your code if necessary
- and big writes really means big: a 10 MB write is not a big write, let alone a big file!
- unfortunately, python is not very good at demonstrating this, but you can try to compile and run this (available in codes/cpp/chunk_size_effect.c)
// This file is generated by org-mode, please do not edit
#define _GNU_SOURCE 1
#define _POSIX_C_SOURCE 200809L
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define SIZE 1000*1000*100
int main(int argc, char *argv[]) {
char *file1, *file2;
if (argc != 3) {
// please note this is UNSAFE: if such files exist, they will be overwritten
file1 = "testfile1";
file2 = "testfile2";
} else {
file1 = argv[1];
file2 = argv[2];
}
int fd1 = open(file1, O_WRONLY|O_TRUNC|O_CREAT, S_IRUSR|S_IWUSR);
int fd2 = open(file2, O_WRONLY|O_TRUNC|O_CREAT, S_IRUSR|S_IWUSR);
double *data = (double *) calloc(SIZE, sizeof(double));
struct timespec t1, t2, t3;
clock_gettime(CLOCK_MONOTONIC, &t1);
for (int i=0; i<SIZE; i++) {
write(fd1, data+i, sizeof(double)*1);
}
clock_gettime(CLOCK_MONOTONIC, &t2);
write(fd2, data, sizeof(double)*SIZE);
clock_gettime(CLOCK_MONOTONIC, &t3);
printf("Writing one element at a time took %6li seconds\n", t2.tv_sec-t1.tv_sec);
printf("Writing all elements at once took %6li seconds\n", t3.tv_sec-t2.tv_sec);
close(fd1);
close(fd2);
return 0;
}
Writing one element at a time took     56 seconds
Writing all elements at once took       0 seconds
- Performant I/O is a bit of a dark magic: there are loads of caches on the way from memory to disc, and only the limit as file size goes to infinity will measure true I/O speed
- in the above case, my laptop gives 71 and 2 seconds, but 2 s corresponds to 4 times the theoretical maximum speed of the disc, so the data clearly went into a cache rather than onto the disc!
- Even more of a dark magic as the disc, unlike the CPU, is a shared resource: other users use the same discs
- always use parallel I/O for parallel programs
- poor man’s parallel I/O
- every worker writes its own file
- can be the fastest solution
- but how do you use those files with a different number of workers, e.g. for post-processing?
- MPI I/O or MPI-enabled HDF5 library deal with that
- they can write a single file simultaneously from all workers
- may do some hardware-based optimisations behind the scenes
- can also map the writes to the MPI topology
- involves a bit of a learning curve, unless you choose to use h5py or some other library like it which handles the complexity for you (see the sketch below)
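- a minimal sketch of what collective writing to a single shared file looks like with mpi4py and an MPI-enabled h5py build; the file and dataset names are made up
from mpi4py import MPI
import h5py
comm = MPI.COMM_WORLD
# requires h5py built against a parallel (MPI) HDF5 library
f = h5py.File("parallel_example.h5", "w", driver="mpio", comm=comm)
dset = f.create_dataset("data", (comm.size, 100), dtype="f8")
dset[comm.rank, :] = comm.rank    # every rank writes its own slice of one shared file
f.close()
# run with e.g.: mpirun -np 4 python this_script.py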
- Your code should be able to checkpoint and restart on its own, to support solving the problem by running the code several times: it is often not possible to obtain access to a computer for long enough to solve the problem in one go.
- Basically, you save your current iterate or best estimate of the solution and later load it from file instead of using random or hard-coded initial conditions.
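- a minimal sketch of this checkpoint/restart pattern using numpy; the file name, the shape of the state and the update step are made up for illustration
import os
import numpy
checkpoint = "checkpoint.npy"                    # file name is an assumption
if os.path.exists(checkpoint):
    state = numpy.load(checkpoint)               # restart from the previous run
else:
    state = numpy.random.random((1000, 1000))    # fresh initial conditions
for iteration in range(100):
    state = 0.5 * (state + state.mean())         # stand-in for the real update step
numpy.save(checkpoint, state)                    # save so the next run can continue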
Unfortunately we cannot speed-test these easily, but try at least
- On your own
- numpy functions
- h5py
- Sometimes your file is too big to load into memory, memmap is then your friend.
- Files which have been memmapped are only loaded into memory a small chunk at a time, as needed
- But they look like normal files to whoever is using them
- Use h5py's memmap mode and numpy's memmap mode to process the file you saved above (it does not matter what you do with it, perhaps just add one); see the sketch below
- nothing in your code would change if you needed to process the largest file in the world
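- a minimal sketch of the numpy side, assuming the example_data.npy file from the earlier numpy sketch (any .npy file will do)
import numpy
# open the file memory-mapped in read-write mode: only the parts we touch are loaded
data = numpy.load("example_data.npy", mmap_mode="r+")
# process it in chunks, here just adding one, and flush the changes back to disc
for start in range(0, data.shape[0], 100):
    data[start:start+100] += 1
data.flush()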
- The matplotlib python package is terribly good, but cannot do Big Data as it is not distributed
- has extensive documentation at the matplotlib homepage
- It’s also not properly parallel so it can often be slow
- But it is
- easy
- interactive
- if you only need to plot a subset of your data (e.g. 2D slice of 3D data) it might scale well enough
- please note that interactivity over the network will be laggy; we show how it works anyway
- the following “ipython magic” is only needed to embed the output in the ipython/jupyter notebook
- it needs to be done once per python session, so please always execute this cell even if you only want to look at a single later example
- this isn’t required for people running python inline/from a file
%matplotlib notebook
import pylab, numpy
x = numpy.mgrid[-5:5:100j]
pylab.plot(x, x**2, "b-", label=r"$x^2$")
pylab.legend()
<<pylab_plot_example_export>>
- in this example we use the file we created earlier, files/genfromtxt_example_data.txt, and save the plot to another file called files/genfromtxt_example_plot.png
infile = "files/genfromtxt_example_data.txt"
oufile = "files/genfromtxt_example_plot.png"
import numpy
import matplotlib
import matplotlib.pyplot
from mpl_toolkits.mplot3d import Axes3D
def randrange(n, vmin, vmax):
return (vmax - vmin)*numpy.random.rand(n) + vmin
data = numpy.genfromtxt(infile, comments="#", delimiter="\t", skip_header=3)
fig = matplotlib.pyplot.figure()
ax = fig.add_subplot(111, projection='3d')
n = data.shape[0]
# plot a sphere for each particle
# colour charged particles red (charge>0), blue (charge<0) and neutrals green
blues = data[data[:,7]<0]
reds = data[data[:,7]>0]
greens=data[numpy.logical_not(numpy.logical_or(data[:,7]<0,data[:,7]>0))]
ax.scatter(blues[:,0], blues[:,1], blues[:,2], c="b", edgecolors="face",
marker="o", s=blues[:,6])
ax.scatter(reds[:,0], reds[:,1], reds[:,2], c="r", edgecolors="face",
marker="o", s=reds[:,6])
ax.scatter(greens[:,0], greens[:,1], greens[:,2], c="g", edgecolors="face",
marker="o", s=greens[:,6])
ax.quiver(blues[:,0], blues[:,1], blues[:,2], blues[:,3], blues[:,4],
blues[:,5], pivot="tail")
ax.quiver(reds[:,0], reds[:,1], reds[:,2], reds[:,3], reds[:,4],
reds[:,5], pivot="middle")
ax.quiver(greens[:,0], greens[:,1], greens[:,2], greens[:,3], greens[:,4],
greens[:,5], pivot="tip")
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
matplotlib.pyplot.savefig(oufile)
print(oufile, end="")
- we did not use anything advanced, except matplotlib's builtin latex capability, but it provides full control of the whole canvas and image window
- matplotlib has a rudimentary animation capability as well
- ParaView is better at this, and matplotlib will not be able to create beautiful complex animations
- but it can do simple ones
- and it can be used to generate lots of frames for a video
- but unless you need something matplotlib-frontend specific, just using the file-writing backend directly, without plotting on screen, is much faster
- in both cases you can convert to video like
ffmpeg -f image2 -pattern_type glob -framerate 25 -i 'testanimationsaveframe_*.png' -s 800x600 foo.mkv
- or illustrate how an algorithm works, see exercises!
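- a minimal sketch of the frame-writing approach, using the file-only Agg backend so nothing is drawn on screen; the plotted data is arbitrary and the frame file names are chosen to match the ffmpeg glob above
import matplotlib
matplotlib.use("Agg")                     # file-writing backend only, no window
import matplotlib.pyplot
import numpy
x = numpy.linspace(0, 2 * numpy.pi, 200)
fig, ax = matplotlib.pyplot.subplots()
for frame in range(100):
    ax.clear()
    ax.plot(x, numpy.sin(x + 0.1 * frame))
    # zero-padded names so the ffmpeg glob above picks the frames up in order
    fig.savefig("testanimationsaveframe_{:04d}.png".format(frame))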
- here’s an example with all the important bits:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
plt.ion()
def data_gen(t=0):
'''A generator function, which must use "yield" as all generators do,
to produce results one frame at a time. In this example, the "run"
function will actually remember/save data for previous frames so
we get away with generating just the new data. Whatever we return
will be passed as the sole argument to "run".'''
cnt = 0
while cnt < 1000:
cnt += 1
t += 0.1
yield t, np.sin(2*np.pi*t) * np.exp(-t/10.)
def init():
'''A setup function, called before the animation begins.'''
ax.set_ylim(-1.1, 1.1)
ax.set_xlim(0, 100)
del xdata[:]
del ydata[:]
line.set_data(xdata, ydata)
return line,
fig, ax = plt.subplots()
line, = ax.plot([], [], lw=2)
ax.grid()
xdata, ydata = [], []
def run(data, args):
'''This is called by the animator for each frame with new data from
"data_gen" each time. What we do here is up to us: we could even
write the plot to disc (see the commented-out line) or we could do
something completely unrelated to matplotlib! The present code
will append new data to its old (global variable) data and
generate a new animation frame. Note that matplotlib holds a copy
of our old data so we could fish it out from the depths of its
internal representation and append to that but that's a bit
complicated for our example here. We have been passed "args" but
we ignore that.'''
t, y = data
xdata.append(t)
ydata.append(y)
xmin, xmax = ax.get_xlim()
if t >= xmax:
ax.set_xlim(xmin, 2*xmax)
ax.figure.canvas.draw()
line.set_data(xdata, ydata)
return line,
ani = animation.FuncAnimation(fig, run, data_gen, blit=False, interval=10,
fargs=("arguments",), repeat=False, init_func=init)
plt.show()
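- if you want a video file directly instead of an interactive window, the animation object created above can also be saved; a sketch, assuming ffmpeg is installed (file name and frame rate are arbitrary choices)
ani.save("sine_wave.mp4", writer="ffmpeg", fps=25)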
Use your Game of Life from earlier on and animate it using FuncAnimation. You have already written the stepper in such a way that it is easy to wrap into a small "run" function which generates frames one at a time. Hint: the easiest way to plot is probably matplotlib's imshow function.
Available in the repo.
import sys
#sys.path.append("codes/python")
import Game_of_Life
import matplotlib
import matplotlib.pyplot
import matplotlib.animation
matplotlib.pyplot.ion()
def frame_generator(iteration, state, fig, ax):
state[:] = Game_of_Life.step(state)[:]
axesimage = ax.imshow(state)
return [axesimage]
def animate_game(size=(100,100)):
fig = matplotlib.pyplot.figure()
ax = fig.add_subplot(111)
state = Game_of_Life.initial(size)
ani = matplotlib.animation.FuncAnimation(fig, frame_generator, fargs=(state, fig, ax),
blit=False, interval=10, frames=10,
repeat=True)
matplotlib.pyplot.show()
return ani
- ParaView, as the name suggests, runs in (distributed) parallel: no data is too big if you managed to create it in the first place
- Some complications in getting the proper distributed parallel version up and running:
- ParaView is split into a client and a server
- the normal paraview command runs the client with a local server, but not in parallel
- not what you want anyway: you can run ParaView this way on your supercomputer, but the UI will be very slow as all plotting data and interaction need to go over the network
- you need to run pvserver on the "big" machine and connect the paraview frontend to that