Introduction to Algorithms in Python

Efficient and portable I/O

  • You will eventually want to read and write data, too
  • Badly implemented I/O can easily take 99% of your run-time!

Portable I/O

  • plain text is portable, but
    • horribly space-wasting (unless compressed, which in turn harms portability)
    • horribly slow, as everything needs to be converted
      • unless your actual calculations are done using letters
    • cannot really be parallelised, so in parallel it is even slower
  • some file formats will be hard or impossible to open on another computer, so avoid machine-dependent I/O, like
    • Fortran’s unformatted I/O
      • unlikely to be efficient
      • likely to be very hard to open on another machine or with another Fortran library
    • C’s fwrite(data, size, count, file)
      • with the right size and count this can be very efficient for a single-threaded application
      • but not very portable: few data types are guaranteed to be equivalent across machines and byte-order can also change
  • some good libraries for portable I/O
    • HDF5 is the de facto high-performance data format and is very well standardised
      • cross-language, cross-platform: available in every imaginable system and language!
      • does not load the file into memory unless the data is sliced or altered: useful for processing files which are too large for your memory, if you do not need to alter the data
    • netcdf4 is also quite efficient (in fact it is HDF5 underneath)
    • sometimes numpy’s save, savez and savez_compressed are fine (see the sketch after this list)
      • remember to pass allow_pickle=False to maintain portability
      • when loading, pass the mmap_mode parameter to use memory-mapped I/O: useful for large files when only part of the data is needed
      • but have a look at Blaze, too
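  • as a hedged sketch of the numpy route (the file name and array contents here are made up):
import numpy

data = numpy.random.rand(100, 100, 100)

# save portably: allow_pickle=False guarantees no Python-specific
# pickled objects end up in the file
numpy.save("data.npy", data, allow_pickle=False)

# load lazily: mmap_mode="r" memory-maps the file, so only the
# slices you actually touch are read from disc
loaded = numpy.load("data.npy", mmap_mode="r")
print(loaded[0, 0, :5])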

numpy can import almost any text file easily

  • suppose you have a file like this
# This file contains a bunch of point particles of varying locations, speeds, 
# masses and charges
X	Y	Z	Vx	Vy	Vz	mass	charge
0	0	0	0	0	0.1	100.0	0.0
1.0	0	4	0	10.0	0	10.0	-1.0
0	2	0	0.1	0	0.1	1000.0	0.0
2	2	-3	0.1	0	0.1	100.0	1.0
# A comment line
  • you can import this with numpy easily
<<genfromtxt_example_import_importer>>
print(data)
  • please see help(numpy.genfromtxt) for the full documentation: the function is capable of ingesting almost any kind of textual data, even strings!
  • but it is not fast; no text import ever is
  • we’ll later show how to plot this
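  • the import step hidden behind the noweb reference above might look like the following sketch (the file name and parameters are assumptions based on the plotting example later in this document):
import numpy

# tab-separated columns, "#" comment lines, and a three-line header to skip
data = numpy.genfromtxt("files/genfromtxt_example_data.txt",
                        comments="#", delimiter="\t", skip_header=3)
print(data)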

Reading a HDF5 dataset using h5py

  • HDF5 is a hierarchical data format; you can think of it as a file system of a sort, but inside a file:
    • data lives in a dataset, of any dimensionality and various types (integer, float, double…)
    • metadata lives in an attribute
    • there can be any number of both
    • a group can be used to group them
  • h5py exposes datasets as dicts of {datasetname: datasetvalues} and attributes as Python attributes; groups are also dicts whose values are the group’s members (usually datasets, i.e. a dict within a dict)
  • a dataset looks and feels like a numpy array, except that
    • it can only be resized if declared resizable
    • it is a memory-mapped array (more on that in an exercise)
import h5py
import numpy
f = h5py.File(<<h5py_read_example_filename>>, "r")
g = f["my_grid_data"]    # datasets are accessed like dict entries
v = f["my_vector_data"]  # a vector field: the first index is the component
print("Shapes")
print(g.shape, v.shape)
print("Maximum of coordinate values")
print(g[:].max())        # g[:] pulls the whole dataset into a numpy array
print("Vector norms squared")
print(numpy.einsum('i...,i...', v, v))  # contract over the component index
Shapes
(2, 2, 2) (3, 2, 2)
Maximum of coordinate values
11.0
Vector norms squared
[[ 1.04592874  0.25349788]
 [ 0.33607341  0.86112512]]
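  • for the other direction, here is a minimal hedged sketch of how such a file could be written with h5py (the dataset names match the reading example above; the file name, shapes and values are made up):
import h5py
import numpy

with h5py.File("my_data.h5", "w") as f:
    # a dataset is created directly from a numpy array
    f["my_grid_data"] = numpy.random.rand(2, 2, 2)
    f["my_vector_data"] = numpy.random.rand(3, 2, 2)
    # metadata goes into attributes, here on the file's root group
    f.attrs["description"] = "example grid and vector data"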

Always write huge chunks of data

  • latency is more likely to ruin performance than anything else, so unless you know exactly where the I/O bottleneck is, do big writes into big files, buffering internally in your code if necessary
  • and big really means big: a 10 MB write is not a big write, let alone a big file!
  • unfortunately, Python is not very good at demonstrating this, but you can try to compile and run the following C program (available in codes/cpp/chunk_size_effect.c)
// This file is generated by org-mode, please do not edit
#define _GNU_SOURCE 1
#define _POSIX_C_SOURCE 200809L
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define SIZE (1000*1000*100)

int main(int argc, char *argv[]) {
  char *file1, *file2;
  if (argc != 3) {
    // please note this is UNSAFE: if such files exist, they will be overwritten
    file1 = "testfile1";
    file2 = "testfile2";
  } else {
    file1 = argv[1];
    file2 = argv[2];
  }
  int fd1 = open(file1, O_WRONLY|O_TRUNC|O_CREAT, S_IRUSR|S_IWUSR);
  int fd2 = open(file2, O_WRONLY|O_TRUNC|O_CREAT, S_IRUSR|S_IWUSR);
  double *data = (double *) calloc(SIZE, sizeof(double));
  struct timespec t1, t2, t3;
  clock_gettime(CLOCK_MONOTONIC, &t1);
  for (int i=0; i<SIZE; i++) {
    write(fd1, data+i, sizeof(double)*1);
  }
  clock_gettime(CLOCK_MONOTONIC, &t2);
  write(fd2, data, sizeof(double)*SIZE);
  clock_gettime(CLOCK_MONOTONIC, &t3);
  printf("Writing one element at a time took %6li seconds\n", t2.tv_sec-t1.tv_sec);
  printf("Writing all elements at once took  %6li seconds\n", t3.tv_sec-t2.tv_sec);
  close(fd1);
  close(fd2);
  return 0;
}
Writing one element at a time took     56 seconds
Writing all elements at once took       0 seconds
  • Performant I/O is a bit of a dark art: there are loads of caches on the way from memory to disc, and only the limit as file size goes to infinity measures the true I/O speed
    • in the above case, my laptop gives 71 and 2 seconds, but 2 s corresponds to 4 times the theoretical maximum speed of the disc!
  • It is even more of a dark art because the disc, unlike the CPU, is a shared resource: other users use the same discs
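  • the same buffering idea in Python, as a hedged sketch: collect small results in an in-memory buffer and write them in one go instead of appending to a file inside the loop (the file name and sizes are made up):
import numpy

n_steps, chunk = 10000, 100
results = numpy.empty((n_steps, chunk))

# compute into the in-memory buffer...
for i in range(n_steps):
    results[i, :] = numpy.random.rand(chunk)  # stand-in for real work

# ...then do one big write instead of 10000 tiny ones
with open("results.bin", "wb") as f:
    results.tofile(f)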

Parallel I/O

  • always use parallel I/O for parallel programs
  • poor man’s parallel I/O
    • every worker writes its own file
    • can be the fastest solution
    • but how do you use those files with a different number of workers, e.g. for post-processing?
  • MPI I/O or an MPI-enabled HDF5 library deals with that
    • they can write a single file simultaneously from all workers
    • may do some hardware-based optimisations behind the scenes
    • can also map the writes to the MPI topology
    • there is a bit of a learning curve, unless you choose to use h5py or some other library which handles the complexity for you, as sketched below
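  • for example, with h5py built against a parallel HDF5, the whole thing is a few lines; this is a hedged sketch (the file and dataset names are made up), launched with something like mpirun -np 4 python script.py:
from mpi4py import MPI
import h5py
import numpy

comm = MPI.COMM_WORLD

# collectively open one shared file using the MPI-IO driver
with h5py.File("parallel_example.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("ranks_data", (comm.Get_size(), 10), dtype="f8")
    # each rank writes its own slice of the single shared dataset
    dset[comm.Get_rank(), :] = numpy.full(10, comm.Get_rank(), dtype="f8")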

Checkpointing

  • Your code should be able to do this on its own, to support solving the problem by running the code several times: it is often not possible to get access to a computer for long enough to solve the problem in one go.
  • Basically, you save your current iterate or best-estimate solution, and later load it from file instead of using random or hard-coded initial conditions.
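  • a minimal hedged sketch of the idea (the file name, state shape, update rule and checkpoint interval are all made up):
import os
import numpy

CHECKPOINT = "checkpoint.npy"

# resume from a checkpoint if one exists, otherwise start fresh
if os.path.exists(CHECKPOINT):
    state = numpy.load(CHECKPOINT, allow_pickle=False)
else:
    state = numpy.random.rand(100, 100)  # initial condition

for iteration in range(1000):
    state = 0.5 * (state + numpy.roll(state, 1, axis=0))  # stand-in for real work
    if iteration % 100 == 0:
        numpy.save(CHECKPOINT, state, allow_pickle=False)  # periodic save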

Exercises

Experiment with different ways of saving a 100x100x100 numpy array

Unfortunately we cannot easily speed-test these here, but try at least

  1. On your own
  2. numpy functions
  3. h5py

Memmapped IO

  • Sometimes your file is too big to load into memory; memmap is then your friend.
  • Files which have been memmapped are only loaded into memory a small chunk at a time, as needed
  • But they look like normal files to whoever is using them
  • Use h5py’s memmap mode and numpy’s memmap mode to process the file you saved above (it does not matter what you do with it, perhaps just add one)
    • nothing in your code would change if you needed to process the largest file in the world
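  • on the numpy side this could look like the following hedged sketch (the file name assumes the array saved in the earlier exercise):
import numpy

# open the saved array without reading it all into memory;
# mmap_mode="r+" memory-maps the file and allows in-place modification
data = numpy.load("data.npy", mmap_mode="r+")

data[...] += 1.0  # touched chunks are paged in (and written back) on demand
data.flush()      # make sure the modifications reach the disc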

Simple Visualisation

matplotlib

  • The matplotlib python package is terribly good, but it cannot do Big Data as it is not distributed
  • It’s also not properly parallel, so it can often be slow
  • But it is
    • easy
    • interactive
    • if you only need to plot a subset of your data (e.g. a 2D slice of 3D data) it might scale well enough
  • please note that interactivity over the network will be laggy; we show how it works anyway
  • the following “ipython magic” is only needed to embed the output in the ipython/jupyter notebook
    • it needs to be done once per python session, so please always execute this cell even if you only want to look at a single later example
    • this isn’t required for people running python inline/from a file
%matplotlib notebook

A Simple Example: a parabola

import pylab, numpy
x = numpy.mgrid[-5:5:100j]
pylab.plot(x, x**2, "b-", label=r"$x^2$")
pylab.legend()
<<pylab_plot_example_export>>

Plotting a Saved File: a simple 3D example

  • in this example we use the file we created earlier, files/genfromtxt_example_data.txt, and save the plot to a file called files/genfromtxt_example_plot.png
infile = "files/genfromtxt_example_data.txt"
oufile = "files/genfromtxt_example_plot.png"
import numpy
import matplotlib
import matplotlib.pyplot
from mpl_toolkits.mplot3d import Axes3D

data = numpy.genfromtxt(infile, comments="#", delimiter="\t", skip_header=3)
fig = matplotlib.pyplot.figure()
ax = fig.add_subplot(111, projection='3d')
n = data.shape[0]
# plot a sphere for each particle
# colour charged particles red (charge>0), blue (charge<0) and neutrals green
blues = data[data[:,7] < 0]
reds = data[data[:,7] > 0]
greens = data[numpy.logical_not(numpy.logical_or(data[:,7] < 0, data[:,7] > 0))]
ax.scatter(blues[:,0], blues[:,1], blues[:,2], c="b", edgecolors="face",
           marker="o", s=blues[:,6])
ax.scatter(reds[:,0], reds[:,1], reds[:,2], c="r", edgecolors="face",
           marker="o", s=reds[:,6])
ax.scatter(greens[:,0], greens[:,1], greens[:,2], c="g", edgecolors="face",
           marker="o", s=greens[:,6])
ax.quiver(blues[:,0], blues[:,1], blues[:,2], blues[:,3], blues[:,4],
          blues[:,5], pivot="tail")
ax.quiver(reds[:,0], reds[:,1], reds[:,2], reds[:,3], reds[:,4],
          reds[:,5], pivot="middle")
ax.quiver(greens[:,0], greens[:,1], greens[:,2], greens[:,3], greens[:,4],
          greens[:,5], pivot="tip")
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
matplotlib.pyplot.savefig(oufile)
print(oufile, end="")

Advanced Features

  • we did not use anything advanced, except matplotlib’s built-in LaTeX capability, but matplotlib provides full control of the whole canvas and image window

Animation Using matplotlib

  • matplotlib has a rudimentary animation capability as well
    • ParaView is better at this, and matplotlib will not be able to create beautiful, complex animations
    • but it can do simple ones
    • and it can be used to generate lots of frames for a video
      • but unless you need matplotlib-frontend-specific interactivity, using a file-writing backend directly, without plotting on screen, is much faster
      • in either case you can convert the frames to a video like this
        ffmpeg -f image2 -pattern_type glob -framerate 25 -i \
         'testanimationsaveframe_*.png' -s 800x600 foo.mkv
    • or to illustrate how an algorithm works, see the exercises!
  • here’s an example with all the important bits:
  • here’s an example with all the important bits:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
plt.ion()

def data_gen(t=0):
    '''A generator function, which must use "yield" as all generators do,
    to produce results one frame at a time. In this example, the "run"
    function will actually remember/save data for previous frames so
    we get away with generating just the new data. Whatever we return
    will be passed as the sole argument to "run".'''
    cnt = 0
    while cnt < 1000:
        cnt += 1
        t += 0.1
        yield t, np.sin(2*np.pi*t) * np.exp(-t/10.)

def init():
    '''A setup function, called before the animation begins.'''
    ax.set_ylim(-1.1, 1.1)
    ax.set_xlim(0, 100)
    del xdata[:]
    del ydata[:]
    line.set_data(xdata, ydata)
    return line,

fig, ax = plt.subplots()
line, = ax.plot([], [], lw=2)
ax.grid()
xdata, ydata = [], []

def run(data, args):
    '''This is called by the animator for each frame, with new data from
    "data_gen" each time. What we do here is up to us: we could even
    write the plot to disc (see the commented-out line below) or we
    could do something completely unrelated to matplotlib!  The present
    code appends the new data to its old (global variable) data and
    generates a new animation frame. Note that matplotlib holds a copy
    of our old data, so we could fish it out from the depths of its
    internal representation and append to that, but that is a bit
    complicated for our example here.  We have been passed "args" but
    we ignore it.'''
    t, y = data
    xdata.append(t)
    ydata.append(y)
    # fig.savefig("testanimationsaveframe_{:06d}.png".format(len(xdata)))
    xmin, xmax = ax.get_xlim()
    if t >= xmax:
        ax.set_xlim(xmin, 2*xmax)
        ax.figure.canvas.draw()
    line.set_data(xdata, ydata)
    return line,

ani = animation.FuncAnimation(fig, run, data_gen, blit=False, interval=10, 
                              fargs=("arguments",), repeat=False, init_func=init)
plt.show()

Exercise

Use your Game of Life from earlier on and animate it using FuncAnimation. You have already written the stepper in such a way that it is easy to wrap into a small “run” function which generates the frames one at a time. Hint: the easiest way to plot is probably matplotlib’s imshow function.

Solution

Available in the repo.

import sys
#sys.path.append("codes/python")
import Game_of_Life

import matplotlib
import matplotlib.pyplot
import matplotlib.animation

matplotlib.pyplot.ion()

def frame_generator(iteration, state, fig, ax):
    # advance the game by one step, in place, and redraw
    state[:] = Game_of_Life.step(state)[:]
    axesimage = ax.imshow(state)
    return [axesimage]

def animate_game(size=(100,100)):
    fig = matplotlib.pyplot.figure()
    ax = fig.add_subplot(111)
    state = Game_of_Life.initial(size)
    # FuncAnimation passes the frame number plus fargs to frame_generator
    ani = matplotlib.animation.FuncAnimation(fig, frame_generator, fargs=(state, fig, ax),
                                             blit=False, interval=10, frames=10,
                                             repeat=True)
    matplotlib.pyplot.show()
    return ani

Parallel Visualisation: ParaView (very quick intro)

  • ParaView, as the name suggests, runs in (distributed) parallel: no data is too big if you managed to create it in the first place
  • There are some complications in getting the properly distributed parallel version up and running:
    • ParaView is split into a client and a server
    • the normal paraview command runs the client with a local server, but not in parallel
    • that is not what you want anyway: you can run ParaView this way on your supercomputer, but the UI will be very slow, as all plotting data and interaction need to go over the network
    • instead, you need to run pvserver on the “big” machine and connect the paraview frontend to that