Tested with Python >= 3.6.8

`tf_dataset` is a Python module, inspired by the `signal.dataset` package in R, for creating TensorFlow datasets from metadata to be used in machine learning pipelines. It contains functionality to create a `tf.data.Dataset` object from (1) a metadata dictionary as a `pd.DataFrame` object, or (2) a dirpath to data files.

`tf_dataset` also contains functions to efficiently parallel map over `tf.data.Dataset` objects, lazily applying arbitrary ops to each element. This code is extensible (in a functional sense), so users may define their own map calls (see user-defined-maps) and may even define their own callables (see user-defined-loads) for loading data to suit their needs.
git clone https://github.com/ifrit98/tf_dataset.git
cd tf_dataset && pip install .
There are four main entry points, which cover most use cases. Below is a short overview of these functions. If you have more specialized needs, head to advanced-example.
`construct_metadata()` takes a path to the directory where your data lives and a file extension, such as `.wav`. It assumes the data is all at the top level (i.e. no nested directories). You may pass a `labels` dictionary mapping each data filepath in `data_dir` to its class label (if this is a classification problem).
E.g.:
```{python}
labels = {
'data/signal123.wav': 'cargo',
'data/signal456.wav': 'tug',
...,
'data/signal666.wav': 'whale'
}
df = construct_metadata("./data", ext=".wav", labels=labels)
print(df.head())
```
If no labels dict is passed, `construct_metadata()` will attempt to parse labels using a regex, extracting the capture group between the last underscore `_` and the file extension, e.g. `.wav`.
# Example filepath:
fp = "data/signal123_blah_blah_cargo.wav"
extracted_label = regex_filter(fp)
print(extracted_label)
>>> "cargo"
NOTE: If your data filepaths do not follow this naming convention and you did not pass a `labels` dict mapping, then the inferred labels will be meaningless.
If you have regressed values as targets (as opposed to categorical), you may pass a `np.ndarray` containing the regressed `Y` values for each example to `targets`:
df = construct_metadata("./data", create_targets=False, targets=[0,0,1,2])
`signal_dataset()` accepts a pandas dataframe minimally containing the columns `['filepath', 'class']`, and possibly other associated metadata stored as additional columns (such as is returned by `construct_metadata()`).
df = construct_metadata("./data")
print(df.head())
>>> filepath class target num_classes
>>> 0 C:\internal\tf_dataset\data\1561965276.252... passenger 0 4
>>> 1 C:\internal\tf_dataset\data\1561970414.69_... passenger 0 4
>>> 2 C:\internal\tf_dataset\data\1561971147.625... cargo 1 4
>>> 3 C:\internal\tf_dataset\data\1561971697.731... tug 2 4
ds = signal_dataset(df)
Instead of using the default, it is also possible to pass a user-defined data loading function as `process_db_example` to `signal_dataset()`, as sketched below. See signal-dataset-demo and loading-and-mapping.
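A minimal sketch of what such a callable might look like (the exact signature expected by `signal_dataset()` may differ; see the tutorials above):

```{python}
import tensorflow as tf

def my_process_db_example(example):
    # Hypothetical loader: read the raw bytes at example['filepath'] and
    # decode them into a float32 waveform (assumes mono wav files),
    # in place of the default int32 load.
    audio_bytes = tf.io.read_file(example['filepath'])
    waveform, _ = tf.audio.decode_wav(audio_bytes)
    example['signal'] = tf.squeeze(waveform, axis=-1)
    return example

ds = signal_dataset(df, process_db_example=my_process_db_example)
```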
`training_dataset()` is designed to be the standard front end for converting your metadata into training datasets that can be passed to `tf.keras.Model.fit()` all in one go. This allows you to control parameters like `batch_size`, `win_len`, `shuffle_buffer_size`, `prefetch_buffer_size`, `infinite` (e.g. to repeat the dataset infinitely for training), etc., from one function call.
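For example (parameter values here are purely illustrative):

```{python}
ds = training_dataset(
    df,
    batch_size=16,
    win_len=4096,
    shuffle_buffer_size=100,
    prefetch_buffer_size=1,
    infinite=True  # repeat the dataset indefinitely for training
)
```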
`training_dataset()` takes a metadata dataframe containing data filepaths and associated metadata, and returns a `tf.data.Dataset` object, a lazily evaluated graph object that can be run in eager mode by simply iterating over the dataset using a `for` construct:
ds = training_dataset(df)
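for x in ds.take(1):
    print(x)  # each element is a dict of tensors, evaluated eagerly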
where `x` is a python dictionary whose entries contain all relevant data and metadata for a single example.
If `process_db_example` is not passed (i.e. defaults to `None`), then signals will be loaded as 1D arrays with `dtype == int32`.
You may also create a tensorflow dataset directly from a directory using `dataset_from_dir`, subject to certain constraints. They are:
- Files in `data_dir` must be at the top level (no nested dirs with data). For example:
data_dir
│ README.md
│ signal001.wav
│ signal002.wav
│ signal003.wav
│ signal004.wav
│ ...
│ signal022.wav
- You must do ONE of the following:
  - Pass a python `dict` mapping your data filepaths to their associated labels:
import os
import numpy as np
from tf_dataset import dataset_from_dir, dataset_compact
data_dir = "./data"
files = os.listdir(os.path.abspath(data_dir))
print(files)
>>> ['1561965276.252_1561965582.106_24346_passenger.wav',
>>> '1561970414.69_1561970702.051_24346_passenger.wav',
>>> '1561971147.625_1561971202.077_24346_cargo.wav',
>>> '1561971697.731_1561971718.283_24346_tug.wav']
label_arr = ['passenger', 'passenger', 'cargo', 'tug']
labels = dict(zip(files, label_arr))
print(labels)
>>> {'1561965276.252_1561965582.106_24346_passenger.wav': 'passenger',
>>> '1561970414.69_1561970702.051_24346_passenger.wav': 'passenger',
>>> '1561971147.625_1561971202.077_24346_cargo.wav': 'cargo',
>>> '1561971697.731_1561971718.283_24346_tug.wav': 'tug'}
  - Pass a `np.ndarray` (or python list) containing only the labels for each file in `data_dir`:
labels = np.asarray(['passenger', 'passenger', 'cargo', 'tug'])
  - Pass `None` (default). Filepaths must be of the form `"*_label.ext"`, e.g. `"signal_123_cargo.wav"` -> `"cargo"`:
labels = None
If your data targets are not categorically valued, you must pass a `targets` list or numpy array containing the (regressed or otherwise) values:
targets = [0.5, 0.1, 0.9, 0.33]
ds = dataset_from_dir("./data", targets=targets)
ds = dataset_compact(ds, 'signal', 'target')
Note that you may pass any arguments here that will be passed on to `training_dataset()`, such as `batch_size`, `win_len`, etc. See training-dataset.
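For example (a minimal sketch; the parameter values are illustrative):

ds = dataset_from_dir("./data", targets=targets, batch_size=4, win_len=8192)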
Now, using any of the methods above to create `labels`, we can create the dataset:
ext = ".wav"
ds = dataset_from_dir(data_dir, labels=labels, ext=ext, batch_size=2)
The main design principle behind `tf_dataset` is a functional API, much like the `tf.keras` functional API, in which `tf.data.Dataset` objects can be easily manipulated and users may apply whatever impairments or transformations are required. Users may create their own `map` calls to apply over the dataset object, as sketched below; see user-defined-maps.
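A minimal sketch of such a user-defined map over an existing dataset `ds` (the op here is purely illustrative):

```{python}
import tensorflow as tf

def scale_signal(example):
    # Illustrative element-wise op: halve the signal amplitude,
    # leaving all other metadata entries untouched.
    example['signal'] = tf.cast(example['signal'], tf.float32) * 0.5
    return example

ds = ds.map(scale_signal, num_parallel_calls=tf.data.experimental.AUTOTUNE)
```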
from os import path
import pandas as pd
import tensorflow as tf
import tf_dataset as tfd
d = {
'filepath': [
'data/1561965276.252_1561965582.106_24346_passenger.wav',
'data/1561970414.69_1561970702.051_24346_passenger.wav',
'data/1561971147.625_1561971202.077_24346_cargo.wav',
'data/1561971697.731_1561971718.283_24346_tug.wav'],
'class': ['passenger', 'passenger', 'cargo', 'tug']
}
df = pd.DataFrame.from_dict(d)
df['filepath'] = df['filepath'].apply(lambda x: path.abspath(x))
ds = tfd.signal_dataset(df) # <-- one line to create your basic tf dataset!
# Normalize gain to [-1, 1]
ds = tfd.dataset_signal_normalize(ds)
for x in ds.take(1):
print(x, "\n")
print("min:", tf.reduce_min(x['signal']))
print("max:", tf.reduce_max(x['signal']))
# Slice signals to a specified window length
ds = tfd.dataset_signal_slice_windows(ds, win_len=8192)
# `x` dict now contains `win_start_idx` as metadata and the windowed signal.
for x in ds.take(1):
print(x)
print(x['win_start_idx'])
# Move to complex plane using hilbert transform
ds = tfd.dataset_signal_apply_hilbert(ds)
for x in ds.take(1):
print(x, "\n")
print("signal dtype:", x['signal'].dtype)
batch_size = 8
ds = ds.batch(batch_size)
for x in ds.take(1):
print(x) # `x` now batched e.g. x.shape == (8, 8192)
print("\nsignal shape:", x['signal'].shape)
# Reduce dataset to (x,y) tuple pairs for use with `tf.keras.Model`s.
ds = tfd.dataset_compact(ds, 'signal', 'target')
val_ds_size = 5
val_ds = ds.take(val_ds_size)
...
# To be used with tensorflow/keras models
h = model.fit(ds, validation_data=val_ds)
NOTE: `training_dataset()` applies all of these map calls by default, tunable by function params (see docstring), such that:
df = construct_metadata("./data")
ds = training_dataset(df, win_len=8192, batch_size=8)
Which is equivalent to using the functional API like so:
df = construct_metadata("./data")
ds = tfd.signal_dataset(df)
ds = tfd.dataset_signal_slice_windows(ds, win_len=8192)
ds = tfd.dataset_signal_apply_hilbert(ds)
ds = tfd.dataset_signal_normalize(ds)
ds = ds_batch(ds, batch_size=8)
ds = ds_prefetch(ds, 1)
Which is also equivalent to:
ds = dataset_from_dir("./data")
This further convenience combines both previous steps into one call to `dataset_from_dir()`, which takes `kwargs` that will be passed on to `training_dataset()`.
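For example, the `win_len` and `batch_size` settings from the functional chain above can be supplied directly:

ds = dataset_from_dir("./data", win_len=8192, batch_size=8)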
`dataset_compact()` compacts the python `dict`s in a dataset object down into `(x, y)` tuple pairs.
See compact.
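A quick sketch of the effect (shapes depend on your windowing and batching settings):

```{python}
ds = dataset_compact(ds, 'signal', 'target')
for x, y in ds.take(1):
    # After compacting, each element is an (x, y) pair rather than a dict,
    # e.g. x.shape == (8, 8192) and y.shape == (8,) after batching.
    print(x.shape, y.shape)
```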
A demo of how `signal_dataset()` works under the hood.
For a tutorial on user-defined map and loading functions.
For a tutorial on writing and reading datasets.
A submodule named `tf_utils` is accessible from `tf_dataset` and contains functions useful for working in tensorflow. Feel free to explore and have fun!
import tf_dataset as tfd
from tf_dataset.tf_utils import *
import tensorflow as tf
x = tf.random.normal([8192])
is_tensor(x)
>>> True
is_scalar_tensor(x)
>>> False
valid_lengths = tf.size(x)
tf_assert_length2(x, valid_lengths)
>>> <tensorflow.python.framework.ops.NullContextmanager at 0x2324252a430> # Pass
x = as_complex_tensor(x)
tf_assert_is_one_signal(x)
>>> <tensorflow.python.framework.ops.NullContextmanager at 0x2324252a430> # Pass
ds = tfd.signal_dataset(tfd.construct_metadata("./data"))
for x in ds: break
shapes(x)
>>> {'index': (),
'filepath': (),
'class': (),
'num_classes': (),
'target': (),
'signal': (986507,)}
print(x)
>>> {'class': <tf.Tensor: shape=(), dtype=string, numpy=b'tug'>,
'filepath': <tf.Tensor: shape=(), dtype=string,
numpy=b'C:\\internal\\tf_dataset\\data\\1561971697.731_1561971718.283_24346_tug.wav'>,
'index': <tf.Tensor: shape=(), dtype=int64, numpy=3>,
'num_classes': <tf.Tensor: shape=(), dtype=int64, numpy=4>,
'signal': <tf.Tensor: shape=(986507,), dtype=int32,
numpy=array([ 1179011410, 3946020, ..., -2113897881, 1764622450])>,
'target': <tf.Tensor: shape=(), dtype=int64, numpy=2>}
Tests are stored in `tf_dataset/test.py` and can be run with a simple shell command from the top-level directory of this repo:
pytest tf_dataset/test.py
Go to the maintainer [@ifrit98] for bug reports and usage questions that aren't answered here.
Pull requests welcome!