A MATLAB/Octave function which generates 2D data clusters. Data is created along straight lines, which can be more or less parallel depending on the selected input parameters.

    [data, clustPoints, idx, centers, angles, lengths] = ...
        generateData(angleMean, angleStd, numClusts, xClustAvgSep, yClustAvgSep, ...
                     lengthMean, lengthStd, lateralStd, totalPoints, ...)

Parameter | Description |
---|---|
angleMean | Mean angle in radians of the lines on which clusters are based. Angles are drawn from the normal distribution. |
angleStd | Standard deviation of line angles. |
numClusts | Number of clusters (and therefore of lines) to generate. |
xClustAvgSep | Average separation of line centers along the X axis. |
yClustAvgSep | Average separation of line centers along the Y axis. |
lengthMean | Mean length of the lines on which clusters are based. Line lengths are drawn from the folded normal distribution. |
lengthStd | Standard deviation of line lengths. |
lateralStd | Cluster "fatness", i.e., the standard deviation of the distance from each point to its projection on the line. The way this distance is obtained is controlled by the optional 'pointOffset' parameter. |
totalPoints | Total points in generated data. These will be randomly divided between clusters using the half-normal distribution with unit standard deviation. |
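
The distributions named in the table can be illustrated with a few lines of MATLAB/Octave. This is only a sketch of what the descriptions imply, not the code inside generateData: the example values are arbitrary, and the rounding fix-up used to make the cluster sizes add up to totalPoints is an assumption.

```matlab
% Sketch only: mirrors the parameter descriptions, not the generateData source.
angleMean = pi / 4;  angleStd = pi / 16;   % example values (arbitrary)
lengthMean = 5;      lengthStd = 1;
numClusts = 4;       totalPoints = 200;

% Line angles: normal distribution centered at angleMean
angles = angleMean + angleStd * randn(numClusts, 1);

% Line lengths: folded normal, i.e. the absolute value of a normal draw
lengths = abs(lengthMean + lengthStd * randn(numClusts, 1));

% Cluster sizes: half-normal weights (unit std) scaled to totalPoints
weights = abs(randn(numClusts, 1));
clustPoints = floor(totalPoints * weights / sum(weights));

% Points lost to rounding down go to the largest cluster
% (one possible fix-up, assumed here for illustration)
[~, largest] = max(clustPoints);
clustPoints(largest) = clustPoints(largest) + totalPoints - sum(clustPoints);
```

With this scheme a cluster can end up with zero points, which is the situation the optional allowEmpty parameter addresses.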
Parameter name | Parameter values | Default value | Description |
---|---|---|---|
allowEmpty | true, false | false | Allow empty clusters? |
pointDist | 'unif', 'norm' | 'unif' | Specifies the distribution of points along lines: 'unif' distributes points uniformly along lines; 'norm' distributes points along lines using a normal distribution (the line center is the mean and the line length is equal to 3 standard deviations). |
pointOffset | '1D', '2D' | '2D' | Controls how points are created from their projections on the lines: '1D' places each point on a second line, perpendicular to the cluster line, using a normal distribution centered at their intersection; '2D' places each point using a bivariate normal distribution centered at the point's projection. |
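
The following sketch shows what the pointDist and pointOffset options describe for a single line. It is an illustration under assumptions, not the function's own code: all variable names and values are made up, and the 'norm' case takes the table literally (line length = 3 standard deviations), which may differ from the scaling the function actually applies.

```matlab
% Sketch only; names and example values are assumptions.
center = [0 0];  theta = pi / 6;  len = 10;
lateralStd = 0.5;  numPoints = 500;

direction = [cos(theta) sin(theta)];     % unit vector along the line
normal    = [-sin(theta) cos(theta)];    % unit vector perpendicular to it

% pointDist: signed distance of each projection from the line center
posUnif = len * rand(numPoints, 1) - len / 2;    % 'unif'
posNorm = (len / 3) * randn(numPoints, 1);       % 'norm' (literal reading)

% Projections on the line (uniform case shown; posNorm works the same way)
projections = repmat(center, numPoints, 1) + posUnif * direction;

% pointOffset '1D': displace each projection along the perpendicular only
points1D = projections + (lateralStd * randn(numPoints, 1)) * normal;

% pointOffset '2D': isotropic bivariate normal around each projection
points2D = projections + lateralStd * randn(numPoints, 2);
```

In the '1D' case the along-line position of each point is left untouched, so a uniform cluster keeps sharp ends; in the '2D' case the points also spread slightly beyond the line ends.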
Value | Description |
---|---|
data | Matrix (totalPoints x 2) with the generated data. |
clustPoints | Vector (numClusts x 1) containing the number of points in each cluster. |
idx | Vector (totalPoints x 1) containing the cluster index of each point. |
centers | Matrix (numClusts x 2) containing the line centers from which clusters were generated. |
angles | Vector (numClusts x 1) containing the effective angles of the lines used to generate clusters. |
lengths | Vector (numClusts x 1) containing the effective lengths of the lines used to generate clusters. |
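
As a small illustration of how these outputs relate to each other, the sketch below uses hand-made stand-ins for data and idx (the values are hypothetical, not produced by generateData) to recover the per-cluster counts and to select the points of one cluster.

```matlab
% Hypothetical stand-ins for the outputs: 6 points in 3 clusters
idx  = [1; 1; 2; 3; 3; 3];
data = [0 0; 1 1; 5 5; 9 0; 9 1; 9 2];   % made-up coordinates
numClusts = 3;

% Counting the labels gives the same information as clustPoints;
% the explicit size keeps empty clusters as zeros
clustPoints = accumarray(idx, 1, [numClusts 1]);

% Logical indexing selects the points that belong to cluster 3
cluster3 = data(idx == 3, :);
```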

    [data, cp, idx] = generateData(pi / 2, pi / 8, 5, 15, 15, 5, 1, 2, 200);

The previous command creates 5 clusters with a total of 200 points, with a mean angle of π/2 (std=π/8), separated on average by 15 units in both the x and y directions, with a mean length of 5 units (std=1) and a "fatness" or spread of 2 units.
The following command plots the generated clusters:

    scatter(data(:, 1), data(:, 2), 8, idx);

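
To draw the supporting lines over the scatter plot, their endpoints can be derived from the centers, angles and lengths outputs. The sketch below assumes each center is the midpoint of its line and uses made-up values in place of real outputs.

```matlab
% Hypothetical outputs for two clusters (values made up for illustration)
centers = [1 2; 10 -3];
angles  = [0; pi / 2];
lengths = [4; 6];

% Vector from each center to one end of its line (half the line length)
half = [lengths .* cos(angles), lengths .* sin(angles)] / 2;

% Endpoints, assuming the center is the midpoint of the line
startPts = centers - half;
endPts   = centers + half;
```

Each segment can then be drawn after `hold on` with `plot([startPts(i, 1) endPts(i, 1)], [startPts(i, 2) endPts(i, 2)], 'k-')`.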
The following command generates 7 clusters with a total of 100 000 points. Optional parameters are used to override the defaults.

    [data, cp, idx] = generateData(0, pi / 16, 7, 25, 25, 25, 5, 1, 100000, ...
        'pointDist', 'norm', 'pointOffset', '1D', 'allowEmpty', true);

The generated clusters can be visualized with the same scatter command used in the previous example.
To make cluster generation reproducible, set the random number generator seed to a specific value (e.g. 123) before generating the data:

    rng(123);

For GNU Octave, use the following instructions instead:

    rand("state", 123);
    randn("state", 123);

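
For scripts that must run in both environments, the two variants can be combined. The sketch below detects Octave through the built-in OCTAVE_VERSION, a common idiom, and checks that reseeding reproduces the same random draws; the variable names are illustrative.

```matlab
seed = 123;

% OCTAVE_VERSION exists as a built-in only in GNU Octave
isOctave = exist('OCTAVE_VERSION', 'builtin') ~= 0;

draws = zeros(2, 6);
for k = 1:2
    % Reset the generators to the same state before each batch of draws
    if isOctave
        rand('state', seed);
        randn('state', seed);
    else
        rng(seed);
    end
    draws(k, :) = [rand(1, 3) randn(1, 3)];
end

% Both batches are identical because the seed was reset in between
sameDraws = isequal(draws(1, :), draws(2, :));
```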
Before v2.0.0, lines supporting clusters were parameterized with slopes instead of angles. We found this caused difficulties when choosing line orientation, thus the change to angles, which are much easier to work with. Version v1.3.0 still uses slopes, for those who prefer this behavior.
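
When porting older slope-based settings, a slope converts to an angle through the arctangent. The sketch below only shows that conversion; it makes no assumption about the signature of the older versions, and the slope values are examples.

```matlab
% Example slopes, including a vertical line (infinite slope)
slopes = [0; 1; -1; Inf];

% Angle in radians of a line with the given slope
angles = atan(slopes);   % 0, pi/4, -pi/4, pi/2
```

Because the arctangent is nonlinear, a normal distribution over slopes does not map to a normal distribution over angles, so a converted mean is only a rough starting point and the standard deviation has to be chosen anew.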
For reproducing results in studies published before May 2020, use version v1.2.0 instead. Subsequent versions were optimized in a way that changed the order in which the required random values are generated, thus producing slightly different results.
If you use this function in your work, please cite the following reference:
- Fachada, N., & Rosa, A. C. (2020). generateData—A 2D data generator. Software Impacts, 4:100017. doi: 10.1016/j.simpa.2020.100017
The MOCluGen toolbox extends generateData with arbitrary dimensions and statistical distributions. Therefore, generateData offers a limited subset of the functionality provided by MOCluGen, although it's probably simpler to use.
This script is made available under the MIT License.