About
Here's a synthetic multi-dimensional data set generator I've made. If you have any info or questions please contact me.
Quite a few times I have run into the situation where I needed several different sets of test data to get an idea of how a certain algorithm performs under different conditions.
I think this is quite a common problem because it is difficult to find and import natural data sets. It is even harder to find a nice spectrum of data sets with different qualities to do comparative analysis with.
It is also very important that any synthetic data set have the same kinds of attributes that natural sets do.
My friend Ciphergoth was working on a program to do searches within metric spaces and it seemed like he needed a good tool to see how his "new" algorithm fared against the other well-known ones.
The feeling that it could be useful to someone else finally motivated me to write it, even though I had been kicking the idea around in my head for some time. This program is the result.
Background
The program is written in C++ and is designed to be integrated into the code of your project. It is not designed to run in a standalone way (though the example application runs just fine). Please make sure to read the licensing conditions. This software is licensed under the GPL.
A few things to note:
-
There are a lot of comments and background in the code itself, so if you have any questions please take a good peek there first.
-
I'm const crazy - If you don't like it, then tough. I'm not some newbie who discovered C++ yesterday and is having a field day with the various keywords. The const keyword is one of the best ways there is to catch and prevent bugs. It also has the advantage of allowing any halfway intelligent compiler to do some excellent optimisation that it might not otherwise be able to. By using const temporary variables you can make your code easy to debug and easy for the compiler to optimise.
-
The code presented here is purposely not templated. I did this to allow for use in dynamic applications where the number of dimensions needed is only known at run time (for example, when it comes from user input or changes inside the for loop of a test suite).
-
During the initialisation of the "noise" function, the code uses culling-based methods to generate the gradients within a unit sphere of N dimensions. Since the ratio of the volume of a sphere to the volume of the cube that contains it drops dramatically as dimensionality increases, you will want to keep the number of dimensions relatively low. I have found that up to 13 dimensions works at reasonable speeds on a modern machine.
-
I would be very happy if someone knows of/invents an algorithm/code to directly generate random N dimensional vectors within a unit sphere (this seems possible since it is easily done in 2 and 3 dimensions by similar non-culling based techniques). Please contact me if you do.
-
When creating points within the spatial set I again use culling-based methods for the generation of points. This means that if your spatial structures are too sparse or fuzzy, generating valid points can take a very long time.
-
I would be very happy if someone knows of/invents an algorithm/code to directly generate random N dimensional vectors within the boundaries of the "noise" field, but suspect that this is a very difficult problem (since the whole point of the "noise" field is to create structured randomness). Please contact me if you do.
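For the unit sphere case in the notes above (not the "noise" field one), here is a minimal sketch of the two approaches side by side. It is not taken from datagen: the function names are hypothetical and `std::mt19937` is just an assumed random source. The first function is the culling idea (draw from the enclosing cube, retry until the point falls inside the sphere). The second is a well-known direct technique: a vector of independent standard normal values points in a uniformly random direction, and scaling the unit direction by u^(1/N), with u uniform in [0, 1), spreads the points uniformly through the interior.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper names for illustration; none of these come from datagen.

// Squared Euclidean length of v.
double squaredLength(const std::vector<double>& v) {
    double sum = 0.0;
    for (const double x : v) sum += x * x;
    return sum;
}

// Culling approach: draw a point from the cube [-1, 1]^N and retry until it
// lands inside the unit sphere. The expected number of tries grows very
// quickly with the number of dimensions.
std::vector<double> ballPointByCulling(const std::size_t dims, std::mt19937& rng,
                                       std::size_t* const triesOut = nullptr) {
    assert(dims > 0);
    std::uniform_real_distribution<double> cube(-1.0, 1.0);
    std::vector<double> p(dims);
    std::size_t tries = 0;
    do {
        for (double& x : p) x = cube(rng);
        ++tries;
    } while (squaredLength(p) > 1.0);
    if (triesOut != nullptr) *triesOut = tries;
    return p;
}

// Direct approach: normalise a vector of independent standard normals to get
// a uniformly random direction, then scale it by u^(1/N) so that the points
// fill the interior of the sphere uniformly.
std::vector<double> ballPointDirect(const std::size_t dims, std::mt19937& rng) {
    assert(dims > 0);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::vector<double> p(dims);
    double len = 0.0;
    do {
        for (double& x : p) x = gauss(rng);
        len = std::sqrt(squaredLength(p));
    } while (len == 0.0);  // only guards against an all-zero draw
    const double radius = std::pow(unit(rng), 1.0 / static_cast<double>(dims));
    const double scale = radius / len;
    for (double& x : p) x *= scale;
    return p;
}
```

Both functions take the dimension count as an ordinary run-time argument, in keeping with the non-templated design, so a call such as `ballPointDirect(13, rng)` works for any N chosen on the fly. The direct version does a fixed amount of work per point, whereas the culling version accepts only roughly 1 draw in 9,000 at 13 dimensions (the volume of the unit sphere divided by the volume of the enclosing cube).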
Download
datagen.zip (1,454 Kb).