SimMetrics

A Java library of similarity and distance metrics, e.g. Levenshtein distance and cosine similarity. All similarity metrics return normalized values (between 0 and 1) rather than unbounded similarity scores. Distance metrics return non-negative, unbounded scores.

Usage

For quick and easy use, StringMetrics and StringDistances provide a collection of well-known similarity and distance metrics.

String str1 = "This is a sentence. It is made of words";
String str2 = "This sentence is similar. It has almost the same words";

StringMetric metric = StringMetrics.cosineSimilarity();

float result = metric.compare(str1, str2); //0.4767

The StringMetricBuilder and StringDistanceBuilder are convenience tools for building string similarity and distance metrics. Any class implementing Metric or Distance can be used to build a string metric or string distance, respectively. The builders support simplification, tokenization, token filtering, token transformation, and caching. For usage see the examples section.

For a terse syntax, use import static org.simmetrics.builders.StringMetricBuilder.with;

String str1 = "This is a sentence. It is made of words";
String str2 = "This sentence is similar. It has almost the same words";

StringMetric metric =
		with(new CosineSimilarity<String>())
		.simplify(Simplifiers.toLowerCase(Locale.ENGLISH))
		.simplify(Simplifiers.replaceNonWord())
		.tokenize(Tokenizers.whitespace())
		.build();

float result = metric.compare(str1, str2); //0.5720
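The difference between the two results above (0.4767 and 0.5720) comes from the simplification steps. The following is a minimal, JDK-only sketch of cosine similarity over whitespace-separated token counts. It is not the library's implementation: the class CosineSketch and its methods are hypothetical names, and simplify is only an assumed stand-in for the toLowerCase and replaceNonWord steps.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class, not part of SimMetrics.
class CosineSketch {

    // Count how often each whitespace-separated token occurs.
    static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between the two token-count vectors.
    static double cosine(String a, String b) {
        Map<String, Integer> ca = tokenCounts(a);
        Map<String, Integer> cb = tokenCounts(b);
        if (ca.isEmpty() || cb.isEmpty()) {
            return 0.0; // convention chosen for this sketch only
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int count : cb.values()) {
            normB += count * count;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed stand-in for the two simplify steps: lower-case the text,
    // then replace every non-word character with a space.
    static String simplify(String text) {
        return text.toLowerCase(Locale.ENGLISH).replaceAll("\\W", " ");
    }

    public static void main(String[] args) {
        String str1 = "This is a sentence. It is made of words";
        String str2 = "This sentence is similar. It has almost the same words";
        System.out.println(cosine(str1, str2));
        System.out.println(cosine(simplify(str1), simplify(str2)));
    }
}
```

Without simplification, "sentence." and "sentence" are different tokens, so the two strings share fewer terms. After simplification they match, which raises the score from roughly 0.4767 to roughly 0.5720.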

Metrics that operate on lists, sets, or multisets are generic and can be used to compare collections of arbitrary elements. The elements in the collection must implement equals and hashCode.

Set<Integer> scores1 = new HashSet<>(asList(1, 1, 2, 3, 5, 8, 11, 19));
Set<Integer> scores2 = new HashSet<>(asList(1, 2, 4, 8, 16, 32, 64));

SetMetric<Integer> metric = new OverlapCoefficient<>();

float result = metric.compare(scores1, scores2); // 0.4285
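To see where 0.4285 comes from, the overlap coefficient can be worked out by hand: the size of the intersection divided by the size of the smaller set. The following JDK-only sketch assumes that standard definition. OverlapSketch is a hypothetical name and this is not the library's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class, not part of SimMetrics.
class OverlapSketch {

    // |A ∩ B| / min(|A|, |B|), relying on the elements' equals and hashCode.
    static <T> double overlap(Set<T> a, Set<T> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // convention chosen for this sketch only
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        // The duplicate 1 is dropped by the set, leaving 7 elements.
        Set<Integer> scores1 = new HashSet<>(Arrays.asList(1, 1, 2, 3, 5, 8, 11, 19));
        Set<Integer> scores2 = new HashSet<>(Arrays.asList(1, 2, 4, 8, 16, 32, 64));
        // The sets share 1, 2 and 8, so the result is 3 / 7.
        System.out.println(overlap(scores1, scores2));
    }
}
```

Both sets have seven distinct elements and share three of them, giving 3/7, which is approximately 0.4285.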

Unicode

Due to Java's Unicode character representations, some care must be taken when using string metrics that compare strings using char values. For example, using Smith-Waterman on texts written in Linear A will result in an unexpectedly high similarity, because every other char is the same high surrogate. Metrics that operate on lists, sets, or multisets, such as Cosine Similarity, are not affected.
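The effect can be reproduced with the JDK alone. The sketch below builds two Linear A strings that have no sign in common and counts position-wise matches, first by char and then by code point. SurrogateSketch and its methods are hypothetical names chosen for illustration, and the position-wise count is only a simple stand-in for a char-based metric, not Smith-Waterman itself.

```java
// Hypothetical illustration class, not part of SimMetrics.
class SurrogateSketch {

    // Build a String from Unicode code points.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Count positions where the UTF-16 char values are equal.
    static int matchingChars(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    // Count positions where the full code points are equal.
    static int matchingCodePoints(String a, String b) {
        int[] ca = a.codePoints().toArray();
        int[] cb = b.codePoints().toArray();
        int n = Math.min(ca.length, cb.length);
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (ca[i] == cb[i]) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Six different signs from the Linear A block (starting at U+10600).
        String a = fromCodePoints(0x10600, 0x10601, 0x10602);
        String b = fromCodePoints(0x10603, 0x10604, 0x10605);
        System.out.println(a.length());               // chars per string
        System.out.println(matchingChars(a, b));      // shared high surrogates
        System.out.println(matchingCodePoints(a, b)); // shared signs
    }
}
```

Each sign occupies two chars, and all six signs share the high surrogate 0xD801, so half of the char positions match even though no sign does.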
