zenkay/bigdata-ecosystem

Big Data Ecosystem Dataset

Incomplete-but-useful list of big-data related projects packed into a JSON dataset.

Main table: http://bigdata.andreamostosi.name
Raw JSON data: http://bigdata.andreamostosi.name/data.json
Original page on my blog: http://blog.andreamostosi.name/big-data/
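
A minimal sketch of how the raw JSON data could be consumed from Python. The record layout is not documented here, so the `name` and `category` fields, the `SAMPLE` payload, and the helper names `fetch_text`, `load_projects` and `group_by` are all hypothetical; adjust them to whatever data.json actually contains.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# URL taken from the list above.
DATA_URL = "http://bigdata.andreamostosi.name/data.json"

# Hypothetical stand-in for the payload; the real schema may differ.
SAMPLE = """[
  {"name": "Apache Hadoop", "category": "Frameworks"},
  {"name": "Apache Spark", "category": "Distributed Programming"},
  {"name": "Apache Pig", "category": "Distributed Programming"}
]"""


def fetch_text(url=DATA_URL):
    """Download the dataset and return it as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def load_projects(text):
    """Parse the JSON text into Python objects."""
    return json.loads(text)


def group_by(records, key):
    """Group a list of dict records by the value stored under `key`."""
    groups = defaultdict(list)
    for record in records:
        # Records without the key are collected under None, not dropped.
        groups[record.get(key)].append(record)
    return dict(groups)


if __name__ == "__main__":
    # Uses the local sample; swap in fetch_text() to read the live file.
    for category, items in group_by(load_projects(SAMPLE), "category").items():
        print(category, len(items))
```

Grouping by an assumed `category` field mirrors the section headings used in the rest of this list.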

Related projects:

Hadoop Ecosystem Table by Javi Roman
Awesome Big Data by Onur Akpolat
Awesome Awesomeness by Alexander Bayandin
Awesome Hadoop by Youngwoo Kim
Queues.io by Łukasz Strzałkowski

Frameworks

Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Distributed Programming

AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
Akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
AMPLab SIMR - run Spark on Hadoop MapReduce v1.
AMPLab Succinct - Enabling Queries on Compressed Data.
Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
Apache Flink - high-performance runtime with automatic program optimization.
Apache Gora - framework for in-memory data model and persistence.
Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Apache Pig - high level language to express data analysis programs for Hadoop.
Apache S4 - framework for stream processing, implementation of S4.
Apache Spark - framework for in-memory cluster computing.
Apache Spark Streaming - framework for stream processing, part of Spark.
Apache Storm - framework for stream processing by Twitter, also available on YARN.
Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
Cascalog - data processing and querying library.
Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
Concurrent Cascading - framework for data management/analytics on Hadoop.
Damballa Parkour - MapReduce library for Clojure.
Datasalt Pangool - alternative MapReduce paradigm.
DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
DistributedR - scalable high-performance platform for the R language.
eBay Oink - REST based interface for PIG execution.
Facebook Corona - Hadoop enhancement which removes single point of failure.
Facebook Peregrine - Map Reduce framework.
Facebook Scuba - distributed in-memory datastore.
Geotrellis - geographic data processing engine for high performance applications.
GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework.
Google Dataflow - create data pipelines to ingest, transform and analyze data.
Google MapReduce - map reduce framework.
Google MillWheel - fault tolerant stream processing framework.
HParser - data parsing transformation environment optimized for Hadoop.
IBM Streams - advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources.
JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
Kite - set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
Kryo - Java serialization and cloning: fast, efficient, automatic.
Lipstick - Pig workflow visualization tool.
Metamarkets Druid - framework for real-time analysis of large datasets.
Netflix Aegisthus - Bulk Data Pipeline out of Cassandra. Implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.
Netflix Lipstick - Pig Visualization framework.
Netflix Mantis - Event Stream Processing System.
Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
Netflix STAASH - language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems.
Netflix Zeno - Netflix's In-Memory Data Propagation Framework.
Nokia Disco - MapReduce framework developed by Nokia.
PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Pinterest Pinlater - asynchronous job execution system.
Pydoop - Python MapReduce and HDFS API for Hadoop.
ScaleOut hServer - fast, scalable in-memory data grid for Hadoop.
SeqPig - simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop.
SigmoidAnalytics Spork - Pig on Apache Spark.
SpatialHadoop - MapReduce extension to Apache Hadoop designed specifically to work with spatial data.
Spring for Apache Hadoop - unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive.
SQLStream Blaze - stream processing platform.
Stratio Streaming - the union of a real-time messaging bus with a complex event processing engine using Spark Streaming.
Stratosphere - general purpose cluster computing framework.
Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
Teradata QueryGrid - data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop.
TIBCO ActiveSpaces - in-memory data grid.
Torch - Scientific computing for LuaJIT.
Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
Twitter TSAR - TimeSeries AggregatoR by Twitter.

Distributed Filesystem

Apache HDFS - a way to store large files across multiple machines.
BeeGFS - formerly FhGFS, parallel distributed file system.
Ceph Filesystem - software storage platform.
Disco DDFS - distributed filesystem.
Facebook Haystack - object storage system.
Google Colossus - distributed filesystem (GFS2).
Google GFS - distributed filesystem.
Google Megastore - scalable, highly available storage.
GridGain - GGFS, Hadoop compliant in-memory file system.
HDFS-DU - interactive visualization of the Hadoop distributed file system.
Lustre file system - high-performance distributed filesystem.
Netflix S3mper - library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
Quantcast File System QFS - open-source distributed file system.
Red Hat GlusterFS - scale-out network-attached storage file system.
Tachyon - reliable file sharing at memory speed across cluster frameworks.

Key-Map Data Model

Actian Vector - column-oriented analytic database.
Apache Accumulo - distributed key/value store, built on Hadoop.
Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
Apache HBase - column-oriented distributed datastore, inspired by BigTable.
Facebook HydraBase - evolution of HBase made by Facebook.
Google BigTable - column-oriented distributed datastore.
Google Cloud Datastore - fully managed, schemaless database for storing non-relational data over BigTable.
Hypertable - column-oriented distributed datastore, inspired by BigTable.
InfiniDB - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
Netflix Priam - Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra.
OhmData C5 - improved version of HBase.
Sqrrl - NoSQL databases on top of Apache Accumulo.
Tephra - Transactions for HBase.
Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.

Document Data Model

Actian Versant - commercial object-oriented database management system.
Crate Data - open source, massively scalable data store. It requires zero administration.
Facebook Apollo - Facebook’s Paxos-like NoSQL database.
jumboDB - document oriented datastore over Hadoop.
LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
Microsoft DocumentDB - fully-managed, highly-scalable, NoSQL document database service.
MongoDB - Document-oriented database system.
RavenDB - A transactional, open-source Document Database.
RethinkDB - document database that supports queries like table joins and group by.
TokuMX - High-Performance MongoDB Distribution.

Key-value Data Model

Aerospike - NoSQL flash-optimized, in-memory datastore. Open source, with "server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies".
Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
Edis - protocol-compatible server replacement for Redis.
ElephantDB - Distributed database specialized in exporting data from Hadoop.
EventStore - distributed time series database.
HyperDex - next generation key-value store.
LinkedIn Krati - simple persistent data store with very low latency and high throughput.
LinkedIn Voldemort - distributed key/value storage system.
Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
Redis - in-memory key-value datastore.
Redis Sentinel - system designed to help manage Redis instances.
Riak - a decentralized datastore.
Storehaus - library to work with asynchronous key value stores, by Twitter.
Tarantool - an efficient NoSQL database and a Lua application server.
TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.

Graph Data Model

Apache Giraph - implementation of Pregel, based on Hadoop.
Apache Spark Bagel - implementation of Pregel, part of Spark.
ArangoDB - multi-model distributed database.
Facebook TAO - distributed data store widely used at Facebook to store and serve the social graph.
Faunus - Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.
Google Cayley - open-source graph database.
Google Pregel - graph processing framework.
GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
GraphX - resilient Distributed Graph System on Spark.
Gremlin - graph traversal Language.
InfiniteGraph - distributed graph database.
Infovore - RDF-centric Map/Reduce framework.
Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
MapGraph - Massively Parallel Graph processing on GPUs.
Neo4j - graph database written entirely in Java.
OrientDB - document and graph database.
Phoebus - framework for large scale graph processing.
Sparksee - scalable high-performance graph database.
Titan - distributed graph database, built over Cassandra.
Twitter FlockDB - distributed graph database.

NewSQL Databases

Actian Ingres - commercially supported, open-source SQL relational database management system.
BayesDB - statistic oriented SQL database.
Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
FoundationDB - distributed database, inspired by F1.
Google F1 - distributed SQL database built on Spanner.
Google Spanner - globally distributed semi-relational database.
H-Store - experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
HandlerSocket - NoSQL plugin for MySQL/MariaDB.
IBM DB2 - object-relational database management system.
InfiniSQL - infinitely scalable RDBMS.
MemSQL - in-memory SQL database with optimized columnar storage on flash.
NuoDB - SQL/ACID compliant distributed database.
Oracle Database - object-relational database management system.
Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
SAP HANA - in-memory, column-oriented, relational database management system.
SenseiDB - distributed, realtime, semi-structured database.
Sky - database used for flexible, high performance analysis of behavioral data.
SymmetricDS - open source software for both file and database synchronization.
Teradata Database - complete relational database management system.
VoltDB - in-memory NewSQL database.

Columnar Databases

Amazon RedShift - data warehouse service, based on PostgreSQL.
C-Store - column oriented DBMS.
Google BigQuery - framework for interactive analysis, implementation of Dremel.
Google Dremel - framework for interactive analysis, implementation of Dremel.
MonetDB - column store database.
Parquet - columnar storage format for Hadoop.
Pivotal Greenplum - purpose-built, dedicated analytic data warehouse.
Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

Time-Series Databases

Cube - uses MongoDB to store time series data.
InfluxDB - distributed time series database.
KairosDB - similar to OpenTSDB but allows for Cassandra.
OpenTSDB - distributed time series database on top of HBase.

SQL-like processing

Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
AMPLab Shark - data warehouse system for Spark.
Apache Drill - framework for interactive analysis, inspired by Dremel.
Apache HCatalog - table and storage management layer for Hadoop.
Apache Hive - SQL-like data warehouse system for Hadoop.
Apache Optiq - framework that allows efficient translation of queries involving heterogeneous and federated data.
Apache Phoenix - SQL skin over HBase.
BlinkDB - massively parallel, approximate query engine.
Cloudera Impala - framework for interactive analysis, inspired by Dremel.
Concurrent Lingual - SQL-like query language for Cascading.
Datasalt Splout SQL - full SQL query engine for big datasets.
Facebook PrestoDB - distributed SQL query engine.
JethroData - index-based SQL engine for Hadoop.
Metanautix Quest - data compute engine.
Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
Spark Catalyst - query optimization framework for Spark and Shark.
SparkSQL - Manipulating Structured Data Using Spark.
Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
Stinger - interactive query for Hive.
Tajo - distributed data warehouse system on Hadoop.
Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

Integrated Development Environments

RStudio - IDE for R.

Data Ingestion

Amazon Kinesis - real-time processing of streaming data at massive scale.
Apache Chukwa - data collection system.
Apache Flume - service to manage large amounts of log data.
Apache Samza - stream processing framework, based on Kafka and YARN.
Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
Apache UIMA - Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.
Cloudera Morphlines - framework that helps ETL to Solr, HBase and HDFS.
Facebook Scribe - streamed log data aggregator.
Fluentd - tool to collect events and logs.
Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
Heka - open source stream processing software system.
HIHO - framework for connecting disparate data sources with Hadoop.
LinkedIn Databus - stream of change capture events for a database.
LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
LinkedIn White Elephant - log aggregator and dashboard.
Logstash - a tool for managing events and logs.
Netflix Suro - data pipeline service for collecting, aggregating, and dispatching large volumes of application events including log data, based on Chukwa.
Pinterest Secor - service implementing Kafka log persistence.
Record Breaker - Automatic structure for your text-formatted data.
TIBCO Enterprise Message Service - standards-based messaging middleware.
Twitter Zipkin - distributed tracing system that helps us gather timing data for all the disparate services at Twitter.
Vibe Data Stream - streaming data collection for real-time Big Data analytics.

Message-oriented middleware

ActiveMQ - open source messaging and Integration Patterns server.
Amazon Simple Queue Service - fast, reliable, scalable, fully managed queue service.
Apache Kafka - distributed publish-subscribe messaging system.
Apache Qpid - messaging tools that speak AMQP and support many languages and platforms.
Apollo - ActiveMQ's next generation of messaging.
Beanstalkd - simple, fast work queue.
Bit.ly NSQ - realtime distributed message processing at scale.
Celery - Distributed Task Queue.
Crossroads I/O - library for building scalable and high performance distributed applications.
Darner - simple, lightweight message queue.
Gearman - Job Server.
HornetQ - open source project to build a multi-protocol, embeddable, very high performance, clustered, asynchronous messaging system.
IronMQ - easy-to-use highly available message queuing service.
Kestrel - distributed message queue system.
Marconi - queuing and notification service made by and for OpenStack, but not only for it.
RabbitMQ - Robust messaging for applications.
RestMQ - message queue which uses HTTP as transport, JSON to format a minimalist protocol and is organized as REST resources.
RQ - simple Python library for queueing jobs and processing them in the background with workers.
Sidekiq - Simple, efficient background processing for Ruby.
ZeroMQ - The Intelligent Transport Layer.

Service Programming

Akka Toolkit - runtime for distributed and fault-tolerant event-driven applications on the JVM.
Apache Avro - data serialization system.
Apache Curator - Java libraries for Apache ZooKeeper.
Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
Apache Thrift - framework to build binary protocols.
Apache Zookeeper - centralized service for process management.
Google Chubby - a lock service for loosely-coupled distributed systems.
LinkedIn Norbert - cluster manager.
MPICH - high performance and widely portable implementation of the Message Passing Interface (MPI) standard.
OpenMPI - message passing framework.
Serf - decentralized solution for service discovery and orchestration.
Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
Twitter Elephant Bird - libraries for working with LZOP-compressed data.
Twitter Finagle - asynchronous network stack for the JVM.

Scheduling

Apache Aurora - service scheduler that runs on top of Apache Mesos.
Apache Falcon - data management framework.
Apache Oozie - workflow job scheduler.
Chronos - distributed and fault-tolerant scheduler.
LinkedIn Azkaban - batch workflow job scheduler.
Pinterest Pinball - customizable platform for creating workflow managers.
Sparrow - scheduling platform.

Machine Learning

Apache Mahout - machine learning library for Hadoop.
Ayasdi Core - tool for topological data analysis.
brain - Neural networks in JavaScript.
Cloudera Oryx - real-time large-scale machine learning.
Concurrent Pattern - machine learning library for Cascading.
convnetjs - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
Decider - Flexible and Extensible Machine Learning in Ruby.
etcML - text classification with machine learning.
Etsy Conjecture - scalable Machine Learning in Scalding.
Google Sibyl - System for Large Scale Machine Learning at Google.
H2O - statistical, machine learning and math runtime for Hadoop.
IBM Watson - cognitive computing system.
MLbase - distributed machine learning libraries for the BDAS stack.
MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
scikit-learn - machine learning in Python.
Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
Sparkling Water - combine H2O's Machine Learning capabilities with the power of the Spark platform.
Varaha - Machine learning and natural language processing with Apache Pig.
Viv - global platform that enables developers to plug into and create an intelligent, conversational interface to anything.
Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
WEKA - suite of machine learning software.
Wit - Natural Language for the Internet of Things.
Wolfram Alpha - computational knowledge engine.

Benchmarking

Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
Berkeley SWIM Benchmark - real-world big data workload benchmark.
Big-Bench - Big Bench Workload Development.
Hive-benchmarks - some benchmarking queries for Apache Hive.
Hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
Intel HiBench - a Hadoop benchmark suite.
Netflix Inviso - performance focused Big Data tool.
PUMA Benchmarking - benchmark suite for MapReduce applications.
Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.

Security

Apache Argus - framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
Apache Knox Gateway - single point of secure access for Hadoop clusters.
Apache Sentry - security module for data stored in Hadoop.
PacketPig - Open Source Big Data Security Analytics.
Voltage SecureData - data protection framework.

System Deployment

Ankush - A big data cluster management tool that creates and manages clusters of different technologies.
Apache Ambari - operational framework for Hadoop management.
Apache Bigtop - system deployment framework for the Hadoop ecosystem.
Apache Helix - cluster management framework.
Apache Mesos - cluster manager.
Apache Slider - YARN application to deploy existing distributed applications on YARN.
Apache Whirr - set of libraries for running cloud services.
Apache YARN - Cluster manager.
Brooklyn - library that simplifies application deployment and management.
Buildoop - similar to Apache Bigtop, based on the Groovy language.
Cloudera HUE - web application for interacting with Hadoop.
Deimos - Mesos containerizer hooks for Docker.
Develoop - tool for provisioning, managing and monitoring Apache Hadoop.
Facebook Autoscale - the load balancer will concentrate workload to a server until it has at least a medium-level workload.
Facebook Prism - multi-datacenter replication system.
Ganglia Monitoring System - scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
Genie - provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
Google Borg - job scheduling and monitoring system.
Google Omega - job scheduling and monitoring system.
Hannibal - tool to help monitor and maintain HBase clusters that are configured for manual splitting.
Hortonworks HOYA - application that can deploy HBase cluster on YARN.
Jumbune - open-source product built for analyzing Hadoop clusters and MapReduce jobs.
Marathon - Mesos framework for long-running services.

Applications

Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
Apache Nutch - open source web crawler.
Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
Apache Tika - content analysis toolkit.
Domino - Run, scale, share, and deploy models without any infrastructure.
Eclipse BIRT - Eclipse-based reporting system.
Eventhub - open source event analytics platform.
HIPI Library - API for performing image processing tasks on Hadoop's MapReduce.
Hunk - Splunk analytics for Hadoop.
MADlib - data-processing library of an RDBMS to analyze data.
PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
Qubole - auto-scaling Hadoop cluster, built-in data connectors.
Sense - Cloud Platform for Data Science and Big Data Analytics.
Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
SparkR - R frontend for Spark.
Splunk - analyzer for machine-generated data.
Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.

Data Warehouse

Google Mesa - highly scalable analytic data warehousing system.
IBM BigInsights - data processing, warehousing and analytics.
Microsoft Cosmos - Microsoft's internal BigData analysis platform.

Search engine and framework

Apache Lucene - Search engine library.
Apache Solr - Search platform for Apache Lucene.
ElasticSearch - Search and analytics engine based on Apache Lucene.
Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
Enigma.io - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
Facebook Unicorn - social graph search platform.
Google Caffeine - continuous indexing system.
Google Percolator - continuous indexing system.
TeraGoogle - large search index.
Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
HBase Coprocessor - implementation of Percolator, part of HBase.
hIndex - Secondary Index for HBase.
Lily HBase Indexer - quickly and easily search for any content stored in HBase.
LinkedIn Bobo - faceted search implementation written purely in Java, an extension to Apache Lucene.
LinkedIn Cleo - flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
LinkedIn Galene - search architecture at LinkedIn.
LinkedIn Zoie - realtime search/indexing system written in Java.
Sphinx Search Server - fulltext search engine.

MySQL forks and evolutions

Amazon RDS - MySQL databases in Amazon's cloud.
Drizzle - evolution of MySQL 6.0.
Google Cloud SQL - MySQL databases in Google's cloud.
MariaDB - enhanced, drop-in replacement for MySQL.
MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
Percona Server - enhanced, drop-in replacement for MySQL.
ProxySQL - High Performance Proxy for MySQL.
TokuDB - storage engine for MySQL and MariaDB.
WebScaleSQL - collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

PostgreSQL forks and evolutions

HadoopDB - hybrid of MapReduce and DBMS.
IBM Netezza - high-performance data warehouse appliances.
Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.

Memcached forks and evolutions

Facebook McDipper - key/value cache for flash storage.
Facebook Memcached - fork of Memcached.
Twemproxy - A fast, light-weight proxy for memcached and redis.
Twitter Fatcache - key/value cache for flash storage.
Twitter Twemcache - fork of Memcached.

Embedded Databases

Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
HamsterDB - transactional key-value database.
HanoiDB - Erlang LSM BTree Storage.
LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

ActivePivot - Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing.
Adatao - business intelligence and data science platform.
Apama analytics - platform for streaming analytics and intelligent automated action.
Atigeo xPatterns - data analytics platform.
BIME Analytics - business intelligence platform in the cloud.
Chartio - lean business intelligence platform to visualize and explore your data.
Datapine - self-service business intelligence tool in the cloud.
Jaspersoft - powerful business intelligence suite.
Jedox Palo - customisable Business Intelligence platform.
Microsoft - business intelligence software and platform.
Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
Pentaho - business intelligence platform.
Qlik - business intelligence and analytics platform.
SpagoBI - open source business intelligence platform.
Spotfire - business intelligence platform.
Tableau - business intelligence platform.
Teradata Aster - Big Data Analytics.
Tessera - Environment for Deep Analysis of Large Complex Data.
Zeppelin - open source data analysis environment on top of Hadoop.
Zoomdata - Big Data Analytics.

Data Visualization

Arbor - graph visualization library using web workers and jQuery.
CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
Chart.js - open source HTML5 Charts visualizations.
Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
Cubism - JavaScript library for time series visualization.
Cytoscape - JavaScript library for visualizing complex networks.
D3 - JavaScript library for manipulating documents.
DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
Envisionjs - dynamic HTML5 visualization.
Freeboard - open source real-time dashboard builder for IoT and other web mashups.
Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections.
Google Charts - simple charting API.
Grafana - graphite dashboard frontend, editor and graph composer.
Graphite - scalable Realtime Graphing.
Highcharts - simple and flexible charting API.
IPython - provides a rich architecture for interactive computing.
Keylines - toolkit for visualizing the networks in your data.
Matplotlib - plotting with Python.
NVD3 - chart components for d3.js.
Peity - Progressive SVG bar, line and pie charts.
Plot.ly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
Recline - simple but powerful library for building data applications in pure Javascript and HTML.
Redash - open-source platform to query and visualize data.
Sigma.js - JavaScript library dedicated to graph drawing.
Vega - a visualization grammar.

Internet of things and sensor data

TempoIQ - Cloud-based sensor analytics.

Papers

2014

2014 - 3D Object Manipulation in a Single Photograph using Stock 3D Models
2014 - A Partitioning Framework for Aggressive Data Skipping
2014 - DeepFace: Closing the Gap to Human-Level Performance in Face Verification
2014 - Fastpass: A Centralized "Zero-Queue" Datacenter Network
2014 - In Search of an Understandable Consensus Algorithm
2014 - Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases
2014 - MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
2014 - Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
2014 - Orca: A Modular Query Optimizer Architecture for Big Data
2014 - Pigeon: A Spatial MapReduce Language

2013

2013 - A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data
2013 - CG_Hadoop: Computational Geometry in MapReduce
2013 - Druid: A Real-time Analytical Data Store
2013 - Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask
2013 - F1: A Distributed SQL Database That Scales
2013 - GraphX: A Resilient Distributed Graph System on Spark
2013 - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
2013 - MillWheel: Fault-Tolerant Stream Processing at Internet Scale
2013 - MLbase: A Distributed Machine-learning System
2013 - Online, Asynchronous Schema Change in F1
2013 - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
2013 - Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
2013 - Scalable Progressive Analytics on Big Data in the Cloud
2013 - Scaling Memcache at Facebook
2013 - Scuba: Diving into Data at Facebook
2013 - Shark: SQL and Rich Analytics at Scale
2013 - Unicorn: A System for Searching the Social Graph

2012

2012 - A Few Useful Things to Know about Machine Learning
2012 - Blink and It's Done: Interactive Queries on Very Large Data
2012 - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
2012 - Dimension Independent Similarity Computation
2012 - Fast and Interactive Analytics over Hadoop Data with Spark
2012 - ImageNet Classification with Deep Convolutional Neural Networks
2012 - Large-Scale Machine Learning at Twitter
2012 - Paxos Made Parallel
2012 - Paxos Replicated State Machines as the Basis of a High-Performance Data Store
2012 - Processing a Trillion Cells per Mouse Click
2012 - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
2012 - Spanner: Google's Globally-Distributed Database
2012 - The Unified Logging Infrastructure for Data Analytics at Twitter
2012 - The Vertica Analytic Database: C-Store 7 Years Later

2011

2011 - CrowdDB: Answering Queries with Crowdsourcing
2011 - CrowdDB: Query Processing with the VLDB Crowd
2011 - Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
2011 - Matching Unstructured Product Offers to Structured Product Specifications
2011 - Megastore: Providing Scalable, Highly Available Storage for Interactive Services
2011 - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
2011 - Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters

2010

2010 - Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
2010 - Dremel: Interactive Analysis of Web-Scale Datasets
2010 - Finding a needle in Haystack: Facebook's photo storage
2010 - FlumeJava: Easy, Efficient Data-Parallel Pipelines
2010 - Large-scale Incremental Processing Using Distributed Transactions and Notifications
2010 - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
2010 - Pregel: A System for Large-Scale Graph Processing
2010 - S4: Distributed Stream Computing Platform
2010 - Spark: Cluster Computing with Working Sets
2010 - ZooKeeper: Wait-free coordination for Internet-scale systems

2009

2009 - Cassandra - A Decentralized Structured Storage System
2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
2009 - Vertical Paxos and Primary-Backup Replication

2008

2008 - Chukwa: A large-scale monitoring system
2008 - Column-Stores vs. Row-Stores: How Different Are They Really?
2008 - PNUTS: Yahoo!'s Hosted Data Serving Platform
2008 - Top 10 algorithms in data mining

2007

2007 - Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
2007 - Dynamo: Amazon's Highly Available Key-value Store
2007 - Life beyond Distributed Transactions: an Apostate's Opinion
2007 - Paxos Made Live - An Engineering Perspective

2006

2006 - Bigtable: A Distributed Storage System for Structured Data
2006 - Ceph: A Scalable, High-Performance Distributed File System
2006 - Map-Reduce for Machine Learning on Multicore
2006 - The Chubby lock service for loosely-coupled distributed systems

2005

2005 - Fast Paxos

2004

2004 - Cheap Paxos
2004 - MapReduce: Simplified Data Processing on Large Clusters

2003

2003 - The Google File System

2002

2002 - Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

2001

2001 - Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
2001 - Paxos Made Simple
2001 - Random Forests

1999

1999 - Pasting Small Votes for Classification in Large Databases and On-Line
1999 - The PageRank Citation Ranking: Bringing Order to the Web
