Incomplete-but-useful list of big-data related projects packed into a JSON dataset.
- Main table: http://bigdata.andreamostosi.name
- Raw JSON data: http://bigdata.andreamostosi.name/data.json
- Original page on my blog: http://blog.andreamostosi.name/big-data/
Related projects:
- Hadoop Ecosystem Table by Javi Roman
- Awesome Big Data by Onur Akpolat
- Awesome Awesomeness by Alexander Bayandin
- Awesome Hadoop by Youngwoo Kim
- Queues.io by Łukasz Strzałkowski
- Frameworks
- Distributed Programming
- Distributed Filesystem
- Key-Map Data Model
- Document Data Model
- Key-value Data Model
- Graph Data Model
- NewSQL Databases
- Columnar Databases
- Time-Series Databases
- SQL-like processing
- Integrated Development Environments
- Data Ingestion
- Message-oriented middleware
- Service Programming
- Scheduling
- Machine Learning
- Benchmarking
- Security
- System Deployment
- Applications
- Data Warehouse
- Search engine and framework
- MySQL forks and evolutions
- PostgreSQL forks and evolutions
- Memcached forks and evolutions
- Embedded Databases
- Business Intelligence
- Data Visualization
- Internet of things and sensor data
- Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
- AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
- Akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
- AMPLab SIMR - run Spark on Hadoop MapReduce v1.
- AMPLab Succinct - Enabling Queries on Compressed Data.
- Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
- Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
- Apache Flink - high-performance runtime, and automatic program optimization.
- Apache Gora - framework for in-memory data model and persistence.
- Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache S4 - framework for stream processing, implementation of S4.
- Apache Spark - framework for in-memory cluster computing.
- Apache Spark Streaming - framework for stream processing, part of Spark.
- Apache Storm - framework for stream processing by Twitter, also available on YARN.
- Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- Concurrent Cascading - framework for data management/analytics on Hadoop.
- Damballa Parkour - MapReduce library for Clojure.
- Datasalt Pangool - alternative MapReduce paradigm.
- DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
- DistributedR - scalable high-performance platform for the R language.
- eBay Oink - REST-based interface for Pig execution.
- Facebook Corona - Hadoop enhancement which removes single point of failure.
- Facebook Peregrine - Map Reduce framework.
- Facebook Scuba - distributed in-memory datastore.
- Geotrellis - geographic data processing engine for high performance applications.
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework.
- Google Dataflow - create data pipelines that help ingest, transform and analyze data.
- Google MapReduce - map reduce framework.
- Google MillWheel - fault tolerant stream processing framework.
- HParser - data parsing transformation environment optimized for Hadoop.
- IBM Streams - advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources.
- JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
- Kite - set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Kryo - Java serialization and cloning: fast, efficient, automatic.
- Lipstick - Pig workflow visualization tool.
- Metamarkets Druid - framework for real-time analysis of large datasets.
- Netflix Aegisthus - Bulk Data Pipeline out of Cassandra. Implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.
- Netflix Lipstick - Pig Visualization framework.
- Netflix Mantis - Event Stream Processing System.
- Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
- Netflix STAASH - language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems.
- Netflix Zeno - Netflix's In-Memory Data Propagation Framework.
- Nokia Disco - MapReduce framework developed by Nokia.
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
- Pinterest Pinlater - asynchronous job execution system.
- Pydoop - Python MapReduce and HDFS API for Hadoop.
- ScaleOut hServer - fast, scalable in-memory data grid for Hadoop.
- SeqPig - simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop.
- SigmoidAnalytics Spork - Pig on Apache Spark.
- SpatialHadoop - MapReduce extension to Apache Hadoop designed specifically to work with spatial data.
- Spring for Apache Hadoop - unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive.
- SQLStream Blaze - stream processing platform.
- Stratio Streaming - the union of a real-time messaging bus with a complex event processing engine using Spark Streaming.
- Stratosphere - general purpose cluster computing framework.
- Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
- Teradata QueryGrid - data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop.
- TIBCO ActiveSpaces - in-memory data grid.
- Torch - Scientific computing for LuaJIT.
- Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
- Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
- Twitter TSAR - TimeSeries AggregatoR by Twitter.
- Apache HDFS - a way to store large files across multiple machines.
- BeeGFS - formerly FhGFS, parallel distributed file system.
- Ceph Filesystem - software storage platform.
- Disco DDFS - distributed filesystem.
- Facebook Haystack - object storage system.
- Google Colossus - distributed filesystem (GFS2).
- Google GFS - distributed filesystem.
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- HDFS-DU - interactive visualization of the Hadoop distributed file system.
- Lustre file system - high-performance distributed filesystem.
- Netflix S3mper - library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
- Quantcast File System QFS - open-source distributed file system.
- Red Hat GlusterFS - scale-out network-attached storage file system.
- Tachyon - reliable file sharing at memory speed across cluster frameworks.
- Actian Vector - column-oriented analytic database.
- Apache Accumulo - distributed key/value store, built on Hadoop.
- Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
- Apache HBase - column-oriented distributed datastore, inspired by BigTable.
- Facebook HydraBase - evolution of HBase made by Facebook.
- Google BigTable - column-oriented distributed datastore.
- Google Cloud Datastore - fully managed, schemaless database for storing non-relational data over BigTable.
- Hypertable - column-oriented distributed datastore, inspired by BigTable.
- InfiniDB - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
- Netflix Priam - Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra.
- OhmData C5 - improved version of HBase.
- Sqrrl - NoSQL databases on top of Apache Accumulo.
- Tephra - Transactions for HBase.
- Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
- Actian Versant - commercial object-oriented database management system.
- Crate Data - open source, massively scalable data store that requires zero administration.
- Facebook Apollo - Facebook’s Paxos-like NoSQL database.
- jumboDB - document oriented datastore over Hadoop.
- LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
- MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
- Microsoft DocumentDB - fully-managed, highly-scalable, NoSQL document database service.
- MongoDB - Document-oriented database system.
- RavenDB - A transactional, open-source Document Database.
- RethinkDB - document database that supports queries like table joins and group by.
- TokuMX - High-Performance MongoDB Distribution.
- Aerospike - NoSQL, flash-optimized, in-memory database. Open source, with "server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
- Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
- Edis - protocol-compatible server replacement for Redis.
- ElephantDB - Distributed database specialized in exporting data from Hadoop.
- EventStore - distributed time series database.
- HyperDex - next generation key-value store.
- LinkedIn Krati - simple persistent data store with very low latency and high throughput.
- LinkedIn Voldemort - distributed key/value storage system.
- Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
- Redis - in-memory key-value datastore.
- Redis Sentinel - system designed to help manage Redis instances.
- Riak - a decentralized datastore.
- Storehaus - library to work with asynchronous key value stores, by Twitter.
- Tarantool - an efficient NoSQL database and a Lua application server.
- TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.
- Apache Giraph - implementation of Pregel, based on Hadoop.
- Apache Spark Bagel - implementation of Pregel, part of Spark.
- ArangoDB - multi-model distributed database.
- Facebook TAO - distributed data store widely used at Facebook to store and serve the social graph.
- Faunus - Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.
- Google Cayley - open-source graph database.
- Google Pregel - graph processing framework.
- GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
- GraphX - resilient Distributed Graph System on Spark.
- Gremlin - graph traversal Language.
- InfiniteGraph - distributed graph database.
- Infovore - RDF-centric Map/Reduce framework.
- Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
- MapGraph - Massively Parallel Graph processing on GPUs.
- Neo4j - graph database written entirely in Java.
- OrientDB - document and graph database.
- Phoebus - framework for large scale graph processing.
- Sparksee - scalable high-performance graph database.
- Titan - distributed graph database, built over Cassandra.
- Twitter FlockDB - distributed graph database.
- Actian Ingres - commercially supported, open-source SQL relational database management system.
- BayesDB - statistic oriented SQL database.
- Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
- Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
- FoundationDB - distributed database, inspired by F1.
- Google F1 - distributed SQL database built on Spanner.
- Google Spanner - globally distributed semi-relational database.
- H-Store - experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
- HandlerSocket - NoSQL plugin for MySQL/MariaDB.
- IBM DB2 - object-relational database management system.
- InfiniSQL - infinitely scalable RDBMS.
- MemSQL - in-memory SQL database with optimized columnar storage on flash.
- NuoDB - SQL/ACID compliant distributed database.
- Oracle Database - object-relational database management system.
- Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
- Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
- SAP HANA - in-memory, column-oriented, relational database management system.
- SenseiDB - distributed, realtime, semi-structured database.
- Sky - database used for flexible, high performance analysis of behavioral data.
- SymmetricDS - open source software for both file and database synchronization.
- Teradata Database - complete relational database management system.
- VoltDB - in-memory NewSQL database.
- Amazon RedShift - data warehouse service, based on PostgreSQL.
- C-Store - column oriented DBMS.
- Google BigQuery - framework for interactive analysis, implementation of Dremel.
- Google Dremel - framework for interactive analysis.
- MonetDB - column store database.
- Parquet - columnar storage format for Hadoop.
- Pivotal Greenplum - purpose-built, dedicated analytic data warehouse.
- Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
- Cube - uses MongoDB to store time series data.
- InfluxDB - distributed time series database.
- KairosDB - similar to OpenTSDB but supports Cassandra.
- OpenTSDB - distributed time series database on top of HBase.
- Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
- AMPLAB Shark - data warehouse system for Spark.
- Apache Drill - framework for interactive analysis, inspired by Dremel.
- Apache HCatalog - table and storage management layer for Hadoop.
- Apache Hive - SQL-like data warehouse system for Hadoop.
- Apache Optiq - framework that allows efficient translation of queries involving heterogeneous and federated data.
- Apache Phoenix - SQL skin over HBase.
- BlinkDB - massively parallel, approximate query engine.
- Cloudera Impala - framework for interactive analysis, Inspired by Dremel.
- Concurrent Lingual - SQL-like query language for Cascading.
- Datasalt Splout SQL - full SQL query engine for big datasets.
- Facebook PrestoDB - distributed SQL query engine.
- JethroData - index-based SQL engine for Hadoop.
- Metanautix Quest - data compute engine.
- Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
- RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
- Spark Catalyst - query optimization framework for Spark and Shark.
- SparkSQL - Manipulating Structured Data Using Spark.
- Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
- Stinger - interactive query for Hive.
- Tajo - distributed data warehouse system on Hadoop.
- Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
- RStudio - IDE for R.
- Amazon Kinesis - real-time processing of streaming data at massive scale.
- Apache Chukwa - data collection system.
- Apache Flume - service to manage large amounts of log data.
- Apache Samza - stream processing framework, based on Kafka and YARN.
- Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
- Apache UIMA - framework for Unstructured Information Management applications: software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.
- Cloudera Morphlines - framework that helps with ETL into Solr, HBase and HDFS.
- Facebook Scribe - streamed log data aggregator.
- Fluentd - tool to collect events and logs.
- Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
- Heka - open source stream processing software system.
- HIHO - framework for connecting disparate data sources with Hadoop.
- LinkedIn Databus - stream of change capture events for a database.
- LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
- LinkedIn White Elephant - log aggregator and dashboard.
- Logstash - a tool for managing events and logs.
- Netflix Suro - data pipeline service for collecting, aggregating, and dispatching large volumes of application events, including log data; based on Chukwa.
- Pinterest Secor - service implementing Kafka log persistence.
- Record Breaker - Automatic structure for your text-formatted data.
- TIBCO Enterprise Message Service - standards-based messaging middleware.
- Twitter Zipkin - distributed tracing system that helps us gather timing data for all the disparate services at Twitter.
- Vibe Data Stream - streaming data collection for real-time Big Data analytics.
- ActiveMQ - open source messaging and Integration Patterns server.
- Amazon Simple Queue Service - fast, reliable, scalable, fully managed queue service.
- Apache Kafka - distributed publish-subscribe messaging system.
- Apache Qpid - messaging tools that speak AMQP and support many languages and platforms.
- Apollo - ActiveMQ's next generation of messaging.
- Beanstalkd - simple, fast work queue.
- Bit.ly NSQ - realtime distributed message processing at scale.
- Celery - Distributed Task Queue.
- Crossroads I/O - library for building scalable and high performance distributed applications.
- Darner - simple, lightweight message queue.
- Gearman - Job Server.
- HornetQ - open source project to build a multi-protocol, embeddable, very high performance, clustered, asynchronous messaging system.
- IronMQ - easy-to-use highly available message queuing service.
- Kestrel - distributed message queue system.
- Marconi - queuing and notification service made by and for OpenStack, but not only for it.
- RabbitMQ - Robust messaging for applications.
- RestMQ - message queue which uses HTTP as transport, JSON to format a minimalist protocol and is organized as REST resources.
- RQ - simple Python library for queueing jobs and processing them in the background with workers.
- Sidekiq - Simple, efficient background processing for Ruby.
- ZeroMQ - The Intelligent Transport Layer.
- Akka Toolkit - runtime for distributed and fault-tolerant event-driven applications on the JVM.
- Apache Avro - data serialization system.
- Apache Curator - Java libraries for Apache ZooKeeper.
- Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
- Apache Thrift - framework to build binary protocols.
- Apache ZooKeeper - centralized service for process management.
- Google Chubby - a lock service for loosely-coupled distributed systems.
- LinkedIn Norbert - cluster manager.
- MPICH - high performance and widely portable implementation of the Message Passing Interface (MPI) standard.
- OpenMPI - message passing framework.
- Serf - decentralized solution for service discovery and orchestration.
- Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
- Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Twitter Elephant Bird - libraries for working with LZOP-compressed data.
- Twitter Finagle - asynchronous network stack for the JVM.
- Apache Aurora - service scheduler that runs on top of Apache Mesos.
- Apache Falcon - data management framework.
- Apache Oozie - workflow job scheduler.
- Chronos - distributed and fault-tolerant scheduler.
- LinkedIn Azkaban - batch workflow job scheduler.
- Pinterest Pinball - customizable platform for creating workflow managers.
- Sparrow - scheduling platform.
- Apache Mahout - machine learning library for Hadoop.
- Ayasdi Core - tool for topological data analysis.
- brain - Neural networks in JavaScript.
- Cloudera Oryx - real-time large-scale machine learning.
- Concurrent Pattern - machine learning library for Cascading.
- convnetjs - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
- Decider - Flexible and Extensible Machine Learning in Ruby.
- etcML - text classification with machine learning.
- Etsy Conjecture - scalable Machine Learning in Scalding.
- Google Sibyl - System for Large Scale Machine Learning at Google.
- H2O - statistical, machine learning and math runtime for Hadoop.
- IBM Watson - cognitive computing system.
- MLbase - distributed machine learning libraries for the BDAS stack.
- MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
- nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
- PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
- scikit-learn - machine learning in Python.
- Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
- Sparkling Water - combine H2O's Machine Learning capabilities with the power of the Spark platform.
- Vahara - Machine learning and natural language processing with Apache Pig.
- Viv - global platform that enables developers to plug into and create an intelligent, conversational interface to anything.
- Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
- WEKA - suite of machine learning software.
- Wit - Natural Language for the Internet of Things.
- Wolfram Alpha - computational knowledge engine.
- Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
- Berkeley SWIM Benchmark - real-world big data workload benchmark.
- Big-Bench - Big Bench Workload Development.
- Hive-benchmarks - some benchmarking queries for Apache Hive.
- Hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
- Intel HiBench - a Hadoop benchmark suite.
- Netflix Inviso - performance focused Big Data tool.
- PUMA Benchmarking - benchmark suite for MapReduce applications.
- Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.
- Apache Argus - framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Apache Knox Gateway - single point of secure access for Hadoop clusters.
- Apache Sentry - security module for data stored in Hadoop.
- PacketPig - Open Source Big Data Security Analytics.
- Voltage SecureData - data protection framework.
- Ankush - A big data cluster management tool that creates and manages clusters of different technologies.
- Apache Ambari - operational framework for Hadoop management.
- Apache Bigtop - system deployment framework for the Hadoop ecosystem.
- Apache Helix - cluster management framework.
- Apache Mesos - cluster manager.
- Apache Slider - YARN application to deploy existing distributed applications on YARN.
- Apache Whirr - set of libraries for running cloud services.
- Apache YARN - Cluster manager.
- Brooklyn - library that simplifies application deployment and management.
- Buildoop - similar to Apache Bigtop, based on the Groovy language.
- Cloudera HUE - web application for interacting with Hadoop.
- Deimos - Mesos containerizer hooks for Docker.
- Develoop - tool for provisioning, managing and monitoring Apache Hadoop.
- Facebook Autoscale - load balancer that concentrates workload on a server until it has at least a medium-level workload.
- Facebook Prism - multi-datacenter replication system.
- Ganglia Monitoring System - scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
- Genie - REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
- Google Borg - job scheduling and monitoring system.
- Google Omega - job scheduling and monitoring system.
- Hannibal - tool to help monitor and maintain HBase clusters that are configured for manual splitting.
- Hortonworks HOYA - application that can deploy HBase cluster on YARN.
- Jumbune - open-source product built for analyzing Hadoop clusters and MapReduce jobs.
- Marathon - Mesos framework for long-running services.
- Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
- Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
- Apache Nutch - open source web crawler.
- Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
- Apache Tika - content analysis toolkit.
- Domino - Run, scale, share, and deploy models without any infrastructure.
- Eclipse BIRT - Eclipse-based reporting system.
- Eventhub - open source event analytics platform.
- HIPI Library - API for performing image processing tasks on Hadoop's MapReduce.
- Hunk - Splunk analytics for Hadoop.
- MADlib - data-processing library of an RDBMS to analyze data.
- PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
- Qubole - auto-scaling Hadoop cluster, built-in data connectors.
- Sense - Cloud Platform for Data Science and Big Data Analytics.
- Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
- SparkR - R frontend for Spark.
- Splunk - analyzer for machine-generated data.
- Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.
- Google Mesa - highly scalable analytic data warehousing system.
- IBM BigInsights - data processing, warehousing and analytics.
- Microsoft Cosmos - Microsoft's internal BigData analysis platform.
- Apache Lucene - Search engine library.
- Apache Solr - Search platform for Apache Lucene.
- ElasticSearch - Search and analytics engine based on Apache Lucene.
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
- Enigma.io - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
- Facebook Unicorn - social graph search platform.
- Google Caffeine - continuous indexing system.
- Google Percolator - continuous indexing system.
- TeraGoogle - large search index.
- Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
- HBase Coprocessor - implementation of Percolator, part of HBase.
- hIndex - Secondary Index for HBase.
- Lily HBase Indexer - quickly and easily search for any content stored in HBase.
- LinkedIn Bobo - Faceted Search implementation written purely in Java, an extension to Apache Lucene.
- LinkedIn Cleo - flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
- LinkedIn Galene - search architecture at LinkedIn.
- LinkedIn Zoie - realtime search/indexing system written in Java.
- Sphinx Search Server - fulltext search engine.
- Amazon RDS - MySQL databases in Amazon's cloud.
- Drizzle - evolution of MySQL 6.0.
- Google Cloud SQL - MySQL databases in Google's cloud.
- MariaDB - enhanced, drop-in replacement for MySQL.
- MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
- Percona Server - enhanced, drop-in replacement for MySQL.
- ProxySQL - High Performance Proxy for MySQL.
- TokuDB - storage engine for MySQL and MariaDB.
- WebScaleSQL - collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
- HadoopDB - hybrid of MapReduce and DBMS.
- IBM Netezza - high-performance data warehouse appliances.
- Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
- RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
- Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
- Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
- Facebook McDipper - key/value cache for flash storage.
- Facebook Memcached - fork of Memcache.
- Twemproxy - A fast, light-weight proxy for memcached and redis.
- Twitter Fatcache - key/value cache for flash storage.
- Twitter Twemcache - fork of Memcache.
- Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
- BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
- HamsterDB - transactional key-value database.
- HanoiDB - Erlang LSM BTree Storage.
- LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
- LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
- RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.
- ActivePivot - Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing.
- Adatao - business intelligence and data science platform.
- Apama analytics - platform for streaming analytics and intelligent automated action.
- Atigeo xPatterns - data analytics platform.
- BIME Analytics - business intelligence platform in the cloud.
- Chartio - lean business intelligence platform to visualize and explore your data.
- Datapine - self-service business intelligence tool in the cloud.
- Jaspersoft - powerful business intelligence suite.
- Jedox Palo - customisable Business Intelligence platform.
- Microsoft - business intelligence software and platform.
- Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
- Pentaho - business intelligence platform.
- Qlik - business intelligence and analytics platform.
- SpagoBI - open source business intelligence platform.
- Spotfire - business intelligence platform.
- Tableau - business intelligence platform.
- Teradata Aster - Big Data Analytics.
- Tessera - Environment for Deep Analysis of Large Complex Data.
- Zeppelin - open source data analysis environment on top of Hadoop.
- Zoomdata - Big Data Analytics.
- Arbor - graph visualization library using web workers and jQuery.
- CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
- Chart.js - open source HTML5 Charts visualizations.
- Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
- Cubism - JavaScript library for time series visualization.
- Cytoscape - JavaScript library for visualizing complex networks.
- D3 - JavaScript library for manipulating documents.
- DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
- Envisionjs - dynamic HTML5 visualization.
- Freeboard - open source real-time dashboard builder for IoT and other web mashups.
- Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections.
- Google Charts - simple charting API.
- Grafana - graphite dashboard frontend, editor and graph composer.
- Graphite - scalable Realtime Graphing.
- Highcharts - simple and flexible charting API.
- IPython - provides a rich architecture for interactive computing.
- Keylines - toolkit for visualizing the networks in your data.
- Matplotlib - plotting with Python.
- NVD3 - chart components for d3.js.
- Peity - Progressive SVG bar, line and pie charts.
- Plot.ly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
- Recline - simple but powerful library for building data applications in pure Javascript and HTML.
- Redash - open-source platform to query and visualize data.
- Sigma.js - JavaScript library dedicated to graph drawing.
- Vega - a visualization grammar.
- TempoIQ - Cloud-based sensor analytics.
- 2014 - 3D Object Manipulation in a Single Photograph using Stock 3D Models
- 2014 - A Partitioning Framework for Aggressive Data Skipping
- 2014 - DeepFace: Closing the Gap to Human-Level Performance in Face Verification
- 2014 - Fastpass: A Centralized "Zero-Queue" Datacenter Network
- 2014 - In Search of an Understandable Consensus Algorithm
- 2014 - Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases
- 2014 - MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
- 2014 - Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
- 2014 - Orca: A Modular Query Optimizer Architecture for Big Data
- 2014 - Pigeon: A Spatial MapReduce Language
- 2013 - A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data
- 2013 - CG_Hadoop: Computational Geometry in MapReduce
- 2013 - Druid: A Real-time Analytical Data Store
- 2013 - Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask
- 2013 - F1: A Distributed SQL Database That Scales
- 2013 - GraphX: A Resilient Distributed Graph System on Spark
- 2013 - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
- 2013 - MillWheel: Fault-Tolerant Stream Processing at Internet Scale
- 2013 - MLbase: A Distributed Machine-learning System
- 2013 - Online, Asynchronous Schema Change in F1
- 2013 - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
- 2013 - Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
- 2013 - Scalable Progressive Analytics on Big Data in the Cloud
- 2013 - Scaling Memcache at Facebook
- 2013 - Scuba: Diving into Data at Facebook
- 2013 - Shark: SQL and Rich Analytics at Scale
- 2013 - Unicorn: A System for Searching the Social Graph
- 2012 - A Few Useful Things to Know about Machine Learning
- 2012 - Blink and It's Done. Interactive Queries on Very Large Data
- 2012 - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
- 2012 - Dimension Independent Similarity Computation
- 2012 - Fast and Interactive Analytics over Hadoop Data with Spark
- 2012 - ImageNet Classification with Deep Convolutional Neural Networks
- 2012 - Large-Scale Machine Learning at Twitter
- 2012 - Paxos Made Parallel
- 2012 - Paxos Replicated State Machines as the Basis of a High-Performance Data Store
- 2012 - Processing a Trillion Cells per Mouse Click
- 2012 - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
- 2012 - Spanner: Google's Globally-Distributed Database
- 2012 - The Unified Logging Infrastructure for Data Analytics at Twitter
- 2012 - The Vertica Analytic Database: C-Store 7 Years Later
- 2011 - CrowdDB: Answering Queries with Crowdsourcing
- 2011 - CrowdDB: Query Processing with the VLDB Crowd
- 2011 - Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
- 2011 - Matching Unstructured Product Offers to Structured Product Specifications
- 2011 - Megastore: Providing Scalable, Highly Available Storage for Interactive Services
- 2011 - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- 2011 - Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters
- 2010 - Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
- 2010 - Dremel: Interactive Analysis of Web-Scale Datasets
- 2010 - Finding a needle in Haystack: Facebook's photo storage
- 2010 - FlumeJava: Easy, Efficient Data-Parallel Pipelines
- 2010 - Large-scale Incremental Processing Using Distributed Transactions and Notifications
- 2010 - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
- 2010 - Pregel: A System for Large-Scale Graph Processing
- 2010 - S4: Distributed Stream Computing Platform
- 2010 - Spark: Cluster Computing with Working Sets
- 2010 - ZooKeeper: Wait-free coordination for Internet-scale systems
- 2009 - Cassandra - A Decentralized Structured Storage System
- 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- 2009 - Vertical Paxos and Primary-Backup Replication
- 2008 - Chukwa: A large-scale monitoring system
- 2008 - Column-Stores vs. Row-Stores: How Different Are They Really?
- 2008 - PNUTS: Yahoo!'s Hosted Data Serving Platform
- 2008 - Top 10 algorithms in data mining
- 2007 - Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
- 2007 - Dynamo: Amazon's Highly Available Key-value Store
- 2007 - Life beyond Distributed Transactions: an Apostate's Opinion
- 2007 - Paxos Made Live - An Engineering Perspective
- 2006 - Bigtable: A Distributed Storage System for Structured Data
- 2006 - Ceph: A Scalable, High-Performance Distributed File System
- 2006 - Map-Reduce for Machine Learning on Multicore
- 2006 - The Chubby lock service for loosely-coupled distributed systems
- 2005 - Fast Paxos
- 2003 - The Google File System
- 2002 - Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services