  • 1
    Apache InLong

    Apache InLong - a one-stop integration framework for massive data

    Apache InLong is a one-stop integration framework for massive data that provides automatic, secure, and reliable data transmission capabilities. InLong supports batch and stream data processing at the same time, making it well suited to building data analysis, modeling, and other real-time applications on top of streaming data. InLong (应龙) is a divine beast in Chinese mythology that guides rivers into the sea, a metaphor for the way the InLong system channels reported data streams. InLong was originally built at Tencent, where it has served online businesses for more than 8 years, supporting data reporting at a scale of more than 80 trillion records per day in big data scenarios. The platform integrates five modules: Ingestion, Convergence, Caching, Sorting, and Management, so that a business only needs to supply its data sources, data service quality requirements, data landing clusters, and data landing formats.
  • 2

    Big Sack

    Big Sack: A lightweight Java Key/Value store with undo and disk cache.

    Big Sack is a Java persistence mechanism that stores key/value pairs following the popular Big Data paradigms. It is a very simple and straightforward way to bridge the gap between in-memory data structures and long-term storage. It offers the convenience of the Java SDK TreeMap and TreeSet classes and is used in the same easy way, but it adds rollback via undo logging to checkpointed data, so the store never ends up in an unknown state regardless of failures. Data storage in the exabyte range is possible using filesystem and/or memory-mapped I/O, and three levels of configurable write-through caching at different granularities keep performance up. A minimal sketch of the TreeMap-style usage pattern it mirrors follows this entry.
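
    The entry above compares Big Sack to the standard java.util.TreeMap/TreeSet API. The sketch below only demonstrates that familiar ordered key/value pattern with the stock JDK classes; Big Sack's own class names and persistence calls are not shown here and would differ.

      import java.util.TreeMap;
      import java.util.TreeSet;

      public class OrderedKvSketch {
          public static void main(String[] args) {
              // Ordered key/value map: the usage style Big Sack is said to mirror.
              TreeMap<String, String> map = new TreeMap<>();
              map.put("user:1001", "alice");
              map.put("user:1002", "bob");
              System.out.println(map.firstKey());        // "user:1001"
              System.out.println(map.get("user:1002"));  // "bob"

              // Ordered key set, analogous to TreeSet usage.
              TreeSet<Long> ids = new TreeSet<>();
              ids.add(1002L);
              ids.add(1001L);
              System.out.println(ids.first());           // 1001
          }
      }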
  • 3
    Blue Whale Configuration Platform

    Blue Whale smart cloud configuration platform

    The platform draws on experience supporting hundreds of Tencent businesses and is compatible with a wide range of complex system architectures; it was born in operations and is built for operations. From configuration management to job execution, task scheduling, monitoring and self-healing, and on to operations big data analysis that supports operational decision-making, it covers the full life cycle of business operations. Its open PaaS provides a powerful development framework and scheduling engine, together with a complete operations development training system, helping operations teams transform and upgrade quickly. With the Blue Whale intelligent cloud system, enterprises can rapidly automate basic operations services, accelerating the shift to DevOps, establishing a tooling culture, and maximizing operational efficiency.
  • 4

    Chinese I Ching R Algorithms

    Chinese I Ching Algorithms implemented with R programming

    The I Ching offers a way to summarize the world by constructing functional mappings over finite structures. Much of ancient Chinese natural and social science grew from its concepts, and derived theories such as the "Eight Diagrams" and the "Five Elements" became the foundation of nearly all academic fields in ancient China. As a result, these theories also concentrate a great deal of practical and statistical experience accumulated over China's thousands of years of history. This project tries to realize these algorithms with statistical programming. Bridging that long span matters for several reasons: first, we can test the validity of the algorithms against big data; second, we can apply them far more widely in the data age; third, the I Ching's distinctive perspective of linking "all the universe" suggests new ways of thinking about interdisciplinary use of data. The program is open source; if you are interested, you are welcome to discuss and work on it together. A small illustrative sketch of the finite-structure idea follows this entry.
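
    As an illustration of the "functional mappings over finite structures" idea (and not code from this project, whose R sources are not shown here), the eight trigrams of the I Ching correspond naturally to the eight elements of {0,1}^3 when a broken (yin) line is read as 0 and a solid (yang) line as 1. The Java sketch below simply enumerates that correspondence.

      public class TrigramSketch {
          public static void main(String[] args) {
              // Each of the 8 trigrams is a 3-bit pattern: yin = 0, yang = 1,
              // reading bit 0 as the bottom line (a convention chosen for this sketch).
              // 000 is the all-yin trigram (Kun) and 111 is the all-yang trigram (Qian).
              for (int code = 0; code < 8; code++) {
                  String binary = String.format("%3s", Integer.toBinaryString(code)).replace(' ', '0');
                  StringBuilder lines = new StringBuilder();
                  for (int bit = 2; bit >= 0; bit--) {  // print the top line first
                      lines.append(((code >> bit) & 1) == 1 ? "yang " : "yin ");
                  }
                  System.out.println(binary + " -> " + lines.toString().trim());
              }
          }
      }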
  • 6

    Chordalysis

    Log-linear analysis (data modelling) for high-dimensional data

    ===== Project moved to https://github.com/fpetitjean/Chordalysis ===== Log-linear analysis is the statistical method used to capture multi-way relationships between variables. However, because of its exponential nature, previous approaches could not scale beyond about a dozen variables. Chordalysis is a log-linear analysis method for big data: it exploits recent results in graph theory by representing complex models as compositions of triangular structures, also known as chordal graphs, which makes it possible to discover the structure of datasets with thousands of variables on a standard desktop computer (a note on the underlying model follows this entry). Associated papers at ICDM 2013, ICDM 2014, and SDM 2015 can be found at http://www.francois-petitjean.com/Research/. YourKit supports the Chordalysis open source project with its full-featured Java Profiler; YourKit creates innovative and intelligent tools for profiling Java and .NET applications. http://www.yourkit.com
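
    For context (this is the standard textbook formulation, not notation taken from the Chordalysis papers), a hierarchical log-linear model for a three-way contingency table with variables A, B, and C writes the expected cell counts as a sum of interaction terms; dropping higher-order terms encodes conditional-independence structure:

      \log \mu_{ijk} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k}
                     + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{ABC}_{ijk}

    The number of candidate models grows exponentially with the number of variables, which is exactly the scalability problem that restricting the search to chordal (triangulated) graph structures is meant to tame.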
  • 7
    CrispI

    A hybrid graph database with analysis

    A graph database layer in Java that allows mixed-mode database handling. For example, on top of an underlying OODB, OneToMany and ManyToOne relationships can be implemented, allowing quick and robust hierarchies (a subset of a graph) to be built; a generic sketch of that pattern follows this entry. The system also includes a Big Data implementation, other analytics, and visualization. The current incarnation is based on a Versant DB. Usage examples can be found in CrispI-Examples.
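
    The sketch below is a plain-JDK illustration of the one-to-many/many-to-one hierarchy pattern described above; it does not use CrispI's or Versant's actual APIs, and the class and field names are invented for the example.

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      // Hypothetical names: a forward OneToMany edge from parent to children,
      // and a reverse ManyToOne index from each child back to its parent.
      public class HierarchySketch {
          private final Map<String, List<String>> childrenOf = new HashMap<>(); // OneToMany
          private final Map<String, String> parentOf = new HashMap<>();         // ManyToOne

          public void link(String parent, String child) {
              childrenOf.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
              parentOf.put(child, parent);
          }

          public static void main(String[] args) {
              HierarchySketch h = new HierarchySketch();
              h.link("root", "dept-A");
              h.link("root", "dept-B");
              h.link("dept-A", "team-1");
              System.out.println(h.childrenOf.get("root"));  // [dept-A, dept-B]
              System.out.println(h.parentOf.get("team-1"));  // dept-A
          }
      }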
  • 8
    Cube Platform is a decentralized grid computing system that uses the Pastry P2P protocol for communication between nodes. It is a big data store written in Java.
  • 9

    Custom Apache Big data Distribution

    A Custom Apache Distribution including Spark and Hadoop, for Windows.

    This distribution has been customized to work out of the box: just download it and unzip it, then set the PATH entries for the bin folders along with HADOOP_HOME, SPARK_HOME, and JAVA_HOME. That's it! You can use Hadoop and Spark natively on Windows (a small sanity-check sketch follows this entry).
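
    As a tiny sanity check after following the steps above, the Java sketch below just prints the three environment variables the entry asks you to set, so you can confirm they are visible to JVM processes. The variable names come from the entry itself; nothing here is specific to this distribution.

      public class EnvCheck {
          public static void main(String[] args) {
              // The entry above asks for these three variables plus PATH entries.
              String[] vars = { "JAVA_HOME", "HADOOP_HOME", "SPARK_HOME" };
              for (String name : vars) {
                  String value = System.getenv(name);
                  System.out.println(name + " = " + (value != null ? value : "<not set>"));
              }
          }
      }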
  • 10
    DSTK - DataScience ToolKit

    DSTK - DataScience ToolKit for All of Us

    DSTK - DataScience ToolKit is free, open source software for statistical analysis, data visualization, text analysis, and predictive analytics. A newer version with a smaller file size can be found at https://sourceforge.net/projects/dstk3/. It is designed to be straightforward, easy to use, and familiar to SPSS users. While JASP offers more statistical features, DSTK aims to be a broad workbench that includes text analysis and predictive analytics; you may still prefer JASP for advanced data editing and RapidMiner for advanced predictive modeling. DSTK is written in C#, Java, and Python and interfaces with R, NLTK, and Weka. It can be extended with plugins written as R scripts, and plugins are available for additional statistical functions and for Big Data analytics on Microsoft Azure HDInsight (Spark) via Livy. License: R, RStudio, NLTK, SciPy, scikit-learn, Matplotlib, Weka, etc. each have their own licenses.
  • 11
    Exl2Sql

    Excel to SQL

    This tool converts an Excel spreadsheet (.xls and .xlsx files) into SQL INSERT statements for a single table. The first row of your Excel sheet is used as the column names, so it cannot contain any NULL values; the data underneath each column name is then written into that column by the generated INSERT statements (the sketch after this entry illustrates the mapping). You can save or copy the output and use Find and Replace if you need to tweak it. Good for big data. I needed this for my work, so I created it over a weekend and am happy to share it with the community. Requires the .NET Framework on your PC; if you're on Windows, you're okay.
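
    To illustrate the row-to-INSERT mapping described above (not this tool's own code, which is a .NET application), the Java sketch below uses the Apache POI library to read the first sheet of an .xlsx file, treat row 0 as column names, and print one INSERT per data row. The table name, file name, and the lack of type handling are simplifications for the example.

      import java.io.File;
      import org.apache.poi.ss.usermodel.*;

      public class ExcelToInsertsSketch {
          public static void main(String[] args) throws Exception {
              try (Workbook wb = WorkbookFactory.create(new File("data.xlsx"))) {
                  Sheet sheet = wb.getSheetAt(0);
                  DataFormatter fmt = new DataFormatter();
                  Row header = sheet.getRow(0);            // row 0 = column names
                  int cols = header.getLastCellNum();

                  StringBuilder columns = new StringBuilder();
                  for (int c = 0; c < cols; c++) {
                      columns.append(c > 0 ? ", " : "").append(fmt.formatCellValue(header.getCell(c)));
                  }

                  for (int r = 1; r <= sheet.getLastRowNum(); r++) {
                      Row row = sheet.getRow(r);
                      if (row == null) continue;
                      StringBuilder values = new StringBuilder();
                      for (int c = 0; c < cols; c++) {
                          String v = fmt.formatCellValue(row.getCell(c)).replace("'", "''");
                          values.append(c > 0 ? ", " : "").append("'").append(v).append("'");
                      }
                      // Target table name is a placeholder for this sketch.
                      System.out.println("INSERT INTO my_table (" + columns + ") VALUES (" + values + ");");
                  }
              }
          }
      }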
  • 12
    FrincBackup

    Incremental backup tool supporting removable storage devices

    FrincBackup stands for free incremental backup. It was developed for backing up a multi-terabyte NAS whose storage devices form a logical volume to multiple removable storage devices, such as 500 GB USB hard drives. Files are backed up as files (not as an archive) and are readable without any tool and without FrincBackup itself (although a restore mode is provided for convenience).
  • 13
    GOBIG
    GOBIG is a toolbox for detecting genetic variations, intended to handle big data. More importantly, it can be used to detect clusters of SNP variants. The toolbox is meant to work with both common and rare variants, for example to find the genetic map of genes causing complex diseases.
  • 14
    Genie

    Distributed Big Data Orchestration Service

    Genie is a fully open source distributed job orchestration engine developed by Netflix. Genie provides RESTful APIs to run a variety of big data jobs such as Hadoop, Pig, Hive, Spark, Presto, Sqoop, and more (a hedged client sketch follows this entry). It also provides APIs for managing the metadata of many distributed processing clusters and of the commands and applications that run on them.
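
    As a rough illustration of calling a REST API like Genie's from Java (using only the JDK's built-in HTTP client), the sketch below POSTs a JSON job request. The host, the /api/v3/jobs path, and the JSON fields are assumptions for this example; consult the Genie documentation for the actual request schema.

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;

      public class GenieClientSketch {
          public static void main(String[] args) throws Exception {
              // Assumed endpoint and payload; adjust to the real Genie job-request schema.
              String body = "{\"name\":\"example-job\",\"user\":\"demo\","
                      + "\"commandArgs\":\"-f script.hql\","
                      + "\"clusterCriterias\":[{\"tags\":[\"type:yarn\"]}],"
                      + "\"commandCriteria\":[\"type:hive\"]}";

              HttpRequest request = HttpRequest.newBuilder()
                      .uri(URI.create("http://genie.example.com:8080/api/v3/jobs"))
                      .header("Content-Type", "application/json")
                      .POST(HttpRequest.BodyPublishers.ofString(body))
                      .build();

              HttpResponse<String> response = HttpClient.newHttpClient()
                      .send(request, HttpResponse.BodyHandlers.ofString());
              System.out.println(response.statusCode());
              // A REST service typically identifies the created job via a Location header.
              System.out.println(response.headers().firstValue("Location").orElse("<no location>"));
          }
      }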
  • 15
    GnuCopy
    GnuCopy is an open-source tool to copy and archive all your important data. It supports the major archive types, such as Zip and Tar, to guarantee easy and secure exchange between all kinds of operating systems. Additionally, you can create profiles that blacklist or whitelist specific file types or folders to separate your big data stores for backups.
  • 16
    GridDB

    GridDB is a next-generation open source database

    A cyber-physical system collects a variety of data in physical space (the real world), analyzes it and converts it into knowledge in cyberspace, and feeds that knowledge back to the real world to revitalize industry and solve social problems. GridDB is an open database that enables the real-time processing of vast amounts of time-series data from physical space, which is necessary to realize a cyber-physical system. Its multi-model architecture supports various data stores, with a time-series-oriented, pluggable data store for efficient real-time processing and management of huge volumes of high-frequency time-series data. Architectural innovations such as an in-memory orientation with "memory as the main unit and disk as the secondary unit" and an event-driven design with minimal overhead give it processing capabilities that can handle petabyte-scale applications.
  • 17

    HSRA

    Hadoop spliced read aligner for RNA-seq data

    HSRA is a MapReduce-based parallel tool for mapping reads from RNA sequencing (RNA-seq) experiments. RNA-seq analyses typically begin by mapping reads to a reference genome in order to determine the locations from which the reads originated, which is a very time-consuming step. HSRA lets bioinformatics researchers efficiently distribute their mapping tasks over the nodes of a cluster by combining a fast multithreaded spliced aligner (HISAT2) with Apache Hadoop, a distributed computing framework for scalable Big Data processing. HSRA currently supports single-end and paired-end read alignments from FASTQ/FASTA datasets. Moreover, the tool uses the Hadoop Sequence Parser (HSP) library (link above) to efficiently read input datasets stored on the Hadoop Distributed File System (HDFS), and can process datasets compressed with the Gzip and BZip2 codecs (a generic HDFS-read sketch follows this entry).
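
    The sketch below shows the generic Hadoop FileSystem API pattern for reading a file stored on HDFS (here a FASTQ file); it is not HSRA's or HSP's actual code, and the path and cluster address are placeholders.

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsReadSketch {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Placeholder NameNode address; normally picked up from core-site.xml.
              conf.set("fs.defaultFS", "hdfs://namenode:8020");

              try (FileSystem fs = FileSystem.get(conf);
                   BufferedReader reader = new BufferedReader(
                           new InputStreamReader(fs.open(new Path("/data/sample_1.fastq"))))) {
                  // Print the first FASTQ record (4 lines) just to prove the read works.
                  for (int i = 0; i < 4; i++) {
                      System.out.println(reader.readLine());
                  }
              }
          }
      }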
  • 18
    The idea behind this project is to create and provide a library of big data structures, such as trees, lists, and others.
  • 19
    LEACrypt

    TTAK.KO-12.0223 Lightweight Encryption Algorithm Tool

    The Lightweight Encryption Algorithm (also known as LEA) is a 128-bit block cipher developed by South Korea in 2013 to provide confidentiality in high-speed environments such as big data and cloud computing, as well as in lightweight environments such as IoT and mobile devices. LEA is one of the cryptographic algorithms approved by the Korean Cryptographic Module Validation Program (KCMVP) and is a national standard of the Republic of Korea (KS X 3246). LEA is included in the ISO/IEC 29192-2:2019 standard (Information security - Lightweight cryptography - Part 2: Block ciphers). This project is licensed under the ISC License. Copyright © 2020-2021 ALBANESE Research Lab Source code: https://github.com/pedroalbanese/leacrypt Visit: http://albanese.atwebpages.com
  • 20

    LogicalSets

    Integrated Comprehensive Data Architecture & Methodology

    This is an advanced data architecture and methodology: a comprehensive Enterprise Resource Management system built as a reusable database with rules for customization. While being a data-driven transaction processing engine, the system also has very advanced reporting capabilities, and the design eliminates up to 90% of business logic through the way the data is structured. It uses a concept called Table Sets, with a compound key that tells the programmer which table set and which record a given applet will view or edit. Developed in SAP PowerDesigner for (Sybase) SQL Anywhere. Don't let the date fool you; this system is ahead of its time.
  • 21

    MapReduce Brazil

    Aggregates MapReduce projects

    Nowadays the production and storage of Big Data is common in both academia and industry. Processing this huge amount of data requires high-performance platforms and programming models such as MapReduce (a minimal mapper/reducer sketch follows this entry).
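
    For readers unfamiliar with the MapReduce model mentioned above, the sketch below is the classic word-count example written against the standard Apache Hadoop API; it is a generic illustration, not code from any project aggregated here.

      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

          // Map phase: emit (word, 1) for every token of every input line.
          public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              protected void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
                  StringTokenizer tokens = new StringTokenizer(value.toString());
                  while (tokens.hasMoreTokens()) {
                      word.set(tokens.nextToken());
                      context.write(word, ONE);
                  }
              }
          }

          // Reduce phase: sum the counts emitted for each word.
          public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable v : values) {
                      sum += v.get();
                  }
                  context.write(key, new IntWritable(sum));
              }
          }

          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "word count");
              job.setJarByClass(WordCount.class);
              job.setMapperClass(TokenizerMapper.class);
              job.setCombinerClass(SumReducer.class);
              job.setReducerClass(SumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }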
  • 22

    MarDRe

    MapReduce-based tool to remove duplicate DNA reads

    MarDRe is a de novo MapReduce-based parallel tool to remove duplicate and near-duplicate DNA reads through the clustering of single-end and paired-end sequences from FASTQ/FASTA datasets. It allows bioinformaticians to avoid analyzing unnecessary reads, reducing the time of later processing steps on the dataset. MarDRe is the Big Data counterpart of ParDRe (link above), which employs HPC technologies (hybrid MPI/multithreading) to reduce runtime on multicore systems; MarDRe instead takes advantage of the MapReduce programming model to significantly improve on ParDRe's performance on distributed systems, especially on cloud-based infrastructures (the grouping-based deduplication idea is sketched after this entry). Written in pure Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for Big Data processing.
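
    The reducer sketch below illustrates the general MapReduce deduplication idea (group candidate duplicates under one key, then keep a single representative); it is a simplification for illustration, not MarDRe's actual clustering logic, and the choice of key (an exact sequence-derived key the mapper would emit) is an assumption.

      import java.io.IOException;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      // Assumes a mapper that emits (sequence-derived key, full read record).
      // All reads sharing a key arrive at one reduce call; only one is kept.
      public class DedupReducerSketch extends Reducer<Text, Text, Text, Text> {
          @Override
          protected void reduce(Text key, Iterable<Text> reads, Context context)
                  throws IOException, InterruptedException {
              for (Text read : reads) {
                  context.write(key, read);   // keep the first read in the group
                  break;                      // skip the remaining (near-)duplicates
              }
          }
      }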
  • 23
    Neuro

    The Neuro crypto currency

    The Neuro (NRO) cryptocurrency is designed to support solutions to machine learning, big data, and neural network tasks. Neuro is a scientific-technical project uniting scientists, engineers, and programmers inspired by the idea of building something big, kind, and bright. From the first stages of the work we will develop new neural network architectures and algorithms, and someday we intend to enter the annual ImageNet Challenge to compete with giants such as GoogLeNet Inception and Microsoft ResNet. At later stages we will adapt the neural networks to calculate molecular interactions in protein environments; our system will help search for new kinds of drugs for cancer, Alzheimer's, and other serious problems of modern medicine. We plan to make a serious contribution to increasing human life expectancy.
  • 24
    OCW Test - Out of Commerce Works

    Program for out of commerce works detection

    The OCW Test program has been designed to assist in the detection of out-of-commerce works, taking as its reference a list of works from a specific bibliographic catalog. In this first version, the program operates on the identifiers of the books of the library of the Complutense University of Madrid; however, it can be adapted to work with any bibliographic catalog.
  • 25
    ODD Platform

    First open-source data discovery and observability platform

    Unlock the power of big data with the OpenDataDiscovery Platform. Experience seamless end-to-end insights, powered by observability and trust from ingestion to production, while building your ideal tech stack. Democratize data and accelerate insights: find data that fits your use case, discover hints left by your peers to leverage existing knowledge, and explore tags, ownership details, links to other sources, and other information to shorten and simplify the data discovery phase. Forget unnerved stakeholders and time wasted digging for the root cause when data fails; with ODD's automatic company-wide ingestion-to-production lineage you'll have answers in seconds, and stakeholders won't need to wait. Sleep well knowing all your data is in check: forget manual testing, days of debugging, and weeks of worrying, know the impact of each code change with automatic testing, and enjoy lineage and alerts enriched with data quality information.