unstructured data free download

Showing 66 open source projects for "unstructured data"

1

Unstructured.IO

Open source libraries and APIs to build custom preprocessing pipelines

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. Its use cases revolve around streamlining and optimizing the data processing workflow for LLMs. The library's modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data...

Downloads: 0 This Week

Last Update: 2025-09-17
See Project
2

MeshLab

The open source mesh processing system

MeshLab is an open-source, portable, and extensible system for processing and editing large unstructured 3D triangular meshes. It aims to help process the typical not-so-small unstructured models that arise in 3D scanning, providing a set of tools for editing, cleaning, healing, inspecting, rendering, and converting these kinds of meshes. MeshLab is mostly based on the open source C++ mesh processing library VCGlib, developed at the Visual Computing Lab of ISTI - CNR. VCG can...

Downloads: 43 This Week

Last Update: 2025-07-22
See Project
3

Milvus

Vector database for scalable similarity search and AI applications

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible, and provides a consistent user experience regardless of the deployment environment. Milvus 2.0 is a cloud-native vector database with storage and computation separated by design. All components in this refactored version of Milvus are stateless to enhance elasticity and flexibility. Average latency measured in milliseconds on trillion...

Downloads: 11 This Week

Last Update: 2025-10-11
See Project
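To make "embedding similarity search" concrete, here is a minimal brute-force sketch in plain Python. It does not use Milvus or its API; the function names (`cosine_similarity`, `top_k`) and the toy three-dimensional embeddings are hypothetical, chosen only to illustrate ranking stored vectors by similarity to a query vector. A vector database would typically replace the linear scan with an index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (id, score) pairs most similar to the query (linear scan)."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real embeddings come from a model and have many more dimensions.
embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], embeddings, k=2)
print(results)
```

The scan is linear in the number of stored vectors, which is fine for a toy set but is the cost that dedicated vector indexes are meant to avoid.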
4

LlamaIndex

Central interface to connect your LLMs with external data

LlamaIndex (GPT Index) is a project that provides a central interface to connect your LLMs with external data. It is a simple, flexible interface between your external data and LLMs, and it provides the following tools in an easy-to-use fashion. Indices over your unstructured and structured data for use with LLMs. These indices help to abstract away common boilerplate and pain points for in-context learning. Dealing with prompt limitations (e.g. 4096 tokens for Davinci) when...

Downloads: 9 This Week

Last Update: 4 days ago
See Project
5

Parsera

Lightweight library for scraping websites with LLMs

Scrape data from any website with only a link and column descriptions. Parsera is a tool designed to scrape web content, specifically handling poorly structured or messy websites.

Downloads: 7 This Week

Last Update: 2025-10-08
See Project
6

Obsei

Obsei is a low-code, AI-powered automation tool

Obsei is an automated no-code/low-code AI-powered text observation and analysis framework, designed for extracting insights from unstructured text data such as social media, reviews, and logs.

Downloads: 6 This Week

Last Update: 2025-01-24
See Project
7

DataProfiler

Extract schema, statistics and entities from datasets

DataProfiler is an AI-powered tool for automatic data analysis and profiling, designed to detect patterns, anomalies, and schema inconsistencies in structured and unstructured datasets. It is a Python library designed to make data analysis, monitoring, and sensitive data detection easy. Loading data: with a single command, the library automatically formats and loads files into a DataFrame. Profiling the data: the library identifies the schema, statistics, entities (PII / NPI...

Downloads: 3 This Week

Last Update: 2025-07-30
See Project
8

Superlinked

Superlinked is a Python framework for AI Engineers

Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.

Downloads: 2 This Week

Last Update: 2025-09-18
See Project
9

RAGFlow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLMs (large language models) to provide truthful question-answering capabilities, backed by well-founded citations from data in various complex formats.

Downloads: 7 This Week

Last Update: 5 days ago
See Project
10

Gretel Synthetics

Synthetic data generators for structured and unstructured text

Unlock unlimited possibilities with synthetic data. Share, create, and augment data with cutting-edge generative AI. Generate unlimited data in minutes with synthetic data delivered as-a-service. Synthesize data that is as good as or better than your original dataset, and maintain relationships and statistical insights. Customize privacy settings so that data is always safe while remaining useful for downstream workflows. Ensure data accuracy and privacy confidently with expert-grade reports...

Downloads: 5 This Week

Last Update: 2025-03-17
See Project
11

Diffgram

Training data (data labeling, annotation, workflow) for all data types

... the activities of annotation, which produces structured data, ready to be consumed by a machine learning model. Annotation is required because raw media is considered to be unstructured and not usable without it. That’s why training data is required for many modern machine learning use cases, including computer vision, natural language processing, and speech recognition.

Downloads: 5 This Week

Last Update: 2024-10-14
See Project
12

MindsDB

Making Enterprise Data Intelligent and Responsive for AI

MindsDB is an AI data solution that enables humans, AI, agents, and applications to query data in natural language and SQL, and get highly accurate answers across disparate data sources and types. MindsDB connects to diverse data sources and applications, and unifies petabyte-scale structured and unstructured data. Powered by an industry-first cognitive engine that can operate anywhere (on-prem, VPC, serverless), it empowers both humans and AI with highly informed decision-making...

Downloads: 6 This Week

Last Update: 2025-09-04
See Project
13

CrateDB

CrateDB is a distributed and scalable SQL database

CrateDB is a distributed SQL database designed for massive machine data and real-time analytics. It combines the scalability and performance of NoSQL with the power and simplicity of SQL, allowing for horizontal scaling, full-text search, and complex queries over large datasets. Built in Java and powered by Elasticsearch and Lucene, CrateDB is optimized for high-velocity data ingestion and dynamic queries.

Downloads: 3 This Week

Last Update: 6 days ago
See Project
14

Pimcore

Open Source Data & Experience Management Platform

Whether you're dealing with unstructured web documents or structured data for MDM/PIM, you define the UI design (web documents with a template and structured data with an intuitive graphical editor), and Pimcore persists the data efficiently, optimized for fast access. Thanks to its framework approach, Pimcore is very flexible and adapts to your needs. Because it is built on top of the well-known Symfony Framework, you have a solid and modern foundation for your project. Benefit from all...

Downloads: 3 This Week

Last Update: 2025-10-08
See Project
15

TextFSM

Python module for parsing semi-structured text into Python tables

TextFSM is a Python library created by Google that provides a template-based state machine engine for parsing semi-structured text. It is particularly useful for extracting structured data from command-line interface (CLI) outputs, such as those from network devices, routers, and switches. By defining parsing logic through reusable template files, TextFSM transforms unstructured text into structured data like lists or tables without requiring complex regular expression code. Each template...

Downloads: 3 This Week

Last Update: 2025-10-11
See Project
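The template-driven state machine idea described for TextFSM can be sketched with the standard library alone. The following is not TextFSM's API or template syntax; the `RULES` table, the `parse` function, and the sample CLI output are all hypothetical, and serve only to show how per-state rules can turn semi-structured text into rows.

```python
import re

# Hypothetical mini "template": each state lists (pattern, action, next_state) rules.
# "capture" stores matched fields; "record" stores them and emits a finished row.
RULES = {
    "Start": [
        (re.compile(r"^Interface\s+(?P<name>\S+)"), "capture", "InInterface"),
    ],
    "InInterface": [
        (re.compile(r"^\s+Status:\s+(?P<status>\w+)"), "record", "Start"),
    ],
}


def parse(text):
    """Walk the text line by line, applying the rules of the current state."""
    state = "Start"
    current = {}
    rows = []
    for line in text.splitlines():
        for pattern, action, next_state in RULES[state]:
            match = pattern.search(line)
            if not match:
                continue
            current.update(match.groupdict())
            if action == "record":
                rows.append(current)
                current = {}
            state = next_state
            break  # first matching rule wins for this line
    return rows


# Hypothetical CLI output, loosely modeled on network-device text.
sample = """Interface eth0
  Status: up
Interface eth1
  Status: down
"""

rows = parse(sample)
print(rows)
```

Keeping the rules in a data table rather than in control flow is the design point: supporting a new output format means editing the table, not the parser loop.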
16

Pachyderm

Data-Centric Pipelines and Data Versioning

Data-driven pipelines automatically trigger based on detecting data changes. Automatic immutable data lineage and data versioning of all data types. Autoscaling and parallel processing built on Kubernetes for resource orchestration. Uses standard object stores for data storage with automatic deduplication. Runs across all major cloud providers and on-premises installations. Automatic and intelligent versioning of even the largest data sets of unstructured and structured data. Git-like...

Downloads: 4 This Week

Last Update: 2025-01-15
See Project
17

Airweave

Airweave lets agents search any app

Airweave is an open-source platform that enables agents to semantically search across various applications, databases, and APIs. By transforming disparate data sources into a unified, searchable knowledge base, Airweave facilitates intelligent information retrieval through REST APIs or the MCP protocol. It's particularly useful for building AI agents that require access to structured and unstructured data across multiple platforms.

Downloads: 3 This Week

Last Update: 2 days ago
See Project
18

XTDB

General-purpose bitemporal database for SQL, Datalog & graph queries

... the relational model with SQL and the graph model with Datalog, over the same data. Explore all your data. Across all implicit relationships, across all time. As a document-oriented database, XTDB makes your data immediately available without the need for an upfront schema. Both structured and unstructured data are at home in XTDB. Legal regulations like GDPR often pose a challenge when designing systems around immutable data.

Downloads: 4 This Week

Last Update: 2025-06-11
See Project
19

LlamaParse

Parse files for optimal RAG

LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). Load in 160+ data sources and data formats, from unstructured and semi-structured to structured data (APIs, PDFs, documents, SQL, etc.). Store and index your data for different use cases. Integrate with 40+ vector stores, document stores, graph stores, and SQL db providers.

Downloads: 1 This Week

Last Update: 3 days ago
See Project
20

Text Search Engine

A text search engine that supports mixed Chinese and English search

Text-Search-Engine is a JavaScript-based lightweight search engine that enables full-text search functionality. It allows developers to implement fast search indexing and retrieval in web applications.

Downloads: 2 This Week

Last Update: 2025-07-23
See Project
21

Search-Index

A persistent, network resilient, full text search library

Search-Index is a lightweight and fast JavaScript-based search engine that enables full-text search indexing and retrieval for web applications.

Downloads: 2 This Week

Last Update: 2025-03-12
See Project
22

marqo

Tensor search for humans

... and text-to-image search and analytics. Marqo adapts and stores your data in a fully schemaless manner. It combines tensor search with a query DSL that provides efficient pre-filtering. Tensor search allows you to go beyond keyword matching and search based on the meaning of text, images, and other unstructured data. Be a part of the tribe and help us revolutionize the future of search. Whether you are a contributor, a user, or simply have questions about Marqo, we've got your back.

Downloads: 4 This Week

Last Update: 2025-10-10
See Project
23

GeoStats.jl

An extensible framework for geospatial data science

GeoStats.jl is a Julia framework for geospatial data science and geostatistical modeling. It’s fully implemented in Julia and designed to provide an extensible, high-performance stack that handles spatial domains, interpolation, simulation, learning, and visualization. The package is modular: it breaks out geometry, spatial domains, transforms, variograms, covariance models, and modeling into subpackages (e.g., GeoStatsBase, GeoStatsModels, GeoStatsTransforms). Users can represent georeferenced...

Downloads: 4 This Week

Last Update: 3 days ago
See Project
24

Gridap.jl

Grid-based approximation of partial differential equations in Julia

Gridap provides a set of tools for the grid-based approximation of partial differential equations (PDEs) written in the Julia programming language. The library currently supports linear and nonlinear PDE systems for scalar and vector fields, single and multi-field problems, conforming and nonconforming finite element (FE) discretizations, on structured and unstructured meshes of simplices and n-cubes. It also provides methods for time integration. Gridap is extensible and modular. One can...

Downloads: 3 This Week

Last Update: 2025-09-25
See Project
25

OpenWPM

A web privacy measurement framework

OpenWPM is a web privacy measurement framework that makes it easy to collect data for privacy studies on a scale of thousands to millions of websites. OpenWPM is built on top of Firefox, with automation provided by Selenium. It includes several hooks for data collection. Check out the instrumentation section below for more details. OpenWPM is tested on Ubuntu 18.04 via TravisCI and is commonly used via the docker container that this repo builds, which is also based on Ubuntu. Although we don't...

Downloads: 3 This Week

Last Update: 2025-01-19
See Project