Apache Spark

Technology, Information and Internet

Berkeley, CA 21,856 followers

Unified engine for large-scale data analytics

About us

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Key Features

- Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
- SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
- Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
- Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.

The most widely-used engine for scalable computing

Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. Over 2,000 contributors to the open source project from industry and academia.

Ecosystem

Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.

Website: https://spark.apache.org/
Industry: Technology, Information and Internet
Company size: 51-200 employees
Headquarters: Berkeley, CA
Type: Nonprofit
Specialties: Apache Spark, Big Data, Machine Learning, SQL Analytics, Batch, and Streaming

Locations

Primary

Berkeley, CA, US

Updates

Apache Spark

21,856 followers
4d
Curious about GPU acceleration in Spark with no code changes? In this upcoming session, we’ll walk through how the new Spark Connect extensions for ML, along with Spark SQL’s plugin interface, integrate seamlessly with NVIDIA open source GPU-accelerated libraries. Expect to learn: 🔹 How Spark Connect improves adoption across JVM and non-JVM clients 🔹 The benefits of NVIDIA’s GPU acceleration for Spark SQL and MLlib 🔹 A pattern for end-to-end accelerated ETL + ML in the cloud lakehouse setting 📅 October 29 ⏰ 9:30–10:30 AM PT 📍 Streaming on LinkedIn, X, and YouTube RSVP here 👉 https://luma.com/0zffq605 🎤 Erik Ordentlich, German "Gera" Shegalov, hosted by Jules Damji #opensource #sparkconnect #nvidia #apachespark #oss

Apache Spark

21,856 followers
2w

Join us on October 29 at 9:30 AM PT for a deep dive into Apache Spark™, Spark Connect, and Spark ML at NVIDIA! We’ll explore how Spark Connect, starting in Spark 3.4 and extended to MLlib in Spark 4.0+, enables new client connections and stability benefits. ✅ The session will also highlight how NVIDIA GPU-accelerated plugins for Spark SQL and ML deliver end-to-end acceleration with no code changes, with performance up to 9x faster at 80% lower cost. 📅 October 29 ⏰ 9:30–10:30 AM PT 📍 Live online (LinkedIn, X & YouTube) RSVP here 👉 https://luma.com/0zffq605 🎤 Speakers: Erik Ordentlich & German "Gera" Shegalov, hosted by Jules Damji #opensource #sparkconnect #nvidia #apachespark #oss

Spark Connect: NVIDIA Accelerator for Spark SQL and MLlib

www.linkedin.com

Apache Spark

21,856 followers
1w
Spark Connect, introduced in Apache Spark™ 3.4, brings a lighter, more flexible connection model for Spark applications. It is extended further to machine learning in Spark 4.0+. ✅ On October 29 at 9:30 AM PT, we’ll show how Spark Connect, combined with NVIDIA’s open source accelerated plugins, can bring GPU-powered ETL and ML pipelines to your lakehouse architecture with significant performance and cost benefits. This session will cover: 🔹 A working pattern for Spark Connect with accelerated ETL and ML for lakehouses 🔹 Practical architecture considerations for data engineering and ML pipelines 🔹 Real-world use cases for GPU-accelerated Spark 🔴 Join us live and bring your questions! 🎤 Erik Ordentlich, German "Gera" Shegalov (NVIDIA), Jules Damji (Databricks) RSVP here 👉 https://luma.com/0zffq605 #opensource #sparkconnect #nvidia #apachespark #oss

Apache Spark

21,856 followers
2w
In the Bay Area? 🎉 Join us November 13, 5–6:30 PM at the Databricks Mountain View Office for the Apache Spark Happy Hour! Meet contributors, committers & maintainers, enjoy bites, drinks + swag, and connect with the Spark community. 🙌 The happy hour kicks off right after the Open Lakehouse + AI Mini Summit, making it the perfect way to keep the conversations going. ➡️ Register here: https://luma.com/7or3ih6m We hope to see you there! #opensource #apachespark #oss #openlakehouse
Apache Spark

21,856 followers
2mo
Join us on September 25 for our webinar: 𝗔𝗽𝗮𝗰𝗵𝗲 𝗦𝗽𝗮𝗿𝗸™ 𝗮𝗻𝗱 𝗟𝗮𝗻𝗰𝗲 𝗦𝗽𝗮𝗿𝗸 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗼𝗿! 🚀 In this talk, Jack Ye (LanceDB) will share how the Lance Spark Connector brings Lance’s AI-native multimodal storage to Apache Spark™. We’ll cover how Spark can work efficiently with embeddings, images, videos, and documents using Lance’s random access, indexing, and vector/blob support. We’ll also explore integration with Hive Metastore, Unity Catalog, and more—plus real-world workflows for ingestion, analytics, feature engineering, and retrieval-augmented generation—all on the same dataset, without extra format conversions. 📅 September 25, 2025 ⏰ 9:30 – 10:30 AM PST 📍 Online #apachespark #lancedb #opensource #oss Jasmine Wang

Apache Spark™ and Lance Spark Connector

www.linkedin.com

15 Comments

Apache Spark

21,856 followers
3w
The countdown is on: just two days until our webinar this Thursday, Sept 25! 🚀 In this session, Jack Ye from LanceDB will share how the Lance Spark Connector brings Lance’s AI-native multimodal storage to Apache Spark™! 👏 The session will show how Lance’s random access, indexing, and vector/blob support can improve performance and simplify workflows. Topics include: 🔹 Using Lance with Spark for multimodal data (embeddings, images, video, documents) 🔹 Integration with Hive Metastore and Unity Catalog 🔹 Examples of workflows for ingestion, analytics, feature engineering, and retrieval-augmented generation 📅 September 25, 2025 ⏰ 9:30 – 10:30 AM PST 📍 Reserve your spot: https://luma.com/76o36xuk #opensource #apachespark #spark #oss #lancedb #lance Jasmine Wang Jules Damji

Apache Spark

21,856 followers
1mo
One week away! 🚨 Join us for 𝗔𝗽𝗮𝗰𝗵𝗲 𝗦𝗽𝗮𝗿𝗸™ 𝗮𝗻𝗱 𝘁𝗵𝗲 𝗟𝗮𝗻𝗰𝗲 𝗦𝗽𝗮𝗿𝗸 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗼𝗿 with Jack Ye (LanceDB) and Jules Damji on September 25! The session will cover how the Lance Spark Connector enables Apache Spark™ to work with Lance’s AI-native multimodal storage, including handling embeddings, images, videos, and documents with random access, indexing, and vector/blob support. Additionally, the webinar will dive into integrations with Hive Metastore, Unity Catalog, and examples of workflows for ingestion, analytics, feature engineering, and retrieval-augmented generation—using one dataset, without format conversions. 📅 September 25, 2025 ⏰ 9:30 – 10:30 AM PST 📍 Online 🔗 RSVP: https://luma.com/76o36xuk #apachespark #spark #oss #opensource #lancedb #lance #sparkconnector

Apache Spark

21,856 followers
1mo
Join us for our webinar on Apache Spark™ and Lance Spark Connector with Jack Ye (LanceDB) on September 25! 👏 Learn how the Lance Spark Connector enables Apache Spark™ to work with Lance’s AI-native multimodal storage. ✅ We’ll look at how Spark can handle embeddings, images, videos, and documents with random access, indexing, and vector/blob support. We'll also cover integration with Hive Metastore, Unity Catalog, and examples of workflows for ingestion, analytics, feature engineering, and retrieval-augmented generation—using one dataset, without format conversions. 📅 September 25, 2025 ⏰ 9:30 – 10:30 AM PST 📍 Online #apachespark #spark #oss #opensource #lancedb #lance #sparkconnector Jules Damji

Apache Spark

21,856 followers
1mo
Jules Damji

DevRel, Developer Education, Data Engineering & Distributed Computing. x-Hortonworks; x-Anyscale; O’Reilly Author ✍️ ; currently freelancing …
1mo Edited

Matei Zaharia gave a keynote address at #VLDB today (https://vldb.org/2025/) titled “Bringing the Operational and Analytical Worlds Together with Lakebase.” Apache Spark™ is a central engine in the combined worlds of OLTP and Lakehouse.
Apache Spark

21,856 followers
3mo
📣 Announcing: Apache Spark™ Python Data Source for Hugging Face AI Datasets! During this virtual event, you’ll learn how the Apache Spark™ 4.x Python Data Source API allows Hugging Face to extend datasets for AI workloads. Why attend? ✅ 𝗗𝗶𝘀𝗰𝗼𝘃𝗲𝗿 𝘁𝗵𝗲 𝗹𝗮𝘁𝗲𝘀𝘁 𝗦𝗽𝗮𝗿𝗸 𝟰.𝘅 𝗳𝗲𝗮𝘁𝘂𝗿𝗲𝘀, 𝗶𝗻𝗰𝗹𝘂𝗱𝗶𝗻𝗴 𝗔𝗿𝗿𝗼𝘄 𝘀𝘂𝗽𝗽𝗼𝗿𝘁 Learn the newest capabilities in Apache Spark™ 4.x, designed to boost performance for AI and data engineering workloads. ✅ 𝗟𝗲𝗮𝗿𝗻 𝗵𝗼𝘄 𝗛𝘂𝗴𝗴𝗶𝗻𝗴 𝗙𝗮𝗰𝗲’𝘀 𝗻𝗲𝘄 𝗗𝗮𝘁𝗮 𝗦𝗼𝘂𝗿𝗰𝗲 𝗲𝗻𝗮𝗯𝗹𝗲𝘀 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝗔𝗜 𝗱𝗮𝘁𝗮𝘀𝗲𝘁 𝘄𝗼𝗿𝗸𝗳𝗹𝗼𝘄𝘀 See how the integration with Hugging Face streamlines loading, processing, and sharing of AI datasets. ✅ 𝗔𝘀𝗸 𝘆𝗼𝘂𝗿 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀 𝗮𝗻𝗱 𝗴𝗲𝘁 𝗮𝗻𝘀𝘄𝗲𝗿𝘀 𝗶𝗻 𝗿𝗲𝗮𝗹-𝘁𝗶𝗺𝗲 𝗱𝘂𝗿𝗶𝗻𝗴 𝗮 𝗹𝗶𝘃𝗲 𝗤&𝗔 🗓️ Wednesday, August 27 🕔 9:30 AM PST 📍 Online Reserve your spot today ➡️ https://lu.ma/jfvpmqiy #opensource #oss #apachespark #huggingface #ai Quentin Lhoest Jules Damji

Apache Spark™ Python Data Source for Hugging Face AI Datasets

www.linkedin.com

17 Comments

Employees at Apache Spark

Hyukjin Kwon

ASF member, Apache Spark PMC member and committer. I am NOT interested in switching companies. I believe this company has great potential for…

Calili dos Santos Silva

Data Engineer @ XP Inc. | Data Science | Machine Learning Engineer | Python | SQL | R | Spark | Databricks | DevOps

Arul Gampala

Apache at Apache Spark

Syed Zia

Student at Osmania University, Hyderabad
