
🔵 Batch Processing Frameworks
| Tool Name | Category/Type | Key Features | Pricing | Link | 
|---|---|---|---|---|
| Apache Hadoop | Batch Processing | Distributed storage (HDFS), scalable MapReduce framework, data redundancy and fault tolerance | Free (Open-source) | hadoop.apache.org | 
| Apache Spark | Batch + Stream | In-memory processing, fast performance, supports MLlib, GraphX, and structured data with Spark SQL | Free (Open-source) | spark.apache.org | 
| Google Cloud Dataflow | Cloud Batch/Stream | Unified model for batch and stream processing, runs Apache Beam pipelines, autoscaling, real-time data processing | Pay-as-you-go | cloud.google.com/dataflow | 
| Azure Synapse Analytics | Cloud Data Processing | Integrated data warehousing and big data, Spark integration, analytics, and machine learning | Pay-as-you-go | azure.microsoft.com | 
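
The MapReduce model listed for Hadoop above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration. A real framework would run the map and reduce steps on many machines and move the grouped data between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(lines)))
```

Calling `word_count(["big data", "Big tools"])` groups both spellings of "big" under one key because the map step lowercases each word before emitting it.
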

🟢 Stream Processing Tools

| Tool Name | Category/Type | Key Features | Pricing | Link | 
|---|---|---|---|---|
| Apache Kafka | Distributed Streaming | Distributed publish-subscribe event log with durable storage, real-time stream processing via Kafka Streams | Free (Open-source) | kafka.apache.org | 
| Apache Flink | Stream + Batch | High-throughput, low-latency stream processing, event time semantics, stateful computation | Free (Open-source) | flink.apache.org | 
| Amazon Kinesis | Stream Processing | Real-time data streaming, video/audio streams, data ingestion for analytics, AI/ML pipelines | Pay-as-you-go | aws.amazon.com/kinesis | 
| Redpanda | Kafka API Compatible | Low-latency drop-in replacement for Kafka, optimized for real-time streaming workloads | Custom Pricing | redpanda.com | 
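
The "event time semantics" and "stateful computation" features listed for Flink above can be sketched as a tumbling-window count. This is a plain-Python illustration under simplifying assumptions, not Flink's API: the function name `tumbling_window_counts` and the `(event_time, key)` tuple layout are hypothetical, and the sketch ignores watermarks and late-data handling, which a real stream processor needs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping windows.

    Each event is (event_time_in_seconds, key). The window is chosen from the
    timestamp carried by the event, not from the order in which it arrives.
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)  # the running state
    for event_time, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)
```

Because the window comes from the event's own timestamp, an event that arrives out of order (for example, one stamped 9 arriving after one stamped 12) is still counted in the window starting at 0 when `window_seconds` is 10.
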

🟣 Data Storage & Query Engines

| Tool Name | Category/Type | Key Features | Pricing | Link | 
|---|---|---|---|---|
| HDFS (Hadoop Distributed File System) | Distributed Storage | Scalable, fault-tolerant storage system for Hadoop ecosystem | Free (Open-source) | hadoop.apache.org | 
| Amazon S3 | Object Storage | Scalable, secure, and cost-effective cloud object storage, integrated with AWS analytics | Pay-as-you-go | aws.amazon.com/s3 | 
| Trino (formerly PrestoSQL) | Distributed SQL | Distributed SQL query engine for big data, supports a wide variety of data sources | Free (Open-source) | trino.io | 
| Apache Hive | Data Warehouse | SQL-like querying on Hadoop, supports batch processing and big data analysis | Free (Open-source) | hive.apache.org | 
| ClickHouse | Columnar Database | High-speed OLAP database management system for real-time analytics | Free (Open-source) | clickhouse.com | 

🟡 Machine Learning & Advanced Analytics on Big Data

| Tool Name | Category/Type | Key Features | Pricing | Link | 
|---|---|---|---|---|
| MLlib (Spark MLlib) | Machine Learning | Distributed machine learning library in Apache Spark, supports classification, regression, clustering, and collaborative filtering | Free (Open-source) | spark.apache.org/mllib | 
| H2O.ai | AutoML on Big Data | Scalable machine learning and deep learning platform, AutoML, Spark integration | Free + Enterprise pricing | h2o.ai | 
| DataRobot | AutoML Platform | End-to-end automation for machine learning on big datasets, model deployment and monitoring | Custom Pricing | datarobot.com | 

🟤 Data Orchestration & Workflow Tools for Big Data

| Tool Name | Category/Type | Key Features | Pricing | Link | 
|---|---|---|---|---|
| Apache Airflow | Workflow Orchestration | Programmatically author workflows as DAGs, schedule and monitor big data pipelines | Free (Open-source) | airflow.apache.org | 
| Prefect | Data Orchestration | Python-native dataflow orchestration, event-based triggers, parallel execution | Free + Paid plans | prefect.io | 
| Dagster | Orchestrator | Type-safe data orchestrator, designed for data engineering workflows | Free + Enterprise plans | dagster.io | 
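
The idea of authoring workflows as DAGs, listed for Airflow above, comes down to running each task only after the tasks it depends on. The sketch below uses Python's standard-library `graphlib` rather than any orchestrator's API. The function name `run_pipeline` and the task names in the usage note are hypothetical, and scheduling, retries, and monitoring are left out.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> List[str]:
    """Run every task after all of its upstream tasks; return the order used.

    `dependencies` maps a task name to the set of task names it depends on.
    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    so an invalid (non-DAG) workflow fails before any task runs.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()  # call the task's function
    return order
```

For a linear pipeline where `transform` depends on `extract` and `load` depends on `transform`, the returned order is `extract`, `transform`, `load`.
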

🔴 Big Data Cloud Platforms (Managed Services)

| Tool Name | Category/Type | Key Features | Pricing | Link | 
|---|---|---|---|---|
| Google BigQuery | Serverless Data Warehouse | Fully managed data warehouse, ANSI SQL support, real-time analytics, seamless scaling | Pay-per-use | cloud.google.com/bigquery | 
| Amazon Redshift | Data Warehouse | Fast, scalable data warehouse, machine learning integration, columnar storage | Starts at $0.25/hour | aws.amazon.com/redshift | 
| Snowflake | Cloud Data Platform | Elastic compute and storage, cross-cloud deployment, built-in security, data sharing | Pay-as-you-go | snowflake.com | 
| Databricks | Unified Analytics | Apache Spark-based unified data analytics platform, machine learning, lakehouse architecture | Pay-as-you-go | databricks.com | 

✅ Categories Recap

| Category | Description | 
|---|---|
| Batch Processing Frameworks | Handle large-scale data processing in batches (Hadoop, Spark) | 
| Stream Processing Tools | Real-time processing of data streams for fast analytics (Kafka, Flink) | 
| Data Storage & Query Engines | Storage and fast querying solutions for massive data (HDFS, Trino, ClickHouse) | 
| ML & Advanced Analytics on Big Data | Machine learning and AI tools optimized for large-scale data analytics (MLlib, H2O.ai) | 
| Data Orchestration & Workflow Tools | Automation and orchestration for complex big data pipelines (Airflow, Prefect) | 
| Big Data Cloud Platforms | Fully managed platforms offering storage, processing, and analytics at scale (BigQuery, Snowflake) | 
 
 