🔥 Apache Spark Interview Questions and Answers (2025) | JaganInfo
Basic Level Questions
What is Apache Spark?
Apache Spark is an open-source unified analytics engine designed for large-scale data processing with built-in modules for streaming, SQL, machine learning, and graph processing.
What are the main features of Apache Spark?
Fast in-memory computation, ease of use, support for multiple languages (Scala, Java, Python, R), advanced analytics, and compatibility with the Hadoop ecosystem.
What is an RDD?
Resilient Distributed Dataset (RDD) is the fundamental data abstraction in Spark representing an immutable distributed collection of objects that can be processed in parallel.
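For illustration, a minimal sketch (assuming spark-shell, where the SparkContext `sc` is already defined):

```scala
// Turn a local collection into an RDD spread over 4 partitions.
val numbers = sc.parallelize(1 to 100, 4)

// RDDs are immutable: map returns a new RDD rather than modifying the original.
val squares = numbers.map(n => n * n)

// Partitions are processed in parallel by the executors.
println(squares.take(5).mkString(", "))   // 1, 4, 9, 16, 25
```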
What languages does Apache Spark support?
Spark supports Scala, Java, Python (PySpark), and R languages.
What is the Spark driver?
The Spark driver is the process that runs the application's main() function: it creates the SparkContext/SparkSession, builds the execution plan, schedules tasks, and coordinates the executors.
What are transformations and actions in Spark?
Transformations are lazy operations on RDDs producing new RDDs (e.g., map, filter), while actions trigger execution and return results (e.g., count, collect).
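A small spark-shell sketch of both kinds of operations (nothing runs until an action is called):

```scala
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "flink"))

// Transformations are lazy: only the lineage is recorded here, no job runs yet.
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Actions trigger execution and return results to the driver.
counts.collect().foreach(println)   // (spark,2), (hadoop,1), (flink,1)
println(words.count())              // 4
```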
What is lazy evaluation in Spark?
Spark delays computation until an action is called, optimizing execution plans and reducing unnecessary computation.
What is the Spark cluster manager?
Spark supports Standalone, Apache Mesos, Hadoop YARN, and Kubernetes cluster managers to allocate resources and schedule tasks.
What is the difference between DataFrame and Dataset?
DataFrames are distributed collections organized into named columns (similar to tables), and Datasets provide statically typed APIs with compile-time type safety, building on DataFrames.
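A minimal sketch of the difference, assuming spark-shell (the `Person` case class is just an example):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Person(name: String, age: Int)

val df: DataFrame = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
val ds: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 29)).toDS()

// DataFrame: columns are resolved at runtime, so a typo in "age" fails only when executed.
df.filter($"age" > 30).show()

// Dataset: a typed lambda checked at compile time (person.age is a real field of Person).
ds.filter(person => person.age > 30).show()
```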
How does Spark handle fault tolerance?
Through lineage information of RDD transformations, Spark can recompute lost partitions upon node failure.
Intermediate Level Questions
What is a Shuffle in Spark?
Shuffle is the process of redistributing data across partitions and nodes; it is required by wide transformations such as reduceByKey or join and can be expensive.
What are broadcast variables in Spark?
Broadcast variables allow the programmer to efficiently send large read-only data to all worker nodes for use in tasks.
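For example (spark-shell sketch with a made-up lookup table):

```scala
// Small read-only lookup table we want shipped to each executor once.
val countryNames = Map("US" -> "United States", "IN" -> "India", "DE" -> "Germany")
val broadcastNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("US", "IN", "DE", "US"))

// Tasks read broadcastNames.value instead of receiving a copy of the map with every task.
val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))
resolved.collect().foreach(println)
```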
What are accumulators in Spark?
Accumulators are shared variables that tasks can only add to, while the driver reads their value; they are used to implement counters and sums across tasks.
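A minimal spark-shell sketch counting bad records with an accumulator:

```scala
import scala.util.Try

val badRecords = sc.longAccumulator("badRecords")
val lines = sc.parallelize(Seq("1", "2", "oops", "4"))

val parsed = lines.map { s =>
  Try(s.toInt).getOrElse { badRecords.add(1); 0 }   // tasks may only add to the accumulator
}

parsed.count()              // accumulator updates are applied when an action runs
println(badRecords.value)   // only the driver reads the final value
```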
Explain the architecture of Spark.
Spark's architecture consists of the Driver, which builds the execution plan, a Cluster Manager that allocates resources, and Executors on worker nodes that run tasks and cache data, on top of pluggable storage such as HDFS.
What is Spark SQL?
Spark SQL is a Spark module for structured data processing, supporting SQL queries, DataFrames, and Datasets.
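For example, the same aggregation expressed as SQL and as the DataFrame API (spark-shell sketch with made-up data):

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

val sales = Seq(("laptop", 1200.0), ("phone", 800.0), ("laptop", 1500.0))
  .toDF("product", "amount")

// Register a temporary view and query it with plain SQL.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product").show()

// The equivalent DataFrame API goes through the same Catalyst optimizer.
sales.groupBy("product").agg(sum("amount").alias("total")).show()
```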
What is Catalyst optimizer?
Catalyst is Spark SQL’s query optimizer that applies advanced techniques to optimize logical and physical query plans.
What storage levels does Spark support?
Spark supports various storage levels for caching RDDs/DataFrames including MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.
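For example (spark-shell sketch with in-memory sample data):

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.parallelize(Seq("INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs.
val errors = logs.filter(_.contains("ERROR")).cache()

// An explicit level when memory is tight: overflow partitions spill to disk instead of being recomputed.
val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count())   // first action materializes the cache
println(errors.count())   // served from memory, no recomputation
warnings.unpersist()
```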
How is Spark Streaming different from Apache Storm?
Spark Streaming performs micro-batch based processing while Storm processes real-time data streams record by record.
What is Structured Streaming in Spark?
Structured Streaming provides scalable and fault-tolerant stream processing based on the Spark SQL engine using high-level APIs.
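A minimal word-count sketch on a socket source (assumes something like `nc -lk 9999` feeding localhost:9999; host and port are illustrative):

```scala
import org.apache.spark.sql.functions._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame API as batch: split lines into words and count them.
val wordCounts = lines
  .select(explode(split(col("value"), " ")).alias("word"))
  .groupBy("word")
  .count()

// Start the query; the result table is updated as new data arrives.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```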
How does Spark integrate with Hadoop?
Spark can run on Hadoop YARN, read data from HDFS, and replace MapReduce jobs to increase performance.
What is DAG Scheduler in Spark?
DAG Scheduler divides operators into stages with shuffle boundaries and submits them as tasks to Task Scheduler for execution.
What is the Tungsten execution engine?
Tungsten is Spark’s low-level execution engine optimizing CPU and memory usage with bytecode generation and cache-aware computation.
What is partitioning in Spark?
Partitioning controls how data is distributed across the cluster and directly impacts parallelism and shuffling.
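For example (spark-shell sketch; the numbers are illustrative):

```scala
import org.apache.spark.sql.functions.col

val events = spark.range(0, 1000000).toDF("id")
println(events.rdd.getNumPartitions)   // current partition count

// repartition(n) performs a full shuffle to reach exactly n partitions (more parallelism).
val wide = events.repartition(200)

// coalesce(n) merges existing partitions without a shuffle (useful before writing few output files).
val narrow = wide.coalesce(10)

// Repartitioning by an expression co-locates equal keys, which helps later joins/aggregations.
val byKey = events.repartition(50, col("id") % 10)
```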
What is the role of the Task Scheduler?
The Task Scheduler launches tasks on executors, taking cluster resources and data locality into account to optimize execution.
How do you debug Spark applications?
By examining driver and executor logs, the Spark web UI, the Spark History Server, and Spark event logs.
What is the role of the Spark Shell?
The Spark shell is an interactive REPL environment for writing and testing Spark code in Scala (spark-shell) or Python (pyspark).
Explain checkpointing in Spark Streaming.
Checkpointing periodically saves streaming metadata and data to reliable storage to recover from failures.
What is Structured Streaming’s trigger mechanism?
Triggers define when a micro-batch query runs: at fixed intervals, once (or available-now), or in continuous processing mode.
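For example (sketch using the built-in `rate` test source):

```scala
import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

// Micro-batches started every 30 seconds.
stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

// Other options: Trigger.AvailableNow() processes available data and stops;
// Trigger.Continuous("1 second") enables experimental continuous processing.
```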
How is memory management handled in Spark?
Spark uses unified memory management: the heap is divided into execution and storage regions that can borrow from each other dynamically, and Tungsten's off-heap allocation further reduces GC overhead.
What is Catalyst optimizer’s role in Spark SQL?
It analyzes and optimizes query plans applying rules-based and cost-based transformations for efficient execution.
Advanced Level Questions
Explain Spark’s Catalyst optimizer internals.
Catalyst uses abstract syntax trees, rule-based and cost-based optimization, and code generation for efficient query execution.
How does Spark achieve fault tolerance?
Fault tolerance is achieved by RDD lineage graphs enabling recomputation of lost partitions in case of failures.
Describe the Tungsten project’s significance.
Tungsten improves Spark’s CPU and memory efficiency using whole-stage code generation, cache-aware computation, and off-heap memory management.
How does Spark handle skewed data?
Skewed data can be handled by salting keys, using custom partitioners, broadcasting small tables, and optimizing shuffle operations.
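A sketch of key salting, assuming a large skewed DataFrame `facts` and a small DataFrame `dims` joined on `user_id` (all names hypothetical); in Spark 3+, enabling spark.sql.adaptive.skewJoin.enabled also lets AQE split skewed partitions automatically:

```scala
import org.apache.spark.sql.functions._

val saltBuckets = 16   // illustrative; size to the observed skew

// Add a random salt to the skewed side so one hot key is spread over many partitions.
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate the other side once per salt value so every salted key still finds its match.
val saltedDims = dims.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

val joined = saltedFacts.join(saltedDims, Seq("user_id", "salt")).drop("salt")
```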
Explain the difference between narrow and wide dependencies in Spark.
Narrow dependencies allow pipelined execution within a stage; wide dependencies involve shuffle between stages and require data redistributions.
How does Spark Streaming handle backpressure?
Backpressure mechanisms throttle data ingestion rates to prevent overload and ensure stability in streaming jobs.
What is Structured Streaming’s watermark concept?
Watermarks specify how late data is allowed to arrive for event-time windows, helping manage state size and correctness.
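For example, a sketch using the built-in `rate` source as a stand-in for an event-time stream (column names illustrative):

```scala
import org.apache.spark.sql.functions._

val events = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTime")

// Accept events up to 10 minutes late; state for windows older than that can be dropped.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()

counts.writeStream.outputMode("append").format("console").start()
```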
How does Spark integrate with machine learning?
Via MLlib, Spark offers scalable machine learning algorithms and pipelines with support for feature extraction, model training, and evaluation.
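A minimal MLlib pipeline sketch (spark-shell, toy data):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import spark.implicits._

val training = Seq(
  (0L, "spark rdd dataframe", 1.0),
  (1L, "mapreduce hive pig", 0.0)
).toDF("id", "text", "label")

// Pipeline: feature extraction stages followed by an estimator.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
model.transform(training).select("text", "prediction").show()
```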
Explain the execution flow of a Spark job.
The driver builds the DAG of transformations; the DAG Scheduler splits it into stages at shuffle boundaries and submits tasks to the Task Scheduler, which runs them on executors; results are collected and returned to the driver.
How do you optimize Spark SQL queries?
Use broadcast joins, cache DataFrames, avoid shuffles, filter early, and leverage Catalyst optimizer capabilities.
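A sketch of a few of these, assuming a large `orders` DataFrame and a small `countries` dimension table (both hypothetical):

```scala
import org.apache.spark.sql.functions.{broadcast, col}

// Filter early so less data flows through the rest of the plan.
val recent = orders.filter(col("order_date") >= "2025-01-01")

// Broadcast the small side: each executor gets a copy and the shuffle join is avoided.
val enriched = recent.join(broadcast(countries), Seq("country_code"))

// Cache only what is reused, and check the plan Catalyst produced.
enriched.cache()
enriched.explain()
```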
What strategies do you use to tune Spark cluster performance?
Tune memory, number of executor cores, parallelism, shuffle partitions, and leverage optimized serialization.
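For illustration, a few commonly tuned settings via the SparkSession builder (values are placeholders; the right numbers depend on the cluster and workload, and in practice they are often passed to spark-submit instead):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TunedJob")
  .config("spark.executor.memory", "8g")            // per-executor heap
  .config("spark.executor.cores", "4")              // concurrent tasks per executor
  .config("spark.sql.shuffle.partitions", "400")    // default 200; size to the shuffled data volume
  .config("spark.default.parallelism", "400")       // RDD-level default parallelism
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```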
How does Spark’s memory management work?
Spark uses unified memory regions for execution and storage with dynamic adjustment to handle caching and computation efficiently.
Explain the role of cluster managers in Spark.
Cluster managers like YARN, Mesos, and Kubernetes allocate resources and schedule Spark executors across cluster nodes.
What is Catalyst’s rule-based optimization?
It applies logical optimization rules to rearrange and simplify query plans before physical planning.
How does Spark handle streaming checkpointing?
Spark checkpoints metadata and intermediate data to reliable storage to enable recovery and fault tolerance in streaming jobs.
What is Tungsten’s off-heap memory?
Off-heap memory allows Spark to store data outside the JVM heap, reducing garbage collection overhead and improving performance.
Explain how Spark handles job scheduling.
The DAG Scheduler groups tasks into stages, and the Task Scheduler places them on executors, taking data locality and available resources into account.
What is the significance of lineage in Spark?
Lineage provides deterministic fault recovery by tracking the sequence of transformations needed to recompute lost data partitions.
How do you handle skewed joins in Spark?
By salting keys, broadcasting small datasets, or applying skew join hints to reduce hot-spotting during join operations.
Explain the differences between Spark SQL and Hive.
Spark SQL supports in-memory processing and interactive queries with the Catalyst optimizer, whereas Hive traditionally compiles queries to MapReduce jobs with higher latency; Spark SQL also integrates better with streaming and machine learning.