30 Real-Time Data Analyst Interview Questions and Answers – Basic and Advanced (2025)

Here are 30 real-time interview questions and answers for Data Analysts, divided into Basic and Advanced categories. The focus is on SQL and covers data collection, data cleaning, filtering, handling missing values, and preparing datasets for analysis.


Basic Data Analyst Interview Questions

1. What is data collection, and why is it important?

Answer:
Data collection is the process of gathering information from various sources to be used for analysis. It is crucial because high-quality, relevant, and accurate data ensures reliable insights and decision-making.

2. What are different sources of data collection?

Answer:
Common data sources include:

  • Structured data: Databases (SQL, NoSQL)
  • Semi-structured data: Logs, JSON, CSV files
  • APIs: RESTful, SOAP APIs
  • Web Scraping: Python (BeautifulSoup, Scrapy)
  • Manual Entry: Surveys, spreadsheets

3. What is data cleaning?

Answer:
Data cleaning involves identifying and correcting (or removing) inaccurate, incomplete, or inconsistent data in a dataset to improve data quality before analysis.

4. What are common data quality issues?

Answer:

  • Missing values
  • Duplicate records
  • Inconsistent formatting
  • Outliers
  • Incorrect data types

5. How do you handle duplicate records in SQL?

Answer:

DELETE FROM sales_data 
WHERE id NOT IN (
    SELECT MIN(id) FROM sales_data 
    GROUP BY customer_name, purchase_date
);

This deletes duplicate rows while keeping the first occurrence.

6. How do you find and remove missing values in SQL?

Answer:
Finding missing values:

SELECT * FROM customers WHERE email IS NULL;

Removing rows with missing values:

DELETE FROM customers WHERE email IS NULL;

7. How can you replace missing values in SQL?

Answer:
Using COALESCE() to replace NULLs with a default value:

SELECT id, COALESCE(salary, 0) AS salary FROM employees;

8. How do you standardize date formats in SQL?

Answer:
TO_DATE() parses a string column into a proper DATE using the given format; TO_CHAR() does the reverse, rendering dates as uniformly formatted text.

SELECT TO_DATE(date_column, 'YYYY-MM-DD') FROM orders;

9. How do you filter data based on conditions in SQL?

Answer:

SELECT * FROM employees WHERE department = 'Sales' AND salary > 50000;

10. How do you sort data in SQL?

Answer:

SELECT * FROM employees ORDER BY salary DESC;

11. What is the difference between INNER JOIN and LEFT JOIN?

Answer:

  • INNER JOIN: Returns only matching rows from both tables.
  • LEFT JOIN: Returns all rows from the left table, even if there are no matches in the right table.
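The difference is easy to see with a toy example. Here is a minimal Python/sqlite3 sketch (tables and rows are invented for illustration; Bob has no orders):

```python
import sqlite3

# Two tiny hypothetical tables: Bob has no orders.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers(customer_id INTEGER, name TEXT);
CREATE TABLE orders(order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES (101, 1);
""")

inner = con.execute("""SELECT c.name, o.order_id FROM customers c
                       INNER JOIN orders o ON c.customer_id = o.customer_id
                       ORDER BY c.customer_id""").fetchall()
left = con.execute("""SELECT c.name, o.order_id FROM customers c
                      LEFT JOIN orders o ON c.customer_id = o.customer_id
                      ORDER BY c.customer_id""").fetchall()

print(inner)  # [('Alice', 101)]
print(left)   # [('Alice', 101), ('Bob', None)]
```

INNER JOIN drops Bob entirely, while LEFT JOIN keeps him with NULL (Python None) in the order columns.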

12. How do you find NULL values after performing a LEFT JOIN?

Answer:

SELECT a.*, b.order_id
FROM customers a
LEFT JOIN orders b ON a.customer_id = b.customer_id
WHERE b.order_id IS NULL;

13. How do you remove special characters from a string in SQL?

Answer:

SELECT REGEXP_REPLACE(name, '[^a-zA-Z0-9 ]', '') FROM customers;

14. How do you detect and remove outliers in SQL?

Answer:
Using the Interquartile Range (IQR): values more than 1.5 × IQR below Q1 or above Q3 are treated as outliers.

WITH bounds AS (
    SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
           PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS q3
    FROM sales
)
SELECT s.*
FROM sales s, bounds b
WHERE s.amount < b.q1 - 1.5 * (b.q3 - b.q1)
   OR s.amount > b.q3 + 1.5 * (b.q3 - b.q1);

15. How do you standardize text case in SQL?

Answer:

SELECT LOWER(customer_name) FROM customers;

16. How do you convert data types in SQL?

Answer:

SELECT CAST(salary AS FLOAT) FROM employees;

17. How do you calculate the percentage of missing values in SQL?

Answer:

SELECT 
    (COUNT(*) - COUNT(email)) * 100.0 / COUNT(*) AS missing_percentage
FROM customers;
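This works because COUNT(*) counts all rows while COUNT(email) skips NULLs. A quick sqlite3 check with made-up rows (two of four emails missing):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers(id INTEGER, email TEXT)")
# Hypothetical rows: two of four emails are missing.
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, 'a@x.com'), (2, None), (3, None), (4, 'b@x.com')])

pct, = con.execute("""
    SELECT (COUNT(*) - COUNT(email)) * 100.0 / COUNT(*) FROM customers
""").fetchone()
print(pct)  # 50.0
```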

18. How do you merge datasets in SQL?

Answer:

SELECT * FROM sales s
JOIN customers c ON s.customer_id = c.customer_id;

19. How do you replace empty strings with NULL in SQL?

Answer:

UPDATE employees 
SET phone_number = NULL 
WHERE phone_number = '';

20. How do you remove trailing spaces from a column in SQL?

Answer:

SELECT TRIM(name) FROM customers;

Advanced Data Analyst Interview Questions

21. How do you handle inconsistent categorical values in SQL?

Answer:
Using CASE:

SELECT 
    CASE 
        WHEN gender IN ('M', 'Male') THEN 'Male'
        WHEN gender IN ('F', 'Female') THEN 'Female'
        ELSE 'Other'
    END AS standardized_gender
FROM customers;

22. How do you check for duplicates across multiple columns in SQL?

Answer:

SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;

23. How do you extract only numeric values from a column in SQL?

Answer:

SELECT REGEXP_REPLACE(phone_number, '[^0-9]', '') FROM customers;

24. How do you detect anomalies using Z-score in SQL?

Answer:

SELECT *, 
       (amount - (SELECT AVG(amount) FROM sales)) / (SELECT STDDEV(amount) FROM sales) AS z_score
FROM sales;

Rows with an absolute z-score greater than 3 are typically flagged as anomalies.

25. How do you fill missing values using the average in SQL?

Answer:

UPDATE employees 
SET salary = (SELECT AVG(salary) FROM employees)
WHERE salary IS NULL;

26. How do you pivot data in SQL?

Answer:

SELECT department, 
       SUM(CASE WHEN gender = 'Male' THEN salary ELSE 0 END) AS male_salary,
       SUM(CASE WHEN gender = 'Female' THEN salary ELSE 0 END) AS female_salary
FROM employees
GROUP BY department;
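This conditional-aggregation pivot can be verified with sqlite3 (the departments and salaries below are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees(department TEXT, gender TEXT, salary REAL)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [('Sales', 'Male', 100), ('Sales', 'Female', 120), ('HR', 'Female', 90)])

rows = con.execute("""
    SELECT department,
           SUM(CASE WHEN gender = 'Male' THEN salary ELSE 0 END) AS male_salary,
           SUM(CASE WHEN gender = 'Female' THEN salary ELSE 0 END) AS female_salary
    FROM employees
    GROUP BY department
    ORDER BY department
""").fetchall()
print(rows)
```

Each department becomes one output row, with one column per gender.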

27. How do you efficiently delete large datasets in SQL?

Answer:
Using batch deletes so each transaction stays small (MySQL syntax; PostgreSQL does not allow LIMIT on DELETE, so delete by key via a subquery there):

DELETE FROM logs WHERE created_at < NOW() - INTERVAL 1 YEAR LIMIT 10000;

28. What are CTEs and when should you use them?

Answer:
A Common Table Expression (CTE) improves query readability and allows recursive queries:

WITH recent_sales AS (
    SELECT * FROM sales WHERE purchase_date > '2023-01-01'
)
SELECT * FROM recent_sales;

29. How do you normalize a dataset using SQL?

Answer:

SELECT id, 
       (salary - MIN(salary) OVER()) / (MAX(salary) OVER() - MIN(salary) OVER()) AS normalized_salary
FROM employees;

30. What are the best practices for handling NULL values in SQL?

Answer:

  • Use COALESCE() to replace NULLs
  • Avoid NULLs in primary keys
  • Define default values when creating tables so NULLs never enter in the first place
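COALESCE() is the most portable of these. A one-minute sqlite3 sketch (the sample rows are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees(id INTEGER, salary REAL)")
con.executemany("INSERT INTO employees VALUES (?, ?)", [(1, 1000.0), (2, None)])

# COALESCE substitutes 0 wherever salary is NULL.
rows = con.execute("SELECT id, COALESCE(salary, 0) FROM employees ORDER BY id").fetchall()
print(rows)  # [(1, 1000.0), (2, 0)]
```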

Few Problems explained Step-by-Step

Question 5: How do you handle duplicate records in SQL?

Understanding the Problem

When working with datasets, sometimes duplicate records (rows with the same values in certain columns) appear. This can cause data inconsistency and inaccurate analysis. We need a way to identify and remove duplicates while keeping only one unique entry.


Step-by-Step Approach to Removing Duplicates in SQL

Step 1: Identify Duplicates

To check for duplicate records, use the GROUP BY clause on columns that should be unique and use COUNT(*) to find duplicates.

SELECT customer_name, purchase_date, COUNT(*) AS duplicate_count
FROM sales_data
GROUP BY customer_name, purchase_date
HAVING COUNT(*) > 1;

🔹 What does this do?

  • Groups the data by customer_name and purchase_date (you can change these columns based on your dataset).
  • Counts how many times each combination appears.
  • Shows only records where the count is greater than 1 (duplicates exist).

Step 2: Remove Duplicates While Keeping One Entry

To remove duplicates while keeping the first occurrence, we use the DELETE statement with a subquery.

DELETE FROM sales_data 
WHERE id NOT IN (
    SELECT MIN(id) 
    FROM sales_data 
    GROUP BY customer_name, purchase_date
);

🔹 Breaking it down:

  1. The subquery: SELECT MIN(id) FROM sales_data GROUP BY customer_name, purchase_date
    • Finds the smallest (earliest) id for each duplicate group.
    • Keeps only one unique row for each customer_name, purchase_date combination.
  2. The DELETE statement: DELETE FROM sales_data WHERE id NOT IN (...)
    • Deletes all rows except the first occurrence (smallest id).

Step 3: Alternative Approach Using ROW_NUMBER() (For Large Datasets)

If your database supports window functions (like MySQL 8+, PostgreSQL, SQL Server, etc.), you can use ROW_NUMBER() to assign a row number to each duplicate and delete all but the first.

WITH duplicates AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_name, purchase_date ORDER BY id) AS row_num
    FROM sales_data
)
DELETE FROM sales_data WHERE id IN (SELECT id FROM duplicates WHERE row_num > 1);

🔹 How it works:

  • The ROW_NUMBER() function assigns a unique number (row_num) to each row within a duplicate group.
  • It keeps only the row with row_num = 1 and deletes all others.

Final Thoughts

If you want to keep one row and remove all other duplicates:

  • Use DELETE with MIN(id).
  • Use ROW_NUMBER() for better efficiency in large datasets.
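The MIN(id) approach above can be exercised end-to-end in sqlite3 (the duplicate rows are fabricated for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_data(id INTEGER, customer_name TEXT, purchase_date TEXT);
INSERT INTO sales_data VALUES
  (1, 'Alice', '2023-01-01'),
  (2, 'Alice', '2023-01-01'),  -- duplicate of row 1
  (3, 'Bob',   '2023-01-02');
""")

# Keep the smallest id per (customer_name, purchase_date) group.
con.execute("""DELETE FROM sales_data WHERE id NOT IN
               (SELECT MIN(id) FROM sales_data
                GROUP BY customer_name, purchase_date)""")

rows = con.execute("SELECT id FROM sales_data ORDER BY id").fetchall()
print(rows)  # [(1,), (3,)]
```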

Question 13: How do you remove special characters from a string in SQL?

Understanding the Problem

When working with text data, sometimes values contain special characters like @, #, $, %, &, *, !, etc. This can create issues during data processing and analysis. We need a way to clean the text by removing all non-alphanumeric characters (keeping only letters and numbers).


Step-by-Step Approach to Removing Special Characters in SQL

Step 1: Identify Special Characters in a Column

To check if a column contains special characters, we can use a pattern-matching query. SQL Server supports character classes inside LIKE:

SELECT name 
FROM customers 
WHERE name LIKE '%[^a-zA-Z0-9 ]%';

(In PostgreSQL or MySQL, use a regex match instead, e.g. WHERE name ~ '[^a-zA-Z0-9 ]' or WHERE name REGEXP '[^a-zA-Z0-9 ]'.)

🔹 What this does:

  • The pattern [^a-zA-Z0-9 ] searches for any character that is NOT a letter (a-z, A-Z), number (0-9), or space ( ).
  • The % means we are looking for those characters anywhere in the text.
  • This helps identify rows that have special characters.

Step 2: Remove Special Characters Using REGEXP_REPLACE()

Most modern databases (PostgreSQL, MySQL 8+, Oracle) support the REGEXP_REPLACE() function to replace unwanted characters; for databases without it, see the REPLACE() alternative further down.

SELECT name, REGEXP_REPLACE(name, '[^a-zA-Z0-9 ]', '') AS cleaned_name
FROM customers;

🔹 Breaking it down:

  • REGEXP_REPLACE(name, '[^a-zA-Z0-9 ]', ''):
    • Finds all non-alphanumeric characters.
    • Replaces them with an empty string (''), effectively removing them.
  • The query returns the original name and the cleaned version.

Example:

| name (original) | cleaned_name (after transformation) |
|---|---|
| John@Doe! | JohnDoe |
| Jane#Smith | JaneSmith |
| Alice & Bob | Alice Bob |

Step 3: Update the Table to Clean the Data

If we want to permanently update the database and remove special characters, we use UPDATE:

UPDATE customers 
SET name = REGEXP_REPLACE(name, '[^a-zA-Z0-9 ]', '');

🔹 What happens here?

  • Every row in the customers table is updated to remove special characters from the name column.

Alternative for Databases Without REGEXP_REPLACE()

If your SQL database does not support REGEXP_REPLACE(), you may need multiple REPLACE() functions:

SELECT name, 
       REPLACE(REPLACE(REPLACE(name, '@', ''), '#', ''), '!', '') AS cleaned_name
FROM customers;

🔹 Issue with this approach?

  • You must manually specify each special character to remove.
  • It’s not scalable if there are many different special characters.

Final Thoughts

To clean text and remove special characters efficiently:

  • Use REGEXP_REPLACE() in PostgreSQL, MySQL 8+, and Oracle.
  • Use multiple REPLACE() functions if your database does not support regex.
  • Use UPDATE if you want to store the cleaned data permanently.
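The same pattern maps directly to Python's re.sub, which is handy for sanity-checking the regex before running an UPDATE (the sample names are invented; note that stripping '&' from 'Alice & Bob' leaves a double space):

```python
import re

names = ["John@Doe!", "Jane#Smith", "Alice & Bob"]
# Same character class as the SQL REGEXP_REPLACE example.
cleaned = [re.sub(r'[^a-zA-Z0-9 ]', '', n) for n in names]
print(cleaned)  # ['JohnDoe', 'JaneSmith', 'Alice  Bob']
```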

Question 14: How do you detect and remove outliers in SQL?

Understanding the Problem

Outliers are data points that are significantly different from other observations in a dataset. They can occur due to errors, unusual behavior, or rare events. Detecting and handling outliers is important for accurate data analysis.

One common method to detect outliers is using the Interquartile Range (IQR) method.


Step-by-Step Approach to Detecting and Removing Outliers Using IQR in SQL

Step 1: Understand the IQR Method

The Interquartile Range (IQR) method helps detect outliers based on quartiles:

  • Q1 (25th percentile): The value below which 25% of the data falls.
  • Q3 (75th percentile): The value below which 75% of the data falls.
  • IQR = Q3 – Q1 (Interquartile Range).
  • Lower Bound = Q1 – 1.5 * IQR (Anything below this is an outlier).
  • Upper Bound = Q3 + 1.5 * IQR (Anything above this is an outlier).

🔹 Goal: Find values that are below the lower bound or above the upper bound.


Step 2: Calculate Q1, Q3, and IQR in SQL

To find Q1, Q3, and IQR for a column (e.g., amount in a sales table), we use PERCENTILE_CONT() (in PostgreSQL, SQL Server) or an approximate equivalent such as APPROX_QUANTILES() (in BigQuery).

SELECT 
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3
FROM sales;

🔹 Breaking it down:

  • PERCENTILE_CONT(0.25): Finds Q1 (25th percentile).
  • PERCENTILE_CONT(0.75): Finds Q3 (75th percentile).

Example Output:

| Q1 | Q3 |
|---|---|
| 200 | 800 |

🔹 Now, calculate IQR:

SELECT 
    (Q3 - Q1) AS IQR
FROM (
    SELECT 
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3
    FROM sales
) AS subquery;

Example Output:

| IQR |
|---|
| 600 |

Step 3: Identify Outliers

Using Q1, Q3, and IQR, calculate the lower and upper bounds:

SELECT * FROM sales
WHERE amount < (SELECT Q1 - 1.5 * IQR FROM (
    SELECT 
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3,
        (PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) - 
         PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount)) AS IQR
    FROM sales
) AS subquery)
OR amount > (SELECT Q3 + 1.5 * IQR FROM (
    SELECT 
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3,
        (PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) - 
         PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount)) AS IQR
    FROM sales
) AS subquery);

🔹 What this does:

  • Finds values that are below (Q1 – 1.5 * IQR) or above (Q3 + 1.5 * IQR).
  • Returns all records that are outliers.

Example Output (Outliers):

| id | amount |
|---|---|
| 1 | 50 |
| 10 | 3000 |

Step 4: Remove Outliers from the Table

To permanently remove outliers:

DELETE FROM sales
WHERE amount < (SELECT Q1 - 1.5 * IQR FROM (
    SELECT 
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3,
        (PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) - 
         PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount)) AS IQR
    FROM sales
) AS subquery)
OR amount > (SELECT Q3 + 1.5 * IQR FROM (
    SELECT 
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3,
        (PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) - 
         PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount)) AS IQR
    FROM sales
) AS subquery);

🔹 This deletes all rows where amount is an outlier.


Alternative Approach Using Z-Score

For databases that do not support PERCENTILE_CONT(), an alternative is to use Z-score to detect outliers:

SELECT *, 
       (amount - (SELECT AVG(amount) FROM sales)) / (SELECT STDDEV(amount) FROM sales) AS z_score
FROM sales
WHERE ABS((amount - (SELECT AVG(amount) FROM sales)) / (SELECT STDDEV(amount) FROM sales)) > 3;

🔹 How this works:

  • Calculates how far each value is from the mean in terms of standard deviation.
  • If |z_score| > 3, the value is considered an outlier.

Final Thoughts

To detect and remove outliers in SQL:

  • Use IQR method if your database supports PERCENTILE_CONT().
  • Use Z-score method if you don’t have percentile functions.
  • Use DELETE to remove outliers permanently.

Question 17: How do you normalize data in SQL?

Understanding the Problem

Normalization is the process of scaling values within a dataset so they fall within a common range (e.g., 0 to 1 or -1 to 1). This is useful when:

  • Comparing values from different columns with different units (e.g., sales in dollars vs. number of transactions).
  • Improving machine learning models that rely on balanced numerical inputs.

The two most common normalization methods are:

  1. Min-Max Scaling (scales values between 0 and 1).
  2. Z-Score Normalization (scales values based on standard deviation).

Step-by-Step Approach to Normalizing Data in SQL

Step 1: Normalize Using Min-Max Scaling

The Min-Max formula is:

Normalized Value = (X − Min) / (Max − Min)

🔹 Goal: Scale values between 0 and 1.
🔹 Formula Breakdown:

  • X = actual value
  • Min = minimum value in the column
  • Max = maximum value in the column

🔹 SQL Query for Min-Max Normalization:

SELECT amount,
       (amount - (SELECT MIN(amount) FROM sales)) / 
       ((SELECT MAX(amount) FROM sales) - (SELECT MIN(amount) FROM sales)) AS normalized_amount
FROM sales;

Example Output (Before & After Normalization):

| amount | normalized_amount |
|---|---|
| 100 | 0.00 |
| 250 | 0.38 |
| 400 | 0.75 |
| 500 | 1.00 |

🔹 Explanation:

  • The lowest value (100) becomes 0.
  • The highest value (500) becomes 1.
  • Other values are scaled between 0 and 1.
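The same scaling in plain Python, using the four illustrative amounts from the table (250 scales to exactly 0.375):

```python
amounts = [100, 250, 400, 500]
lo, hi = min(amounts), max(amounts)
# Min-Max: (X - Min) / (Max - Min)
normalized = [(a - lo) / (hi - lo) for a in amounts]
print(normalized)  # [0.0, 0.375, 0.75, 1.0]
```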

Step 2: Normalize Using Z-Score Scaling

The Z-score formula is:

Z-Score = (X − μ) / σ

where:

  • X = actual value
  • μ (Mean) = average of the column
  • σ (Standard Deviation) = spread of values

🔹 SQL Query for Z-Score Normalization:

SELECT amount,
       (amount - (SELECT AVG(amount) FROM sales)) / (SELECT STDDEV(amount) FROM sales) AS z_score
FROM sales;

Example Output (Before & After Z-Score Normalization):

| amount | z_score |
|---|---|
| 100 | -1.2 |
| 250 | -0.4 |
| 400 | 0.5 |
| 500 | 1.1 |

🔹 Explanation:

  • Values are transformed based on how far they are from the mean.
  • Negative values are below the mean, and positive values are above the mean.
  • This method helps when the data follows a normal distribution.
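Checking the z-score table in plain Python with the sample standard deviation (statistics.stdev) reproduces the rounded values above:

```python
from statistics import mean, stdev

amounts = [100, 250, 400, 500]
mu, sigma = mean(amounts), stdev(amounts)  # sample standard deviation
z = [round((a - mu) / sigma, 1) for a in amounts]
print(z)  # [-1.2, -0.4, 0.5, 1.1]
```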

Step 3: Store Normalized Values in a New Column

If we want to save the normalized values permanently in the table:

ALTER TABLE sales ADD COLUMN normalized_amount FLOAT;

UPDATE sales 
SET normalized_amount = (amount - (SELECT MIN(amount) FROM sales)) / 
                        ((SELECT MAX(amount) FROM sales) - (SELECT MIN(amount) FROM sales));

🔹 What this does:

  • Adds a new column (normalized_amount).
  • Fills it with normalized values using Min-Max Scaling.

Final Thoughts

To normalize data in SQL:

  • Use Min-Max Scaling if you want values between 0 and 1.
  • Use Z-Score Normalization if you need mean-centered values.
  • Use UPDATE to store normalized values permanently.

Question 22: How do you join multiple tables in SQL?

Understanding the Problem

When working with databases, data is often stored in multiple tables. To analyze it, we need to combine related tables using SQL JOINs.

Common Types of Joins:

  1. INNER JOIN – Returns matching records from both tables.
  2. LEFT JOIN – Returns all records from the left table, and matching records from the right table (if any).
  3. RIGHT JOIN – Returns all records from the right table, and matching records from the left table (if any).
  4. FULL JOIN – Returns all records from both tables, filling in NULLs where no match exists.

Step-by-Step Approach to Joining Multiple Tables

Scenario:

We have two tables:

  1. Customers Table

    | customer_id | name | city |
    |---|---|---|
    | 1 | Alice | NY |
    | 2 | Bob | LA |
    | 3 | John | SF |

  2. Orders Table

    | order_id | customer_id | amount |
    |---|---|---|
    | 101 | 1 | 500 |
    | 102 | 2 | 300 |
    | 103 | 4 | 200 |

Step 1: Use INNER JOIN (Match Only Common Data)

If we want to get only customers who have placed orders, we use INNER JOIN:

SELECT customers.name, orders.order_id, orders.amount
FROM customers
INNER JOIN orders ON customers.customer_id = orders.customer_id;

Result:

| name | order_id | amount |
|---|---|---|
| Alice | 101 | 500 |
| Bob | 102 | 300 |

🔹 Explanation:

  • The query joins customers and orders on customer_id.
  • Since customer_id = 3 (John) has no orders, he is not included.
  • Since customer_id = 4 (order 103) has no matching customer, it is not included.

Step 2: Use LEFT JOIN (Include All Customers, Even If No Orders)

If we want to keep all customers, even if they have no orders, we use LEFT JOIN:

SELECT customers.name, orders.order_id, orders.amount
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;

Result:

| name | order_id | amount |
|---|---|---|
| Alice | 101 | 500 |
| Bob | 102 | 300 |
| John | NULL | NULL |

🔹 Explanation:

  • All customers are included.
  • John has no orders, so order_id and amount show NULL.

Step 3: Use RIGHT JOIN (Include All Orders, Even If No Customer Exists)

If we want to keep all orders, even if the customer does not exist, we use RIGHT JOIN:

SELECT customers.name, orders.order_id, orders.amount
FROM customers
RIGHT JOIN orders ON customers.customer_id = orders.customer_id;

Result:

| name | order_id | amount |
|---|---|---|
| Alice | 101 | 500 |
| Bob | 102 | 300 |
| NULL | 103 | 200 |

🔹 Explanation:

  • All orders are included.
  • The order with customer_id = 4 has no matching customer, so name = NULL.

Step 4: Use FULL JOIN (Include All Customers and All Orders)

If we want all customers and all orders, whether they have matches or not, we use FULL JOIN:

SELECT customers.name, orders.order_id, orders.amount
FROM customers
FULL JOIN orders ON customers.customer_id = orders.customer_id;

Result:

| name | order_id | amount |
|---|---|---|
| Alice | 101 | 500 |
| Bob | 102 | 300 |
| John | NULL | NULL |
| NULL | 103 | 200 |

🔹 Explanation:

  • Includes all customers (even if they have no orders).
  • Includes all orders (even if no customer exists).
  • NULL appears where no match exists.

Step 5: Joining More Than Two Tables

If we have a third table (e.g., payments), we can join multiple tables:

  1. Payments Table

    | payment_id | order_id | payment_status |
    |---|---|---|
    | 201 | 101 | Paid |
    | 202 | 102 | Pending |

🔹 SQL Query to Join Three Tables:

SELECT customers.name, orders.order_id, orders.amount, payments.payment_status
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
LEFT JOIN payments ON orders.order_id = payments.order_id;

Result:

| name | order_id | amount | payment_status |
|---|---|---|---|
| Alice | 101 | 500 | Paid |
| Bob | 102 | 300 | Pending |
| John | NULL | NULL | NULL |

🔹 Explanation:

  • Customers without orders show NULL.
  • Orders without payments show NULL in payment_status.

Final Thoughts

To join multiple tables in SQL:

  • Use INNER JOIN if you only want matching records.
  • Use LEFT JOIN if you want all records from the first table.
  • Use RIGHT JOIN if you want all records from the second table.
  • Use FULL JOIN if you want everything from both tables.
  • For 3+ tables, keep joining one table at a time.

Question 24: How do you optimize SQL queries for large datasets?

Understanding the Problem

When working with large datasets, slow queries can impact performance and increase processing time. SQL optimization techniques help retrieve data faster and more efficiently.


Step-by-Step Guide to Optimizing SQL Queries

Step 1: Use Indexes for Faster Searching

🔹 Problem:
When searching for data using WHERE or JOIN, SQL scans the entire table, making queries slow.

🔹 Solution:
Indexes speed up searches by creating a structured lookup system.

🔹 Example: Without Index (Slow Query)

SELECT * FROM sales WHERE customer_id = 123;

  • If customer_id is not indexed, the database scans every row.
  • This is slow for large tables.

🔹 Solution: Create an Index

CREATE INDEX idx_customer_id ON sales(customer_id);

  • Now, SQL can quickly find customer_id = 123 without scanning all rows.

Step 2: Avoid SELECT * (Select Only Required Columns)

🔹 Problem:
Using SELECT * retrieves all columns, even if you only need a few.

🔹 Solution:
Only select necessary columns to reduce the amount of data being processed.

🔹 Bad Query (Slow)

SELECT * FROM sales;

  • Retrieves all columns, even if we only need customer_id and amount.

🔹 Optimized Query (Faster)

SELECT customer_id, amount FROM sales;

  • Fetches only the required columns, improving performance.

Step 3: Use WHERE Instead of HAVING

🔹 Problem:
HAVING filters results after aggregation, making queries slow.

🔹 Solution:
Use WHERE to filter before aggregation.

🔹 Bad Query (Slow)

SELECT customer_id, SUM(amount)
FROM sales
GROUP BY customer_id
HAVING customer_id = 123;

  • HAVING processes all rows first, then filters after aggregation.

🔹 Optimized Query (Faster)

SELECT customer_id, SUM(amount)
FROM sales
WHERE customer_id = 123
GROUP BY customer_id;

  • WHERE filters before aggregation, reducing data to process.

Step 4: Use JOINS Instead of Subqueries

🔹 Problem:
Subqueries run multiple times, increasing execution time.

🔹 Solution:
Use JOINs instead of subqueries for better performance.

🔹 Bad Query (Slow)

SELECT name, (SELECT COUNT(*) FROM orders WHERE customers.customer_id = orders.customer_id) AS order_count
FROM customers;

  • The subquery runs for every customer, slowing down performance.

🔹 Optimized Query (Faster)

SELECT customers.name, COUNT(orders.order_id) AS order_count
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
GROUP BY customers.name;

  • Uses JOIN instead of a subquery, making it much faster.

Step 5: Use EXPLAIN PLAN to Analyze Queries

🔹 Problem:
You don’t know why a query is slow.

🔹 Solution:
Use EXPLAIN or EXPLAIN ANALYZE to see how SQL executes the query.

🔹 Example:

EXPLAIN ANALYZE SELECT * FROM sales WHERE customer_id = 123;

  • Shows if indexes are used.
  • Identifies slow operations like full table scans.

Step 6: Use LIMIT to Reduce Load

🔹 Problem:
Fetching millions of records at once slows down queries.

🔹 Solution:
Use LIMIT to get only needed rows.

🔹 Bad Query (Slow)

SELECT * FROM sales;

  • Returns all rows, even if only a few are needed.

🔹 Optimized Query (Faster)

SELECT * FROM sales LIMIT 1000;

  • Limits the output to 1000 rows, reducing load.

Step 7: Partition Large Tables

🔹 Problem:
Queries on very large tables take too long.

🔹 Solution:
Partition tables into smaller chunks for faster access.

🔹 Example: Partition a sales table by year:

CREATE TABLE sales_2023 PARTITION OF sales 
FOR VALUES FROM ('2023-01-01') TO ('2023-12-31');

  • Queries on sales_2023 run faster than scanning the full sales table.

Final Thoughts

To optimize SQL queries for large datasets:

  • Use Indexes (CREATE INDEX) for faster searches.
  • Select only required columns instead of SELECT *.
  • Filter with WHERE instead of HAVING.
  • Use JOINS instead of subqueries.
  • Analyze with EXPLAIN PLAN to find bottlenecks.
  • Use LIMIT to reduce data load.
  • Partition tables to improve query speed.

Here’s a step-by-step explanation of questions 26 to 30, related to SQL for Data Analysts.


Question 26: How do you remove duplicate records from a table in SQL?

Understanding the Problem

Duplicate records can occur due to data entry errors, incorrect imports, or multiple joins. Removing duplicates ensures data accuracy.

Step 1: Identify Duplicates

To check for duplicates, use the GROUP BY clause and COUNT(*):

SELECT customer_id, COUNT(*)
FROM sales
GROUP BY customer_id
HAVING COUNT(*) > 1;

✅ This query shows customers appearing more than once.


Step 2: Remove Duplicates Using DELETE with ROW_NUMBER()

We use the ROW_NUMBER() function to assign a unique number to each row.

WITH CTE AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS row_num
    FROM sales
)
DELETE FROM CTE WHERE row_num > 1;

Note: deleting through a CTE like this is SQL Server syntax; in PostgreSQL or MySQL, delete by primary key instead (e.g., WHERE id NOT IN (SELECT MIN(id) FROM sales GROUP BY customer_id)).

Explanation:

  • The CTE (Common Table Expression) assigns a row number to each duplicate.
  • The DELETE statement removes duplicates, keeping only the first occurrence.

Alternative Approach: Use DISTINCT to Remove Duplicates When Selecting Data

If you only need unique results without modifying the table:

SELECT DISTINCT customer_id, order_date, amount
FROM sales;

This does not remove duplicates from the table, but eliminates them in the output.


Question 27: How do you handle NULL values in SQL?

Understanding the Problem

NULL values represent missing or unknown data. Handling them properly is important for accurate analysis.

Step 1: Replace NULL Values Using COALESCE()

If we want to replace NULLs with a default value:

SELECT customer_id, COALESCE(amount, 0) AS amount
FROM sales;

Explanation:

  • COALESCE(amount, 0) replaces NULL values with 0.

Step 2: Use IFNULL() or NVL() (Database-Specific)

  • In MySQL: SELECT customer_id, IFNULL(amount, 0) AS amount FROM sales;
  • In Oracle: SELECT customer_id, NVL(amount, 0) AS amount FROM sales;

Step 3: Filter Out NULL Values

If we want to exclude NULL values, use WHERE:

SELECT * FROM sales WHERE amount IS NOT NULL;

This removes rows where amount is NULL.


Question 28: What is a Common Table Expression (CTE) and when should you use it?

Understanding the Problem

A Common Table Expression (CTE) is a temporary result set used within a SQL query. It makes complex queries more readable and reusable.

Step 1: Define a CTE Using WITH Statement

Example: Get total sales per customer using a CTE:

WITH TotalSales AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales
    GROUP BY customer_id
)
SELECT * FROM TotalSales;

Explanation:

  • TotalSales is a temporary table holding aggregated results.
  • The SELECT query retrieves results from the CTE.

Step 2: Why Use a CTE?

  • Improves Readability: Breaks complex queries into smaller steps.
  • Reusable: The CTE can be used multiple times in the query.
  • Recursive Queries: CTEs allow self-referencing, useful for hierarchical data.

Example: Recursive CTE to find all employees under a manager:

WITH EmployeeHierarchy AS (
    SELECT employee_id, manager_id, name FROM employees WHERE manager_id = 1
    UNION ALL
    SELECT e.employee_id, e.manager_id, e.name
    FROM employees e
    JOIN EmployeeHierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM EmployeeHierarchy;

Useful for hierarchical structures like reporting managers.
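SQLite also supports recursive CTEs (via WITH RECURSIVE), so the hierarchy query can be tried locally; the employee rows below are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees(employee_id INTEGER, manager_id INTEGER, name TEXT);
INSERT INTO employees VALUES
  (1, NULL, 'CEO'), (2, 1, 'Ana'), (3, 2, 'Raj'), (4, 1, 'Li');
""")

rows = con.execute("""
WITH RECURSIVE EmployeeHierarchy AS (
    SELECT employee_id, manager_id, name FROM employees WHERE manager_id = 1
    UNION ALL
    SELECT e.employee_id, e.manager_id, e.name
    FROM employees e
    JOIN EmployeeHierarchy eh ON e.manager_id = eh.employee_id
)
SELECT name FROM EmployeeHierarchy ORDER BY employee_id
""").fetchall()
print([r[0] for r in rows])  # ['Ana', 'Raj', 'Li']
```

The seed row picks direct reports of manager 1 (Ana, Li); the recursive step then pulls in Raj, who reports to Ana.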


Question 29: What is the difference between UNION and UNION ALL?

Understanding the Problem

Both UNION and UNION ALL combine results from multiple queries, but they work differently.

Step 1: Using UNION (Removes Duplicates)

Example:

SELECT customer_id FROM sales
UNION
SELECT customer_id FROM orders;

Explanation:

  • Combines both tables.
  • Removes duplicate customer_id values.

Step 2: Using UNION ALL (Keeps Duplicates)

Example:

SELECT customer_id FROM sales
UNION ALL
SELECT customer_id FROM orders;

Explanation:

  • Combines both tables.
  • Keeps duplicates (faster performance).

Key Differences

| Feature | UNION | UNION ALL |
|---|---|---|
| Removes Duplicates? | ✅ Yes | ❌ No |
| Performance | Slower | Faster |
| Use Case | When you need unique values | When duplicates are allowed |
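The row-count difference is easy to demonstrate in sqlite3 (customer IDs are made up; ID 2 appears in both tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(customer_id INTEGER);
CREATE TABLE orders(customer_id INTEGER);
INSERT INTO sales VALUES (1), (2);
INSERT INTO orders VALUES (2), (3);
""")

# UNION deduplicates the shared ID; UNION ALL keeps both copies.
u = con.execute("SELECT customer_id FROM sales UNION "
                "SELECT customer_id FROM orders").fetchall()
ua = con.execute("SELECT customer_id FROM sales UNION ALL "
                 "SELECT customer_id FROM orders").fetchall()
print(len(u), len(ua))  # 3 4
```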

Question 30: How do you use the CASE statement in SQL?

Understanding the Problem

The CASE statement works like an IF-ELSE condition in SQL, allowing conditional logic in queries.


Step 1: Simple CASE Statement Example

If we want to categorize sales based on amount:

SELECT order_id, amount,
    CASE 
        WHEN amount > 500 THEN 'High Value'
        WHEN amount BETWEEN 200 AND 500 THEN 'Medium Value'
        ELSE 'Low Value'
    END AS order_category
FROM sales;

Explanation:

  • If amount > 500, it’s labeled “High Value”.
  • If amount is between 200 and 500, it’s “Medium Value”.
  • Otherwise, it’s “Low Value”.

Step 2: Using CASE in Aggregation

We can count how many orders fall into each category:

SELECT 
    CASE 
        WHEN amount > 500 THEN 'High Value'
        WHEN amount BETWEEN 200 AND 500 THEN 'Medium Value'
        ELSE 'Low Value'
    END AS order_category,
    COUNT(*) AS total_orders
FROM sales
GROUP BY order_category;

Explanation:

  • Groups data into High, Medium, Low Value categories.
  • Counts the number of orders in each category.
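Running the categorization end-to-end in sqlite3 (one invented order per bucket):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales(order_id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 600), (2, 300), (3, 100)])

rows = con.execute("""
SELECT CASE WHEN amount > 500 THEN 'High Value'
            WHEN amount BETWEEN 200 AND 500 THEN 'Medium Value'
            ELSE 'Low Value'
       END AS order_category,
       COUNT(*) AS total_orders
FROM sales
GROUP BY order_category
ORDER BY order_category
""").fetchall()
print(rows)  # [('High Value', 1), ('Low Value', 1), ('Medium Value', 1)]
```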

Final Thoughts

To summarize:

  • Removing Duplicates → Use ROW_NUMBER() and DELETE or DISTINCT.
  • Handling NULLs → Use COALESCE(), IFNULL(), NVL(), or IS NOT NULL.
  • Using CTEs → Use WITH for better readability and recursive queries.
  • Difference Between UNION and UNION ALL → UNION removes duplicates; UNION ALL keeps them.
  • Using CASE Statements → Add conditional logic to queries.

How to Handle Outliers in Data?

Understanding the Problem

Outliers are extreme values that differ significantly from other observations. They can skew analysis, affect machine learning models, and impact business decisions.


Step 1: Identify Outliers in a Dataset

Let’s consider a real-world example: Monthly Sales Data

| Month | Sales ($) |
|---|---|
| Jan | 5,000 |
| Feb | 5,200 |
| Mar | 5,100 |
| Apr | 5,300 |
| May | 50,000 |
| Jun | 5,400 |
| Jul | 5,250 |
| Aug | 5,150 |

📌 Notice that May (50,000) is much higher than the rest. This might be an outlier.


Step 2: Detect Outliers Using Statistics

Method 1: Using the Interquartile Range (IQR)

  • IQR formula:
    IQR = Q3 - Q1
    Lower Bound = Q1 - 1.5 * IQR
    Upper Bound = Q3 + 1.5 * IQR
  • Steps to Find Outliers:
    1. Arrange data in order: 5,000, 5,100, 5,150, 5,200, 5,250, 5,300, 5,400, 50,000
    2. Find Q1 (25th percentile): 5,150
    3. Find Q3 (75th percentile): 5,300
    4. Calculate IQR: 5,300 - 5,150 = 150
    5. Compute bounds:
      • Lower Bound = 5,150 - (1.5 × 150) = 4,925
      • Upper Bound = 5,300 + (1.5 × 150) = 5,525
    6. Find Outliers: Any value outside 4,925 – 5,525 is an outlier.
    7. May (50,000) is an outlier because it is much higher than 5,525.
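The arithmetic above can be replayed in a few lines of Python, using the quartile values stated in the walkthrough:

```python
# Monthly sales from the worked example; Q1/Q3 as stated in the steps above.
sales = [5000, 5200, 5100, 5300, 50000, 5400, 5250, 5150]
q1, q3 = 5150, 5300
iqr = q3 - q1                                   # 150
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 4925.0, 5525.0
outliers = [s for s in sales if s < lower or s > upper]
print(outliers)  # [50000]
```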

Method 2: Using Z-Score (Standard Deviation)

  • Formula: Z-score = (Value - Mean) / Standard Deviation
  • If Z-score > 3 or < -3, the data point is an outlier.

📌 Using SQL to Detect Outliers:

SELECT sales, 
       (sales - AVG(sales) OVER ()) / STDDEV(sales) OVER () AS z_score
FROM sales_data;

✅ If |z_score| > 3, it’s an outlier.


Step 3: Handling Outliers

Once identified, we have three ways to handle them:

Option 1: Remove the Outlier (If It’s an Error)

🔹 If the outlier is a data entry mistake, we can remove it:

DELETE FROM sales_data WHERE sales > 5525;

✅ Use this only if you are sure the outlier is an error.


Option 2: Replace Outliers (Capping/Trimming)

🔹 If the outlier is valid but extreme, we replace it with a reasonable value:

UPDATE sales_data 
SET sales = 5525 
WHERE sales > 5525;

This method is useful when outliers may impact analysis but are not errors.


Option 3: Transform Data (Log Transformation)

🔹 If data has a right-skewed distribution (e.g., salaries, prices), we apply a log transformation:

SELECT sales, LOG(sales) AS log_sales FROM sales_data;

✅ This reduces the impact of large outliers.


Final Thoughts

How to Handle Outliers Based on Context?

| Situation | Action |
|---|---|
| Data entry mistake (e.g., wrong number) | Remove it |
| Real but extreme value (e.g., a big sale) | Cap it to an upper bound |
| Skewed data (e.g., income, house prices) | Use log transformation |
| Need all values (e.g., fraud detection) | Keep the outlier for analysis |


TAGS: #DataAnalystInterview #SQLInterviewQuestions #DataScienceJobs #DataAnalysis #SQLForDataAnalysts #DataCleaning #DataPreprocessing #MachineLearning #ETLInterviewQuestions #OutlierDetection #DataFiltering #DataHandling #SQLQueries #DataVisualization #BigData
