1. SQL and Databases
2. Python for Data Engineering
3. Data Modeling and Warehousing
4. Version Control and CI/CD (Basic Awareness)

Question

Anonymous · Accepted Answer

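1. SQL and Databases
INNER JOIN vs LEFT JOIN: INNER JOIN returns only the rows that have a match in both tables; LEFT JOIN returns every row from the left table plus the matching right-table rows, with NULLs in the right-table columns where there is no match.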
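
To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their rows are made up for illustration only.

```python
import sqlite3

# Hypothetical sample data: Chen is a customer with no orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 2, 75.0);
""")

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

# LEFT JOIN: every customer; order columns are NULL (None in Python) when unmatched.
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

print(inner_rows)  # [('Asha', 20.0), ('Asha', 50.0), ('Ben', 75.0)]
print(left_rows)   # [('Asha', 20.0), ('Asha', 50.0), ('Ben', 75.0), ('Chen', None)]
```

Chen shows up only in the LEFT JOIN result, with None standing in for the missing order amount.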

Find duplicates: Use GROUP BY with HAVING COUNT(*) > 1.
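
A small sketch of the GROUP BY / HAVING pattern, again with sqlite3. The users table and the email column are assumptions chosen for the example.

```python
import sqlite3

# Hypothetical table in which some email addresses repeat.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users (email) VALUES
        ('a@example.com'), ('b@example.com'), ('a@example.com'),
        ('c@example.com'), ('a@example.com'), ('b@example.com');
""")

# Group by the column that should be unique and keep groups with more than one row.
duplicates = conn.execute("""
    SELECT email, COUNT(*) AS occurrences
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
""").fetchall()

print(duplicates)  # [('a@example.com', 3), ('b@example.com', 2)]
```

To check for duplicates across several columns, list all of them in both the SELECT and the GROUP BY.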

Normalization: Process of reducing data redundancy by organizing data into related tables.

Second highest salary: Take MAX(salary) over the rows whose salary is below the overall maximum, i.e. WHERE salary < (SELECT MAX(salary) FROM the same table).
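
Written out in full, the query could look like the sketch below. The employees table and its rows are hypothetical; two people share the top salary to show that ties are handled.

```python
import sqlite3

# Hypothetical data with a tie for the highest salary.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Asha', 90000), ('Ben', 120000), ('Chen', 120000), ('Dee', 75000);
""")

# The subquery finds the top salary; the outer MAX picks the largest value below it.
second_highest = conn.execute("""
    SELECT MAX(salary)
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]

print(second_highest)  # 90000
```

If the table has fewer than two distinct salaries, the outer MAX has no rows to work on and the result is NULL (None in Python).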

Index: A database object that speeds up data retrieval on the indexed columns, at the cost of extra storage and slower writes.
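
One way to see an index at work is to compare the query plan before and after creating it. This is only a sketch on SQLite with a made-up employees table and index name; the exact plan wording differs between SQLite versions and between database engines.

```python
import sqlite3

# Hypothetical table with 5000 rows spread over 50 departments.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO employees (dept, name) VALUES (?, ?)",
    [(f"dept_{i % 50}", f"employee_{i}") for i in range(5000)],
)

query = "SELECT name FROM employees WHERE dept = 'dept_7'"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)  # without an index SQLite has to scan the table
conn.execute("CREATE INDEX idx_employees_dept ON employees (dept)")
plan_after = plan(query)   # the plan now mentions idx_employees_dept

print(plan_before)
print(plan_after)
```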

2. Python for Data Engineering
Read large CSV: Use pandas.read_csv() with chunksize for memory efficiency.
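
A minimal sketch of chunked reading. An in-memory string stands in for a large file here, and the column names are made up; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Stand-in for a large CSV file: 10 rows with hypothetical columns.
csv_data = io.StringIO(
    "order_id,amount\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1, 11))
)

total = 0.0
rows_seen = 0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["amount"].sum()
    rows_seen += len(chunk)

print(rows_seen, total)  # 10 82.5
```

This works well for aggregations that can be built up chunk by chunk (sums, counts, filtered writes). Operations that need the whole dataset at once, such as a global sort, need a different approach.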

List vs Tuple: Lists are mutable; tuples are immutable.
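
A quick illustration with made-up values:

```python
# Lists can be changed in place.
columns = ["id", "name"]
columns.append("email")

# Tuples cannot: item assignment raises TypeError.
point = (3, 4)
try:
    point[0] = 10
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

# Tuples of hashable items are hashable, so they can be used as dict keys.
distances = {(0, 0): 0.0, point: 5.0}

print(columns)             # ['id', 'name', 'email']
print(tuple_was_modified)  # False
print(distances[(3, 4)])   # 5.0
```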

Handle missing values: Use dropna() to remove them or fillna() to replace them in pandas.
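
A short sketch of both options on a made-up DataFrame. Filling the numeric column with its mean is just one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing city and one missing temperature.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "temp": [31.0, np.nan, 28.0, 33.0],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill each column with a value that suits it.
filled = df.fillna({"city": "unknown", "temp": df["temp"].mean()})

print(len(dropped))              # 2
print(filled.loc[2, "city"])     # unknown
print(filled.isna().sum().sum()) # 0
```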

pandas vs numpy: pandas is for tabular data; numpy is for numerical arrays.
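
As a rough illustration with invented data, numpy handles array arithmetic while pandas adds labels and table-style operations on top:

```python
import numpy as np
import pandas as pd

# numpy: a homogeneous numeric array with element-wise math.
readings = np.array([[1.0, 2.0], [3.0, 4.0]])
scaled = readings * 10             # applies to every element
col_means = readings.mean(axis=0)  # mean of each column

# pandas: labeled columns of mixed types, with grouping and aggregation.
sales = pd.DataFrame({"region": ["north", "south", "north"], "units": [5, 3, 7]})
units_by_region = sales.groupby("region")["units"].sum()

print(col_means.tolist())        # [2.0, 3.0]
print(units_by_region["north"])  # 12
```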

Simple pipeline: Define extract(), transform(), load() functions in Python.
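
The three functions could be sketched like this using only the standard library. The CSV content, the cleaning rules, and the payments table are assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one row has stray whitespace, one has no amount.
RAW_CSV = "name,amount\nasha, 10\nben,\nchen,7.5\n"

def extract(source):
    """Read raw rows from a CSV source as dictionaries."""
    return list(csv.DictReader(source))

def transform(rows):
    """Drop rows without an amount, tidy names, and cast amounts to float."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((row["name"].strip().title(), float(amount)))
    return cleaned

def load(rows, conn):
    """Write the cleaned rows into a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(io.StringIO(RAW_CSV))), conn)

loaded = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
print(loaded)  # [('Asha', 10.0), ('Chen', 7.5)]
```

Keeping each stage as its own function makes it easy to test them separately and to swap the source or the target later.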

3. Data Modeling and Warehousing
Star vs Snowflake schema: Star has denormalized dimensions; snowflake has normalized ones.

Fact vs Dimension table: Fact tables store measurable events; dimension tables store descriptive attributes.
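
A tiny star schema, sketched in sqlite3 with invented table and column names, shows how the pieces fit together: one fact table in the middle, with keys pointing to the dimension tables around it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes. Keeping category directly on
    -- dim_product is the denormalized (star) style; a snowflake design would
    -- move it into a separate category table.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);

    -- Fact table: one row per measurable event, with keys into the dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        quantity INTEGER,
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date VALUES
        (20240101, '2024-01-01', '2024-01'), (20240201, '2024-02-01', '2024-02');
    INSERT INTO fact_sales VALUES
        (1, 20240101, 2, 2000.0), (2, 20240101, 1, 300.0), (1, 20240201, 1, 1000.0);
""")

# A typical query: aggregate a fact measure, grouped by a dimension attribute.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(revenue_by_category)  # [('Electronics', 3000.0), ('Furniture', 300.0)]
```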

SCD types: Slowly Changing Dimension (SCD) Type 1 overwrites the old value; Type 2 keeps history by adding a new row for each change.
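
The two behaviours can be sketched in plain Python on a list of dictionaries. The column names (valid_from, valid_to, is_current) are a common convention but are assumptions here, not a fixed standard.

```python
from datetime import date

def apply_scd1(rows, customer_id, new_city):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city

def apply_scd2(rows, customer_id, new_city, change_date):
    """Type 2: close the current row and append a new current row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    rows.append({
        "customer_id": customer_id, "city": new_city,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })

# Hypothetical customer who moves from Pune to Delhi.
type1 = [{"customer_id": 1, "city": "Pune"}]
apply_scd1(type1, 1, "Delhi")

type2 = [{"customer_id": 1, "city": "Pune", "valid_from": date(2023, 1, 1),
          "valid_to": None, "is_current": True}]
apply_scd2(type2, 1, "Delhi", date(2024, 6, 1))

print(len(type1), type1[0]["city"])        # 1 Delhi
print(len(type2), [r["city"] for r in type2])  # 2 ['Pune', 'Delhi']
```

After the Type 2 update the Pune row is still there, marked as no longer current, so reports can still see where the customer lived at any past date.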

Surrogate key: A system-generated unique ID used instead of natural keys.
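
For example, in SQLite an auto-incrementing integer column can serve as the surrogate key. The dim_customer table and the use of email as the natural key are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_email TEXT,                             -- natural (business) key
        city TEXT
    )
""")

# The same natural key can appear in more than one row (e.g. two versions of a
# customer), while the database assigns each row its own surrogate key.
conn.executemany(
    "INSERT INTO dim_customer (customer_email, city) VALUES (?, ?)",
    [("asha@example.com", "Pune"), ("asha@example.com", "Delhi")],
)

keys = [row[0] for row in conn.execute(
    "SELECT customer_key FROM dim_customer ORDER BY customer_key"
)]
print(keys)  # [1, 2]
```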

Why data modeling: Ensures consistency, scalability, and performance in data systems.

4. Version Control and CI/CD (Basic)
What is Git: A distributed version control system to track code changes.

Why use Git: To collaborate and manage changes in codebases effectively.

requirements.txt: Lists the Python packages (optionally with pinned versions) that a project needs, typically installed with pip install -r requirements.txt.

What is CI/CD: Continuous Integration and Continuous Delivery (or Deployment), which automate building, testing, and deploying code changes.

Why CI/CD matters: Ensures reliable, fast, and automated delivery of code to production.

ANZ