1. SQL and Databases 2. Python for Data Engineering 3. Data Modeling and Warehousing 4. Version Control and CI/CD (Basic Awareness)
Anonymous
1. SQL and Databases
INNER JOIN vs LEFT JOIN: INNER JOIN returns only rows that have a match in both tables; LEFT JOIN returns every row from the left table plus the matching right-table rows, with NULLs where there is no match.
Find duplicates: GROUP BY the columns that should be unique and filter with HAVING COUNT(*) > 1.
Normalization: The process of reducing data redundancy by organizing data into related tables.
Second highest salary: SELECT MAX(salary) FROM the table WHERE salary < (SELECT MAX(salary) FROM the same table).
Index: A database object that speeds up data retrieval, at the cost of extra storage and slower writes.

2. Python for Data Engineering
Read large CSV: Use pandas.read_csv() with chunksize so the file is processed in pieces instead of loaded into memory at once.
List vs Tuple: Lists are mutable; tuples are immutable.
Handle missing values: Use dropna() to remove them or fillna() to replace them in pandas.
pandas vs numpy: pandas is for labeled, tabular data; numpy is for numerical arrays.
Simple pipeline: Define extract(), transform(), and load() functions in Python and call them in sequence.

3. Data Modeling and Warehousing
Star vs Snowflake schema: A star schema has denormalized dimensions; a snowflake schema has normalized ones.
Fact vs Dimension table: Fact tables store measurable events; dimension tables store descriptive attributes.
SCD types: Slowly changing dimension (SCD) Type 1 overwrites the old value; Type 2 keeps history by adding a new row for each change.
Surrogate key: A system-generated unique ID used instead of natural keys.
Why data modeling: Ensures consistency, scalability, and performance in data systems.

4. Version Control and CI/CD (Basic)
What is Git: A distributed version control system that tracks code changes.
Why use Git: To collaborate and manage changes in codebases effectively.
requirements.txt: Lists the Python packages a project needs, usually with pinned versions.
What is CI/CD: Continuous Integration/Continuous Deployment automates testing and deployment.
Why CI/CD matters: Ensures reliable, fast, and automated delivery of code to production.
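The SQL answers in section 1 can be tried out with Python's built-in sqlite3 module. This is only a sketch: the employees and departments tables, their columns, and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical in-memory tables; names and rows are invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, dept_id INTEGER
    );
    INSERT INTO departments VALUES (1, 'Data'), (2, 'Platform');
    INSERT INTO employees VALUES
        (1, 'Ana', 100, 1),
        (2, 'Ben', 120, 1),
        (3, 'Cy', 120, 2),
        (4, 'Ana', 90, NULL);
""")

# INNER JOIN: employee 4 has no department, so it is dropped.
inner_rows = conn.execute("""
    SELECT e.id, d.name FROM employees e
    INNER JOIN departments d ON d.id = e.dept_id
    ORDER BY e.id
""").fetchall()

# LEFT JOIN: employee 4 is kept, with NULL (None) for the department name.
left_rows = conn.execute("""
    SELECT e.id, d.name FROM employees e
    LEFT JOIN departments d ON d.id = e.dept_id
    ORDER BY e.id
""").fetchall()

# Duplicates: names that appear more than once.
duplicate_names = conn.execute("""
    SELECT name, COUNT(*) FROM employees
    GROUP BY name HAVING COUNT(*) > 1
""").fetchall()

# Second highest salary: the largest salary strictly below the maximum.
second_highest = conn.execute("""
    SELECT MAX(salary) FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]

# Index: speeds up lookups and joins on dept_id.
conn.execute("CREATE INDEX idx_employees_dept ON employees (dept_id)")
```

Note that two employees share the top salary here, and the subquery approach still returns the next distinct value because it compares with a strict less-than.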
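For section 2, here is a minimal sketch of an extract/transform/load pipeline that reads a CSV in chunks and fills missing values. The column names, the sample data, and the choice to fill with 0.0 are assumptions; a real job would read from a file path and write to a database or file.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large CSV file on disk.
SAMPLE_CSV = "id,amount\n1,10.0\n2,\n3,5.5\n4,\n5,4.5\n"


def extract(source, chunksize=2):
    """Yield DataFrame chunks so the whole file never sits in memory at once."""
    yield from pd.read_csv(source, chunksize=chunksize)


def transform(chunk):
    """Fill missing amounts with 0.0; dropna() would discard those rows instead."""
    return chunk.fillna({"amount": 0.0})


def load(chunks):
    """Stand-in for a real sink: concatenate the cleaned chunks into one frame."""
    return pd.concat(list(chunks), ignore_index=True)


result = load(transform(chunk) for chunk in extract(io.StringIO(SAMPLE_CSV)))
```

The chunksize of 2 is tiny on purpose so the chunking is visible; for a real large file a value in the tens or hundreds of thousands of rows is more typical.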
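The SCD and surrogate key answers in section 3 can be illustrated in plain Python. This is a simplified sketch, not a warehouse implementation: the customer dimension, the customer_sk, valid_from, valid_to, and is_current columns, and both helper functions are hypothetical.

```python
import copy
from datetime import date


def apply_scd1(dim, natural_key, new_attrs):
    """Type 1: overwrite the attributes in place, so the old value is lost."""
    for row in dim:
        if row["customer_id"] == natural_key:
            row.update(new_attrs)
    return dim


def apply_scd2(dim, natural_key, new_attrs, effective):
    """Type 2: close the current row and append a new version of it."""
    current = next(
        r for r in dim if r["customer_id"] == natural_key and r["is_current"]
    )
    current["is_current"] = False
    current["valid_to"] = effective
    new_row = {
        **current,
        **new_attrs,
        # New surrogate key: system-generated, independent of the natural key.
        "customer_sk": max(r["customer_sk"] for r in dim) + 1,
        "valid_from": effective,
        "valid_to": None,
        "is_current": True,
    }
    dim.append(new_row)
    return dim


# Hypothetical one-row customer dimension.
dim = [{
    "customer_sk": 1, "customer_id": "C1", "city": "Pune",
    "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True,
}]

scd1 = apply_scd1(copy.deepcopy(dim), "C1", {"city": "Delhi"})
scd2 = apply_scd2(copy.deepcopy(dim), "C1", {"city": "Delhi"}, date(2024, 6, 1))
```

After the Type 1 update there is still one row, now showing Delhi. After the Type 2 update there are two rows for the same natural key C1, each with its own surrogate key, so queries can still see that the customer used to be in Pune.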