Role Proficiency:
This role requires proficiency in data pipeline development, including coding and testing data pipelines for ingesting, wrangling, transforming, and joining data from various sources. Must be adept at using ETL tools such as Informatica, Glue, Databricks, and DataProc, with coding skills in Python, PySpark, and SQL. Works independently and demonstrates proficiency in at least one data-related domain, with a solid understanding of SCD (slowly changing dimension) concepts and data warehousing principles.
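For illustration of the SCD concepts mentioned above, a minimal SCD Type 2 sketch in PySpark; all table, column, and sample values below are invented for the example and are not part of the role definition:

```python
# Minimal SCD Type 2 sketch (illustrative only; names and data are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

# Current state of the dimension; is_current marks the active version per key.
dim = spark.createDataFrame(
    [(1, "a@old.com", "2023-01-01", None, True),
     (2, "b@ok.com",  "2023-01-01", None, True)],
    "customer_id BIGINT, email STRING, effective_date STRING, end_date STRING, is_current BOOLEAN",
)

# Incoming snapshot from the source system.
staging = spark.createDataFrame(
    [(1, "a@new.com"), (3, "c@new.com")],
    ["customer_id", "email"],
)

load_date = F.date_format(F.current_date(), "yyyy-MM-dd")  # string, matching the sample dates

current = dim.filter("is_current")
history = dim.filter(~F.col("is_current"))

# Keys whose tracked attribute changed in the incoming snapshot.
changed_ids = (current.alias("d")
               .join(staging.alias("s"), "customer_id")
               .filter(F.col("d.email") != F.col("s.email"))
               .select("customer_id"))

# Close the active version of each changed key.
expired = (current.join(changed_ids, "customer_id", "left_semi")
           .withColumn("end_date", load_date)
           .withColumn("is_current", F.lit(False)))

# Active rows that did not change stay as they are.
unchanged = current.join(changed_ids, "customer_id", "left_anti")

# New versions: changed keys plus keys never seen before.
new_keys = staging.join(current, "customer_id", "left_anti")
changed_rows = staging.join(changed_ids, "customer_id", "left_semi")
new_rows = (new_keys.unionByName(changed_rows)
            .withColumn("effective_date", load_date)
            .withColumn("end_date", F.lit(None).cast("string"))
            .withColumn("is_current", F.lit(True)))

result = (history.unionByName(unchanged)
          .unionByName(expired)
          .unionByName(new_rows))
result.orderBy("customer_id", "effective_date").show()
```

In practice this pattern is often expressed as a warehouse MERGE statement (or Delta Lake MERGE INTO) rather than hand-rolled unions; the sketch only shows the expire-and-insert logic behind SCD Type 2.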
Outcomes:
Collaborate closely with data analysts, data scientists, and other stakeholders to ensure data accessibility, quality, and security across various data sources. Design, develop, and maintain data pipelines that collect, process, and transform large volumes of data from various sources. Implement ETL (Extract, Transform, Load) processes to facilitate efficient data movement and transformation. Integrate data from multiple sources, including databases, APIs, cloud services, and third-party data providers. Establish data quality checks and validation procedures to ensure data accuracy, completeness, and consistency. Develop and manage data storage solutions, including relational databases, NoSQL databases, and data lakes. Stay updated on the latest trends and best practices in data engineering, cloud technologies, and big data tools.
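As an example of the data quality checks and validation procedures mentioned above, a minimal PySpark sketch; the orders dataset, its columns, and the specific rules are hypothetical:

```python
# Minimal data-quality validation sketch (illustrative; data and rules are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()

orders = spark.createDataFrame(
    [(1, "2024-01-05", 120.0), (2, "2024-01-06", None), (2, "2024-01-06", 80.0)],
    ["order_id", "order_date", "amount"],
)

total = orders.count()

checks = {
    # Completeness: no missing amounts.
    "no_null_amount": orders.filter(F.col("amount").isNull()).count() == 0,
    # Uniqueness: order_id should behave like a primary key.
    "unique_order_id": orders.select("order_id").distinct().count() == total,
    # Validity: amounts are non-negative.
    "non_negative_amount": orders.filter(F.col("amount") < 0).count() == 0,
    # Volume: the load is not empty.
    "non_empty_load": total > 0,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    # In a real pipeline this might raise, quarantine bad rows, or alert.
    print(f"Data quality checks failed: {failed}")
else:
    print("All data quality checks passed")
```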
Measures of Outcomes:
Adherence to engineering processes and standards. Adherence to schedule/timelines. Adherence to SLAs where applicable. Number of defects post-delivery. Number of non-compliance issues. Reduction in recurrence of known defects. Quick turnaround of production bugs. Completion of applicable technical/domain certifications. Completion of all mandatory training requirements. Efficiency improvements in data pipelines (e.g. reduced resource consumption, faster run times). Average time to detect, respond to, and resolve pipeline failures or data issues.
Outputs Expected:
Code Development:
Develop data processing code independently, ensuring it meets performance and scalability requirements.
Documentation:
Prepare documentation including source-target mappings, test cases, and results.
Configuration:
Testing:
Domain Relevance:
Understand domain-related data and formats, such as EDI formats.
Defect Management:
Fix and retest defects in accordance with project standards.
Estimation:
Estimate effort and resource dependencies for personal work.
Knowledge Management:
Consume and contribute to knowledge resources such as SharePoint, libraries, and client universities.
Design Understanding:
Certifications:
Skill Examples:
Proficiency in SQL, Python, or other programming languages used for data manipulation. Experience with ETL tools such as Apache Airflow, Talend, Informatica, AWS Glue, Dataproc, and Azure ADF. Hands-on experience with cloud platforms like AWS, Azure, or Google Cloud, particularly with data-related services (e.g. AWS Glue, BigQuery). Ability to conduct tests on data pipelines and evaluate results against data quality and performance specifications. Experience in performance tuning of data processes. Proficiency in querying data warehouses.
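For the pipeline-testing skill above, a hedged sketch of a unit test using pytest and a local SparkSession; the deduplicate_latest transform and its column names are hypothetical, not a prescribed project function:

```python
# Illustrative unit test for a pipeline transformation (names are hypothetical).
import pytest
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window


def deduplicate_latest(df, key_col, order_col):
    """Keep only the most recent row per key (example transform under test)."""
    w = Window.partitionBy(key_col).orderBy(F.col(order_col).desc())
    return (df.withColumn("_rn", F.row_number().over(w))
              .filter("_rn = 1")
              .drop("_rn"))


@pytest.fixture(scope="module")
def spark():
    # Local single-threaded session is enough for small fixture data.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_deduplicate_latest_keeps_newest_row(spark):
    df = spark.createDataFrame(
        [(1, "2024-01-01", "old"), (1, "2024-02-01", "new"), (2, "2024-01-15", "only")],
        ["id", "updated_at", "value"],
    )
    out = deduplicate_latest(df, "id", "updated_at")
    rows = {r["id"]: r["value"] for r in out.collect()}
    assert rows == {1: "new", 2: "only"}
```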
Knowledge Examples:
Knowledge of various ETL services provided by cloud providers, including Apache PySpark, AWS Glue, GCP DataProc/DataFlow, and Azure ADF/ADLS. Understanding of data warehousing principles and practices. Proficiency in SQL for analytics, including windowing functions. Familiarity with data schemas and models. Understanding of domain-related data and its implications.
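To illustrate the windowing-function proficiency mentioned above, a small sketch run through Spark SQL; the sales table and its columns are hypothetical, and the same query shape applies to most SQL warehouses:

```python
# Windowing-function sketch via Spark SQL (table and columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("window_sketch").getOrCreate()

spark.createDataFrame(
    [("north", "2024-01-01", 100.0),
     ("north", "2024-01-02", 150.0),
     ("south", "2024-01-01", 200.0),
     ("south", "2024-01-02", 50.0)],
    ["region", "sale_date", "amount"],
).createOrReplaceTempView("sales")

# Running total per region plus a rank of sale amounts within each region.
spark.sql("""
    SELECT
        region,
        sale_date,
        amount,
        SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total,
        RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
    FROM sales
    ORDER BY region, sale_date
""").show()
```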
Additional Comments:
Tech skills:
• Proficient in Python (including popular Python packages, e.g. Pandas, NumPy) and SQL
• Strong background in distributed data processing and storage (e.g. Apache Spark, Hadoop)
• Large-scale (TBs of data) data engineering skills: model data and create production-ready ETL pipelines
• Development experience with at least one cloud (Azure highly preferred; AWS, GCP)
• Knowledge of data lake and data lakehouse patterns
• Knowledge of ETL performance tuning and cost optimization (see the sketch after this list)
• Knowledge of data structures and algorithms and good software engineering practices
Soft skills:
• Strong communication skills to articulate complex situations concisely
• Comfortable with picking up new technologies independently
• Eye for detail, good data intuition, and a passion for data quality
• Comfortable working in a rapidly changing environment with ambiguous requirements
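To illustrate the ETL performance tuning and cost optimization point above, a minimal PySpark sketch; the inline sample data, column names, and output path are hypothetical. It shows two common tuning levers: broadcasting a small lookup table to avoid a shuffle, and partitioning the output so downstream jobs scan less data.

```python
# Illustrative performance-tuning sketch (data, columns, and paths are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning_sketch").getOrCreate()

# Tiny inline samples stand in for real inputs; in practice these would be
# read from a data lake.
facts = spark.createDataFrame(
    [(1, "2024-01-01", "US", 10.0), (2, "2024-01-01", "IN", 20.0), (3, "2024-01-02", "US", 5.0)],
    ["event_id", "event_date", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("IN", "India")],
    ["country_code", "country_name"],
)

enriched = (
    facts
    # Broadcast join: ship the small lookup to every executor instead of
    # shuffling the large fact table.
    .join(F.broadcast(countries), "country_code")
    # Prune columns early so downstream stages move less data.
    .select("event_id", "event_date", "country_name", "amount")
)

(enriched
 # Partitioned output lets later jobs read only the dates they need,
 # reducing both runtime and the volume of data scanned (i.e. cost).
 .write.mode("overwrite")
 .partitionBy("event_date")
 .parquet("/tmp/events_enriched"))
```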