Job Summary
Synechron is seeking a highly experienced PySpark Data Engineer to develop, optimize, and maintain scalable data pipelines within the Cloudera Data Platform (CDP). This role is essential in ensuring high data quality, availability, and performance across enterprise data ecosystems. The successful candidate will leverage extensive big data and cloud-native processing expertise to support business analytics, reporting, and data science initiatives, driving impactful insights and operational efficiency.
Software Requirements
Required:
Advanced proficiency in PySpark, including handling DataFrames, RDDs, and optimization techniques for large-scale data processing
Strong experience with Cloudera Data Platform components such as Cloudera Manager, Hive, Impala, HDFS, and HBase
In-depth knowledge of Hadoop ecosystem technologies (Hadoop, Kafka) and distributed computing frameworks
SQL expertise and experience with data warehousing concepts (Hive, Impala)
Linux scripting skills (Bash, Python) for automation and operational workflows
Experience with orchestration tools like Apache Oozie or Apache Airflow
Preferred:
Cloud data services (AWS EMR, Azure HDInsight, GCP Dataproc) for scalable data processing
Data modeling, metadata management, and data governance tools
CI/CD pipelines setup using Jenkins, GitLab, or similar tools
Overall Responsibilities
Design, develop, and optimize highly scalable data pipelines using PySpark within the Cloudera Data Platform to support business intelligence and analytics.
Manage end-to-end data ingestion processes from various sources such as relational databases, APIs, and file systems.
Execute data transformation, cleansing, and aggregation processes on large datasets to facilitate reporting and data science activities.
Conduct performance tuning of PySpark jobs and optimize cluster resource utilization.
Implement data quality checks, validation routines, and monitoring to ensure data accuracy and consistency.
Automate data workflows and pipeline orchestration to reduce manual intervention and improve efficiency.
Troubleshoot data pipeline issues and drive operational stability across data ecosystems.
Collaborate with data analysts, data scientists, and platform engineers to understand data requirements and improve system performance.
Maintain detailed documentation for data pipelines, workflows, configurations, and operational procedures.
Support data governance, security, and compliance initiatives aligned with enterprise standards.
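The data quality checks and validation routines mentioned above amount to applying per-record rules and collecting violations. A simplified stdlib sketch of that pattern, using plain dictionaries in place of PySpark rows (the field names, `REQUIRED` set, and `validate_row` function are all illustrative):

```python
# Minimal sketch of row-level data quality validation. In practice these
# rules would run as PySpark DataFrame filters/expressions over large
# datasets; plain dicts stand in for rows here.
REQUIRED = {"id", "amount", "ts"}

def validate_row(row: dict) -> list[str]:
    """Return a list of rule violations for one record (empty if clean)."""
    errors = []
    missing = REQUIRED - row.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "amount" in row and not isinstance(row["amount"], (int, float)):
        errors.append("amount is not numeric")
    return errors

rows = [
    {"id": 1, "amount": 10.5, "ts": "2024-01-01"},
    {"id": 2, "ts": "2024-01-02"},                    # missing amount
    {"id": 3, "amount": "n/a", "ts": "2024-01-03"},   # wrong type
]
bad = {r["id"]: validate_row(r) for r in rows if validate_row(r)}
```

Routing failing records to a quarantine table rather than dropping them silently is a common design choice, since it preserves the evidence needed for root cause analysis.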
Technical Skills (By Category)
Programming & Data Processing (Essential):
PySpark (DataFrames, RDDs, optimization)
SQL (Hive, Impala, relational databases)
Linux scripting (Bash, Python) for automation
Data Ecosystem & Storage (Essential):
Hadoop ecosystem (HDFS, Hive, Impala, HBase)
Kafka or similar messaging systems for data streaming
Cloud & Orchestration (Preferred):
Cloud-native data processing (AWS EMR, Azure HDInsight, GCP Dataproc)
Orchestration tools (Apache Airflow, Oozie)
Tools & Frameworks (Preferred):
CI/CD with Jenkins, GitLab CI
Data governance and metadata tools (e.g., Apache Atlas, Collibra)
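The orchestration tools listed above (Airflow, Oozie) all build on the same core idea: pipeline tasks form a directed acyclic graph, and each task runs only after its upstream dependencies complete. A toy sketch of that scheduling idea using the standard library's `graphlib` (task names are illustrative):

```python
# Sketch of DAG-based task ordering, the concept underlying Airflow and
# Oozie workflows. Each key maps a task to the set of tasks it depends on.
from graphlib import TopologicalSorter

deps = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "publish": {"aggregate", "clean"},
}
# static_order() yields tasks in an order where every dependency
# appears before the tasks that need it.
order = list(TopologicalSorter(deps).static_order())
```

Real orchestrators add scheduling, retries, backfills, and alerting on top of this ordering, but a valid topological order is the invariant they all enforce.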
Experience Requirements
Minimum of 5 years in data engineering roles with significant PySpark expertise.
Proven experience building and managing large-scale data pipelines in enterprise environments.
Strong background in big data ecosystems, cloud data services, and data warehousing.
Demonstrated ability to optimize Spark jobs and troubleshoot distributed data processing issues.
Experience supporting financial or regulated industries is advantageous.
Extensive hands-on experience in large data ecosystems supporting analytics and reporting.
Day-to-Day Activities
Develop, optimize, and monitor scalable data pipelines for ingestion, transformation, and distribution of data.
Troubleshoot data processing issues proactively, perform root cause analysis, and implement fixes.
Collaborate with data analysts, data scientists, and platform teams to design data models and pipelines based on business needs.
Automate operational workflows using orchestration tools to enhance pipeline reliability.
Conduct performance tuning, cluster management, and resource optimization for Spark jobs.
Validate data quality, correctness, and completeness through routine reviews and monitoring.
Document architecture, workflows, and procedures for operational governance.
Support data privacy, security, and compliance measures within data ecosystems.
Qualifications
Bachelor’s or Master’s degree in Computer Science, Data Engineering, or related field.
5+ years of hands-on experience with PySpark, big data ecosystems, and distributed processing.
Proven expertise supporting large-scale data pipelines in enterprise or financial industry environments.
Experience with Cloudera Data Platform components (Hive, Impala, HDFS, HBase).
Strong SQL and data modeling skills.
Experience supporting cloud data processing environments (AWS, Azure, GCP) is advantageous.
Relevant certifications (e.g., AWS Big Data Specialty, Cloudera Certified Data Engineer) are preferred.
Professional Competencies
Strong analytical and troubleshooting skills for complex data pipeline issues.
Ability to work independently and collaboratively across teams.
Effective communication skills to convey technical details to non-technical stakeholders.
Adaptability to evolving technologies and data processing requirements.
Focus on operational excellence, data quality, and process automation.
Ownership mindset to ensure data integrity, performance, and reliability.
SYNECHRON’S DIVERSITY & INCLUSION STATEMENT
Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and is an affirmative action employer. Our Diversity, Equity, and Inclusion (DEI) initiative ‘Same Difference’ is committed to fostering an inclusive culture – promoting equality, diversity and an environment that is respectful to all. We strongly believe that a diverse workforce helps build stronger, successful businesses as a global company. We encourage applicants from across diverse backgrounds, race, ethnicities, religion, age, marital status, gender, sexual orientations, or disabilities to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.
All employment decisions at Synechron are based on business needs, job requirements and individual qualifications, without regard to the applicant’s gender, gender identity, sexual orientation, race, ethnicity, disabled or veteran status, or any other characteristic protected by law.
Candidate Application Notice