Data Engineer (BigQuery & MongoDB)
Remote
(Anywhere)
Senior
Requirements
3+ years of professional experience
Advanced English
BigQuery
MongoDB
Desirable Skills
Google Cloud
Airflow
Tasks and Responsibilities
Required Qualifications:
1. Bachelor's degree in Computer Science, Information Systems, or a related field.
2. Proven experience as a Data Engineer or in a similar role, with a strong background in designing and maintaining data pipelines.
3. Proficiency in SQL and in working with cloud-based data warehouses (BigQuery experience preferred).
4. Experience with MongoDB for NoSQL database management.
5. Strong experience with Docker for containerization, along with orchestration tools such as Kubernetes (preferred).
6. Familiarity with workflow orchestration tools such as Apache Airflow or Prefect.
7. Experience with Google Cloud Platform (GCP) services (Cloud Functions, Cloud Run, BigQuery, Dataflow).
8. Strong programming skills in Python or other data-related programming languages.
9. Experience working with large datasets and integrating data from multiple sources.
10. Knowledge of data modeling, data governance, and data architecture best practices.
11. Familiarity with CI/CD pipelines for automated deployment of data infrastructure.
12. Fluent English is a must.
13. Fluent Spanish is a plus.
Preferred Qualifications:
Experience with data processing from IoT devices or sensors, particularly spectrometry or other scientific instrumentation.
Familiarity with data visualization tools like Tableau or Google Looker Studio.
Experience managing data in cloud environments like GCP, AWS, or Azure.
Previous experience in a startup environment, particularly in the agriculture or coffee industry, is a plus.
What We Offer:
Opportunity to work with a passionate team in a rapidly growing startup at the intersection of data science and agriculture.
Flexible work environment, with remote work options.
Competitive salary and benefits.
Opportunities for career growth and development.
Key Responsibilities:
Data Pipeline Development: Design, build, and maintain robust ETL pipelines to collect, process, and store data from various sources, including spectrometry data from our NIR scans, user feedback, and external data inputs.
Data Integration: Collaborate with cross-functional teams to integrate data from multiple sources, including MongoDB, BigQuery, APIs, and Excel files, ensuring unified access for the data science team (see the first sketch after this list).
Database Management: Set up and manage scalable data storage solutions using cloud-based data warehouses (BigQuery), as well as NoSQL databases (MongoDB) for real-time and unstructured data processing.
Containerization and Deployment: Use Docker to containerize data pipelines and services for smooth deployment across environments. Manage containers and orchestration for scalable and repeatable workflows.
Data Quality Assurance: Implement processes to clean, validate, and monitor data quality, ensuring consistency and accuracy across all datasets used by the data science and product teams.
Collaboration with Data Scientists: Work alongside data scientists to understand their data needs, enabling effective model development, testing, and deployment by providing clean, well-organized datasets.
Optimization and Scalability: Design data solutions that can scale with the business, ensuring fast, reliable data access for real-time analytics and model training.
Cloud Infrastructure Management: Work with Google Cloud Platform (GCP) to set up and manage cloud infrastructure for data storage and processing, optimizing the use of GCP services like Cloud Functions, Cloud Run, Google Dataflow, and BigQuery.
Data Security and Compliance: Ensure data handling follows industry best practices for security, privacy, and compliance with applicable regulations.
Workflow Automation: Implement workflow automation and orchestration tools like Apache Airflow or Prefect to ensure smooth and consistent data flow across the entire pipeline.
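As a rough illustration of the Data Integration responsibility above, the sketch below copies documents from a MongoDB collection into a BigQuery table using pymongo and google-cloud-bigquery. This is a minimal sketch only; the database, collection, table, and field names are hypothetical placeholders, not details of this company's actual stack.

```python
from pymongo import MongoClient
from google.cloud import bigquery


def load_scans_to_bigquery(mongo_uri: str, table_id: str) -> int:
    """Copy documents from a MongoDB collection into an existing BigQuery table."""
    # "scans_db" and "nir_scans" are placeholder names used purely for illustration.
    mongo = MongoClient(mongo_uri)
    collection = mongo["scans_db"]["nir_scans"]

    # Flatten documents into JSON-serializable rows; drop the Mongo ObjectId.
    rows = [
        {
            "scan_id": str(doc["_id"]),
            "device_id": doc.get("device_id"),
            "captured_at": str(doc.get("captured_at")),
        }
        for doc in collection.find({})
    ]

    bq = bigquery.Client()
    # Streaming insert; returns a list of per-row errors (empty on success).
    errors = bq.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
    return len(rows)
```

In a production pipeline one would load in batches or incrementally rather than scanning the full collection, but the shape of the step is the same.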
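Likewise, the Workflow Automation responsibility typically means scheduling steps like the one above with a tool such as Apache Airflow. A minimal sketch, assuming Airflow 2.4+ and that the load function above is packaged in a hypothetical pipelines.load module:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical import path; assumes the function above is packaged as a module.
from pipelines.load import load_scans_to_bigquery

with DAG(
    dag_id="nir_scans_to_bigquery",  # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # Airflow 2.4+ argument; older versions use schedule_interval
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_and_load",
        python_callable=load_scans_to_bigquery,
        op_kwargs={
            "mongo_uri": "mongodb://localhost:27017",  # placeholder connection string
            "table_id": "my-project.analytics.nir_scans",  # placeholder table
        },
    )
```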