NoFluffJobs On-site Mid

Data Engineer

VIRTUSA

⚲ Kraków

21 840 - 24 024 PLN (B2B)

Requirements

  • AI
  • Machine learning
  • ETL
  • Automated testing
  • Google Cloud Storage
  • Apache Spark
  • Flink
  • Kafka
  • Terraform
  • Docker
  • Kubernetes
  • SQL
  • Python (nice to have)
  • Java (nice to have)
  • Scala (nice to have)
  • Data pipelines (nice to have)
  • Apache Beam (nice to have)
  • Google Pub/Sub (nice to have)
  • REST API (nice to have)
  • Microservices architecture (nice to have)

Job description

About the project:

We are seeking a seasoned Data Engineer who isn't just a builder of pipelines, but an architect of intelligence. With 3 years of experience, you have moved beyond "getting things to work" to "making things scale." You will be the backbone of our AI initiatives, ensuring that our models are fed high-quality, high-velocity data. As a proactive member of a fast-paced team, you will thrive in a startup-like environment where ownership is the default and ambiguity is seen as an opportunity to build structure.

Requirements:

  • Proficiency in Python and SQL; experience with Java or Scala is a plus.
  • Hands-on experience with frameworks such as Apache Spark, Flink, or Kafka.
  • Experience supporting AI projects (e.g., handling embeddings, managing datasets for LLM fine-tuning, or working with tools like LangChain or LlamaIndex).
  • Comfortable with Terraform or Docker/Kubernetes for managing data environments.
  • Communicative English.

Nice to have:

  • BigQuery: optimization and partitioning strategies (see the partitioned-table sketch below).
  • Vertex AI: experience building and deploying data pipelines within the Vertex ecosystem.
  • Dataflow/Dataproc: experience with managed Apache Beam or Spark services.
  • Google Pub/Sub: building real-time event-driven architectures (see the subscriber sketch below).
  • Understanding of REST APIs and microservices architecture.

Daily tasks:

  • Design and scale data architectures tailored for Machine Learning (ML) lifecycles, including feature stores, vector databases, and model training pipelines.
  • Architect, build, and maintain robust ETL/ELT pipelines that handle structured and unstructured data with a focus on low latency and high reliability (a minimal pipeline sketch follows this list).
  • Identify bottlenecks in the data lifecycle before they impact the team: you don't wait for a ticket; you see a manual process and automate it.
  • Work closely with Data Scientists and AI Researchers to understand model requirements and translate them into technical data specifications.
  • Implement automated testing and monitoring for data integrity, ensuring that "garbage in, garbage out" never becomes a reality for our AI models (see the validation sketch below).
  • Build and maintain data lakes and feature stores on Google Cloud Storage.
  • Implement real-time and batch processing architectures for AI-driven applications.
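
To make the pipeline expectations concrete, here is a minimal batch ETL sketch on the managed Apache Beam (Dataflow) stack the posting mentions. The bucket paths, field names, and the parse_event helper are hypothetical placeholders, not the team's actual codebase; it shows the shape of a read → parse → filter → write pipeline.

```python
# Minimal batch ETL sketch with Apache Beam. Runs locally on the
# DirectRunner; on Dataflow you would pass the appropriate pipeline options.
# Bucket paths and field names are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(line: str):
    """Parse one JSON line; return None for malformed input instead of failing."""
    try:
        event = json.loads(line)
        return event if "event_id" in event else None
    except json.JSONDecodeError:
        return None


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "ReadRaw" >> beam.io.ReadFromText("gs://example-bucket/raw/events.jsonl")
            | "Parse" >> beam.Map(parse_event)
            | "DropMalformed" >> beam.Filter(lambda e: e is not None)
            | "Serialize" >> beam.Map(json.dumps)
            | "WriteClean" >> beam.io.WriteToText("gs://example-bucket/clean/events")
        )


if __name__ == "__main__":
    run()
```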
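On the "garbage in, garbage out" point, one common pattern is a per-record validation step that rejects or quarantines bad rows before they reach training data. A minimal sketch in plain Python, with hypothetical field names; in a Beam pipeline a check like this would typically sit right after parsing, routing failures to a dead-letter output.

```python
# Minimal per-record data-quality check; field names are hypothetical.
from datetime import datetime, timezone

REQUIRED_FIELDS = {"event_id", "user_id", "event_ts"}


def validate_record(record: dict) -> list:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
        return errors
    try:
        ts = datetime.fromisoformat(record["event_ts"])
        if ts.tzinfo is None:  # treat naive timestamps as UTC
            ts = ts.replace(tzinfo=timezone.utc)
        if ts > datetime.now(timezone.utc):
            errors.append("event_ts is in the future")
    except (TypeError, ValueError):
        errors.append("event_ts is not a valid ISO-8601 timestamp")
    return errors


assert validate_record({"event_id": "1", "user_id": "u7",
                        "event_ts": "2024-01-01T00:00:00+00:00"}) == []
```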
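On the BigQuery side, partitioning and clustering are the usual first levers for the optimization the nice-to-have list refers to: partition pruning limits scans to the queried days, and clustering co-locates rows for common filters. A sketch using the google-cloud-bigquery client; the project, dataset, table, and schema are hypothetical placeholders.

```python
# Create a date-partitioned, clustered events table in BigQuery.
# Project, dataset, and schema names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

ddl = """
CREATE TABLE IF NOT EXISTS `example-project.analytics.events` (
  event_id STRING NOT NULL,
  user_id  STRING,
  event_ts TIMESTAMP NOT NULL,
  payload  JSON
)
PARTITION BY DATE(event_ts)          -- prune scans to the queried days
CLUSTER BY user_id                   -- co-locate rows for common filters
OPTIONS (partition_expiration_days = 90)
"""

client.query(ddl).result()  # blocks until the DDL job completes
```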
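Finally, for the event-driven side, a minimal Pub/Sub streaming-pull consumer following the standard google-cloud-pubsub client usage; the project and subscription IDs are hypothetical.

```python
# Minimal Pub/Sub streaming-pull subscriber; project and subscription IDs
# are hypothetical placeholders.
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("example-project", "raw-events-sub")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"received: {message.data!r}")
    message.ack()  # acknowledge so the message is not redelivered


future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        future.result(timeout=30)  # listen for 30 s, then shut down
    except TimeoutError:
        future.cancel()
        future.result()
```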