Lead Data Engineer Job at LGND AI, Inc., San Francisco, CA

  • LGND AI, Inc.
  • San Francisco, CA

Job Description

About LGND
LGND is an early-stage startup revolutionizing geospatial AI infrastructure. We bridge the gap between large Earth observation models and specific application developers, enabling intuitive interaction with geospatial data. Our core mission is to empower decision-makers with rapid insights from vast, complex datasets. As part of our small, dynamic team, you will play a foundational role in building tools that have never existed before.

Role Summary

We are seeking a Lead Data Engineer to design, build, and scale our inference pipeline for geospatial embeddings. This pipeline is the backbone of LGND’s technological product, integrating with a point-and-click web application to generate embeddings for geographic areas of interest based on user-defined parameters. These embeddings will populate a custom vector database designed for massive scale and speed.

The ideal candidate is a seasoned engineer who has built production-grade data pipelines, thrives under uncertainty, and is eager to collaborate across engineering, DevOps, and science disciplines. AI and geospatial experience are not required if you are willing to learn fast with our help. Over time, this role will evolve into an engineering lead position overseeing all technological components, with a focus on engineering excellence.

This role is remote. We have team members in San Francisco, Philadelphia, and Copenhagen.

Key Responsibilities

    • Build the Inference Pipeline:
      • Develop a scalable, efficient pipeline to generate geospatial embeddings based on user input, integrating parameters such as geographic area, model type, time range, tiling strategy, and imagery source (a hypothetical request sketch follows this list).
      • Balance pre-processed tokens (e.g., cloud-free Sentinel imagery) with on-the-fly inference for optimal performance.
      • Ensure the pipeline supports billions of embeddings at scale and leverages advanced compute capabilities for fast inference, mostly on commercial clouds but also on local resources.
    • Integration and Collaboration:
      • Work closely with front-end engineers to ensure seamless integration of the pipeline into a user-friendly web application.
      • Collaborate with leadership to determine which components of the pipeline and storage system should remain proprietary versus open-source.
      • Partner with external groups like AWS and Asterik Labs for open-source contributions and technical integrations.
    • Scalability and Professionalism:
      • Design a pipeline that other senior data engineers can immediately inherit and build upon.
      • Move large amounts of data around professionally, focusing on scale, extensibility, and maintainability.
      • Ensure compliance with best practices in data engineering, DevOps, and MLOps.
    • Enhance Existing Projects:
      • Build upon existing foundational work to increase pipeline speed, scale, and extensibility. Key repositories include:
        • embeddings-worker: A Python module that creates vector embeddings of satellite images using the Clay Foundation Model. The system splits geographic regions into smaller chips, processes them in a distributed manner, and tracks status in a database.
        • embeddings-api: A REST API module that manages the vector database and orchestrates embedding-generation tasks. It includes endpoints for scheduling geographic regions for processing, retrieving task status, and searching for similar vectors.
    • Future Leadership:
      • Serve as the lead for the inference pipeline, one of four core technological components at LGND (inference pipeline, fine-tuning and retrieval algorithms, vector search database, and SDK).
      • Optionally grow into an engineering manager role, overseeing future hires and cross-functional development efforts.
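
For concreteness, here is a minimal Python sketch of what a user-defined embedding request to this pipeline might look like. All names and defaults below are illustrative assumptions, not LGND's actual schema:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class EmbeddingJob:
        # (min_lon, min_lat, max_lon, max_lat) of the area of interest
        bbox: tuple[float, float, float, float]
        model: str                          # e.g. "clay-v1" (hypothetical identifier)
        date_range: tuple[date, date]       # time range of source imagery
        chip_size: int = 256                # tiling strategy: chip edge length in pixels
        imagery_source: str = "sentinel-2"  # which imagery catalog to pull from

    # Example: embeddings for San Francisco over the first half of 2024
    job = EmbeddingJob(
        bbox=(-122.52, 37.70, -122.35, 37.83),
        model="clay-v1",
        date_range=(date(2024, 1, 1), date(2024, 6, 30)),
    )

In a layout like the repositories above, embeddings-api would accept such a request and enqueue it, while embeddings-worker would fan the area of interest out into chips and run inference.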

Scope of Work: First Two Months

  1. Increase the Speed and Scale of the Pipeline:
    • Optimize the inference pipeline to efficiently handle the generation of embeddings at massive scale.
    • Focus on performance improvements to support billions of embeddings and reduce inference runtime.
  2. Tokenize Source Imagery:
    • Develop a process to "tokenize" source imagery for a given geographic region and time range.
    • Produce image chips according to the large Earth observation model architecture.
    • Store these image chips in Amazon S3 for easy recall during subsequent inference runs (see the chip-staging sketch after this list).
  3. Run Model Inference:
    • Implement the pipeline to run inference on a couple of existing, pre-trained models.
    • Output the resulting embeddings and store them in a scalable, performant vector search database.
    • Collaborate with external partners, such as AWS, to ensure pipeline compatibility with the vector database infrastructure.
  4. Nice-to-Have Feature:
    • Develop functionality to process source imagery into mosaics to address cloud cover and other image quality issues, improving the quality of inputs for inference.
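
A minimal sketch of the chip-staging step in item 2, assuming rasterio for windowed reads and boto3 for S3 uploads; the bucket name, key layout, and .npy serialization are illustrative assumptions:

    import io

    import boto3
    import numpy as np
    import rasterio
    from rasterio.windows import Window

    CHIP = 256  # chip edge length in pixels; must match the model's expected input
    s3 = boto3.client("s3")

    with rasterio.open("scene.tif") as src:
        for row in range(0, src.height - CHIP + 1, CHIP):
            for col in range(0, src.width - CHIP + 1, CHIP):
                chip = src.read(window=Window(col, row, CHIP, CHIP))  # (bands, H, W)
                buf = io.BytesIO()
                np.save(buf, chip)  # serialize for cheap recall at inference time
                s3.put_object(
                    Bucket="lgnd-chips",           # hypothetical bucket
                    Key=f"scene/{row}_{col}.npy",  # hypothetical key layout
                    Body=buf.getvalue(),
                )

Staging chips this way lets later inference runs skip the expensive windowed reads, which is the pre-processed-tokens side of the trade-off described above.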

Scope of Work: First Two Months (Expanded)

  1. Operationalize the CLIP-based Retrieval Pipeline
    • Implement and optimize a scalable inference pipeline to generate CLIP embeddings (and embeddings from other pre-trained models) for remote sensing imagery.
    • Design the system to tokenize source imagery into manageable image chips for specific geographic areas and time ranges. Store these chips efficiently in Amazon S3 for reuse.
    • Ensure flexibility to incorporate additional embedding models in the future.
  2. Experiment with Multi-Modal Retrieval
    • Enable interaction with both image and text queries in a combined retrieval framework using pre-trained vision-language models (e.g., CLIP).
    • Implement functionality to combine multiple embeddings (image-to-image and text-to-image similarity) and experiment with methods like WEICOM for modality control, e.g., weighted combinations of embeddings (see the retrieval sketch after this list).
  3. Database and API Design
    • Collaborate with external partners (e.g., AWS) to design a scalable vector search database capable of handling billions of embeddings.
    • Develop APIs to allow efficient storage and retrieval of embeddings based on user-defined queries (geographic area, model, time range, and textual context).
  4. Pre-Processing for Image Quality (Nice-to-Have)
    • Develop a feature to process source imagery into cloud-free mosaics, improving image quality for inference and retrieval.
  5. Performance Optimization
    • Optimize the pipeline for speed, ensuring embeddings can be generated at scale. Explore trade-offs between pre-processed tokens and on-the-fly inference.
    • Focus on building a robust, scalable system that reduces latency while maintaining flexibility.
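
A hedged sketch of the modality-weighted retrieval in item 2, in the spirit of WEICOM: similarity scores from a text query and an image query are computed separately and blended with a user-controlled weight. The Hugging Face CLIP checkpoint and the in-memory score blending are assumptions for illustration, not LGND's implementation:

    import torch
    import torch.nn.functional as F
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed_text(text: str) -> torch.Tensor:
        inputs = processor(text=[text], return_tensors="pt", padding=True)
        with torch.no_grad():
            return F.normalize(model.get_text_features(**inputs), dim=-1)

    def embed_image(img: Image.Image) -> torch.Tensor:
        inputs = processor(images=img, return_tensors="pt")
        with torch.no_grad():
            return F.normalize(model.get_image_features(**inputs), dim=-1)

    def combined_scores(db: torch.Tensor, text_q: str, image_q: Image.Image,
                        w: float = 0.5) -> torch.Tensor:
        """Blend text-to-image and image-to-image similarity; w=1.0 is text-only."""
        s_text = db @ embed_text(text_q).squeeze(0)    # cosine similarity per DB row
        s_image = db @ embed_image(image_q).squeeze(0)
        return w * s_text + (1.0 - w) * s_image        # weighted modality combination

In production, the two similarity lists would come from the vector search database rather than an in-memory matrix; the weight w then becomes the user's modality-control knob.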

Requirements

Required Technical Skills:

  • Proficiency in Python and familiarity with Docker.
  • Expertise in building production-grade data pipelines at scale (10+ years of experience preferred).
  • Familiarity with tools and frameworks like:
    • Geospatial libraries: numpy, pandas, rasterio, geopandas, xarray.
    • Machine learning: PyTorch (torch, torchdata, torchvision), timm, einops.
    • Cloud integration: boto3 for AWS.
    • Database management: SQLAlchemy, GeoAlchemy2, pgvector, psycopg2.
  • Experience with inference pipelines, including pre-processing and real-time inference strategies.

Preferred Experience:

  • Familiarity with satellite image formats and protocols (e.g., STAC, Cloud Optimized GeoTIFFs, Zarr).
  • Experience with AWS infrastructure (bonus, not required).
  • Background in MLOps and geospatial AI applications.

Soft Skills:

  • Self-led and able to navigate uncertainty.
  • Excited by the opportunity to build tools and systems that have never been built before.
  • Collaborative, humble, and eager to learn.

Benefits

Cultural Values

  • Humility: You value collaboration and learning from others.
  • Integrity: You uphold honesty and transparency in your work.
  • Effectiveness: You are results-driven, with a focus on building scalable, impactful solutions.

Compensation and Benefits

  • Competitive salary based on experience.
  • Equity options in a seed-stage startup.
  • Flexible work arrangements.
  • Opportunity to play a foundational role in shaping LGND’s technological infrastructure.

Job Tags

Full time, Local area, Immediate start, Remote job, Flexible hours
