Internet Archive

A searchable index of Hacker News “Who is hiring?” job postings.


Data Engineer

Company	Internet Archive
Website	archive.org ↗
Roles	Data Engineer: turn researcher Jupyter notebooks into robust systems
Type	full-time
Role taxonomy	Data / Analytics
Specialties	Data Engineering
Location	Remote (US or Canada)
Salary	—
Apply via	Email — avdempsey@archive.org
Hiring notes	—
Tech	Python, Scala, ML/AI
Regions	US, Canada
Posted by	avdempsey
Posted	Dec 1, 2022
Source	View on Hacker News ↗

Original posting

Internet Archive | Data Engineer | Remote (US, CA) | Full-Time | archive.org

Internet Archive is a non-profit building a free library of all of the published works of humanity to share with the world. We're not there yet, but we've managed to accumulate some data along the way. Can you help us engineer it?

The Archiving and Data Services department provides services to mission-aligned organizations (primarily other libraries and cultural heritage institutions). These services include: web crawling SaaS, managed large-scale crawls, long-term digital preservation, and particularly relevant for this role: making use of these web archives and digital collections.

We're looking for a Data Engineer to help us with some of the following:

- Turn researcher Jupyter notebooks into robust systems (these notebooks are mostly in Scala)
- Develop data munging/wrangling/deriving workflows (we use Spark and Temporal.io)
- Help administrate a 7.5 Petabyte Hadoop cluster
- Potentially write jobs for our main, in-house long term storage cluster
- There's always APIs that need work (these are mostly in Python)
- ML experience is an interesting bonus

We're fully remote, employees can be based anywhere in US or Canada. This is a new opening as of Dec 1, so new we're still working on getting it posted.

If interested, please reach out to Alex at avdempsey [at] archive [dot] org.