Job posting · HN Jobs

A searchable index of Hacker News “Who is hiring?” job postings.


Job posting (auto-parsed)

Website	commoncrawl.org
Role taxonomy	Software Engineering
Specialties	Software Engineering
Location	San Francisco CA · Remote
Salary	—
Apply via	See posting
Hiring notes	—
Tech	Python Java Ruby AWS
Parsed locations	San Francisco CA
Posted by	LisaG
Posted	Mar 1, 2012

Original posting

San Francisco CA (remote okay)

Crawl Engineer and Big Data Enthusiast

We're looking for someone enthusiastic about open source, net neutrality, open data, and keeping the web truly open. Common Crawl is dedicated to building and maintaining an open repository of web crawl data in order to enable a new wave of innovation, education and research. If you're looking to do work that matters, come join us!

We're set to do amazing things this year, and there is no better place to hone your big data skills than helping us manage and process our 50 TB corpus. Plus, you'll be working within a passionate community and have the chance to interface with plenty of talented researchers, educators, startup folks, and an incredible advisory board.

Responsibilities

* Improve the stability, scaling, and visibility of our distributed web crawler
* Use, improve, and extend our post-crawl, Hadoop-based data processing pipeline
* Design and build a mechanism for specification and execution of custom crawls

Desired Skills & Experience

* You can architect and code for a system with tens of billions of documents
* Strong coding ability in Java
* Strong coding skills in at least one scripting language (Python, Ruby, Perl...)
* In-depth knowledge of HTTP and familiarity with web crawlers
* Development and administrative experience with Hadoop and HDFS
* Ops experience with Linux or other UNIX
* Some familiarity with AWS, including one or more of EC2, S3, EBS, and EMR
* You like to build useful, thorough documentation of code and systems
* Self-starter willing to take ownership of projects

http://www.commoncrawl.org