HN Jobs

A searchable index of Hacker News “Who is hiring?” job postings.


Job posting (auto-parsed — see raw text)

Website: commoncrawl.org
Role taxonomy: Software Engineering
Specialties: Software Engineering
Location: San Francisco CA · Remote
Salary:
Apply via: See posting
Hiring notes:
Tech: Python, Java, Ruby, AWS
Parsed locations: San Francisco CA
Posted by: LisaG
Posted: Mar 1, 2012
Source: Hacker News

Original posting

San Francisco CA (remote okay)

Crawl Engineer and Big Data Enthusiast

We're looking for someone enthusiastic about open source, net neutrality, open data and keeping the web truly open. Common Crawl is dedicated to building and maintaining an open repository of web crawl data in order to enable a new wave of innovation, education and research.

If you're looking to do work that matters, come join us! We're set to do amazing things this year, and there is no better place to hone your big data skills than helping us manage and process our 50 TB corpus. Plus, you'll be working within a passionate community and have the chance to interface with plenty of talented researchers, educators, startup folks, and an incredible advisory board.

Responsibilities

* Improve the stability, scaling, and visibility of our distributed web crawler
* Use, improve, and extend our post-crawl, Hadoop-based data processing pipeline
* Design and build a mechanism for the specification and execution of custom crawls

Desired Skills & Experience

* You can architect and code for a system with tens of billions of documents
* Strong coding ability in Java
* Strong coding skills in at least one scripting language (Python, Ruby, Perl...)
* In-depth knowledge of HTTP and familiarity with web crawlers
* Development and administrative experience with Hadoop and HDFS
* Ops experience with Linux or other UNIX
* Some familiarity with AWS, including one or more of EC2, S3, EBS, and EMR
* You like to build useful, thorough documentation of code and systems
* Self-starter willing to take ownership of projects

http://www.commoncrawl.org