A searchable index of Hacker News “Who is hiring?” job postings.
March 2012 thread
San Francisco, CA (remote okay)
Crawl Engineer and Big Data Enthusiast
We're looking for someone enthusiastic about open source, net neutrality, open data and keeping the web truly open. Common Crawl is dedicated to building and maintaining an open repository of web crawl data in order to enable a new wave of innovation, education and research. If you're looking to do work that matters, come join us!
We're set to do amazing things this year, and there is no better place to hone your big data skills than helping us manage and process our 50 TB corpus. Plus, you'll be working within a passionate community and have the chance to interface with plenty of talented researchers, educators, startup folks, and an incredible advisory board.
Responsibilities
* Improve the stability, scaling, and visibility of our distributed web crawler
* Use, improve, and extend our post-crawl, Hadoop-based data processing pipeline
* Design and build a mechanism for specifying and executing custom crawls
Desired Skills & Experience
* You can architect and code for a system with tens of billions of documents
* Strong coding ability in Java
* Strong coding skills in at least one scripting language (Python, Ruby, Perl...)
* In-depth knowledge of HTTP and familiarity with web crawlers
* You have development and administrative experience with Hadoop and HDFS
* Ops experience with Linux or other UNIX
* Some familiarity with AWS, including one or more of EC2, S3, EBS, and EMR
* You like to write useful, thorough documentation of code and systems
* Self-starter willing to take ownership of projects
http://www.commoncrawl.org