How do you develop and run an on-demand grid computing and data-mining service? That’s what we find out in this installment of What’s In Your Stack, with 80legs. And if you’re interested in telling us about your stack, feel free to just fill out the questionnaire!
Who are you guys?
I’m Shion Deysarkar, the CEO of 80legs, a web crawling and data collection services provider. We have three lines of business, all supported by the same stack: custom web crawling services, data feeds, and a large “general” crawl of the entire web. Most of our business centers around delivering and discovering structured data on the web. This data includes everything from social data to business listing data to metadata (e.g., where all the PDFs are).
We are working on extending our stack even further. Right now our stack looks like this:
- 50,000+ grid computing nodes
- Grid control servers
- Web crawling control servers
- Crawl job submission application
We are extending it to this:
- 50,000+ grid computing nodes
- Grid control servers
- Web crawling control servers
- Crawl job submission application
- Document storage system
- Document search/processing application
These extensions will provide us and our customers with a complete end-to-end system for searching and retrieving structured data from the entire web. So if you want to run a query like “create a CSV table of all florists in Texas with at least a 3-star rating... oh, and tell me everyone who has been there as well,” that will be possible.
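To make that kind of query concrete, here is a minimal Python sketch of the filter-then-export step, run over a handful of made-up business listings. It is not 80legs' API or schema: the record fields (`name`, `category`, `state`, `rating`) and the helper names `filter_listings` and `to_csv` are assumptions made up for illustration, and the sketch leaves out the "who has been there" part entirely.

```python
import csv
import io

# Hypothetical structured records, standing in for crawl output.
# Field names are illustrative assumptions, not 80legs' schema.
listings = [
    {"name": "Bluebonnet Blooms", "category": "florist", "state": "TX", "rating": 4.5},
    {"name": "Lone Star Petals", "category": "florist", "state": "TX", "rating": 2.5},
    {"name": "Bay Area Bouquets", "category": "florist", "state": "CA", "rating": 5.0},
    {"name": "Austin Auto Repair", "category": "mechanic", "state": "TX", "rating": 4.0},
]


def filter_listings(records, category, state, min_rating):
    """Keep records matching the category and state with rating >= min_rating."""
    return [
        r for r in records
        if r["category"] == category
        and r["state"] == state
        and r["rating"] >= min_rating
    ]


def to_csv(records, fields):
    """Serialize the chosen fields of each record to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# "All florists in Texas with at least a 3-star rating", as a CSV table.
matches = filter_listings(listings, "florist", "TX", 3.0)
csv_text = to_csv(matches, ["name", "state", "rating"])
print(csv_text)
```

In a real system the hard part is producing clean structured records from crawled pages in the first place; once they exist, the query itself reduces to a filter and an export like the one above.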
How would you describe your development process?
We have weekly iteration meetings each Monday where we discuss:
- Tasks completed last week and any challenges faced during them.
- Tasks to be completed this week.
- Overall company issues.
Developers are fairly autonomous within their task assignments.
What’s something that has worked well for you guys?
The weekly iteration meetings have gone very well for us. We started this practice about 2-3 months ago, and it has really helped us keep a steady pace on deliverables. It has also helped shape our culture toward efficiency and productivity.
[Across the board, quick, frequent meetings seem to be extremely effective for development teams. The “stand-up meeting” seems like one of the quickest-to-value tools that Scrum provides. -Coté]
What development (IDE, build tool, etc.) and project management tools did you guys use, if any?
- Pivotal Tracker for task assignment.
- Google Mail/Calendar and Skype for communication.
- Eclipse and SVN for development.
- Ubuntu for server OS.
- Apache/Tomcat for web servers.
- Dell 1U and 2U servers for local data center.
- 50,000+ personal computers for grid nodes.
Have you guys considered using git and/or GitHub?
We looked at GitHub a while back, but since most of our guys were familiar with SVN, we went with that. Switching now would be too much of a hassle, and we don’t see much benefit. It would be nice to take advantage of the community on GitHub, though, so we may set up a separate public repo there for public code.
What do you use your local datacenter for?
The local data center consists of 2 racks that manage crawl job submission and our distributed grid. Thanks to the grid architecture, we need only a fairly small data center footprint given what we do.
Through our investor (Creeris Ventures), we have access to deals, since Creeris companies as a whole do a decent amount of server purchasing. So we can get a couple thousand knocked off list prices, etc. We have had some issues with hard drives, but we do put our machines through the wringer, so maybe that’s to be expected.
Disclosure: GitHub is a client, as are Dell and Canonical.