Welcome to the MLBot FAQ (Frequently Asked Questions) page!
If the FAQ doesn't answer your question, feel free to email us at:
FAQ Contents
MLBot is a focused web crawler. Unlike traditional crawlers such as GoogleBot (Google), MSNBot (Microsoft) and Slurp (Yahoo), which crawl for breadth, a focused crawler is designed to be highly selective. Focused crawlers prioritize the links they crawl based on the probability that each link will lead to content that interests the crawler. Focused crawlers are more complex to create, but in exchange they make better use of bandwidth and server resources.
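To make the idea of link prioritization concrete, here is a minimal sketch of a crawl frontier that always hands back the most promising link first. It is not MLBot's actual code: the `Frontier` class, the example URLs and the scores are all hypothetical, and a real focused crawler would compute each score with a trained classifier rather than hard-coding it.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs ordered by estimated relevance (highest first)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: earlier links win ties

    def add(self, url, score):
        """Queue a URL once; score is an assumed probability in [0, 1]."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the best link first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        """Return the most promising URL, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


# Hypothetical links and scores, for illustration only.
frontier = Frontier()
frontier.add("http://example.com/about", 0.05)
frontier.add("http://example.com/videos/latest", 0.90)
frontier.add("http://example.com/podcast.mp3", 0.75)
crawl_order = [frontier.pop() for _ in range(3)]
```

A breadth-first crawler would visit these links in the order they were found. The frontier above visits the likely media links first and leaves the low-scoring page for last.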
Interested in learning more about focused web crawling? The following resources are a great starting point.
Wikipedia has several articles about web crawlers:
A short description of focused crawling
How web crawlers work
Distributed web crawling
Search engine indexing
Video search engines
An early paper on focused crawling. Authors: Soumen Chakrabarti, Martin van den Berg & Byron Dom:
Focused crawling: a new approach to topic-specific Web resource discovery
The Nalanda iVia Focused Crawler is an open-source focused crawler created by Soumen Chakrabarti:
The Nalanda iVia Focused Crawler
The Combine open-source focused crawler:
Combine, main page
A brief introduction to focused crawling. Authors: Ah Chung Tsoi, Daniele Forsali, Marco Gori, Markus Hagenbuchner & Franco Scarselli:
A Simple Focused Crawler
A short description of BINGO! Authors: Sergej Sizov, Stefan Siersdorfer, Martin Theobald & Gerhard Weikum:
The BINGO! Focused Crawler: From Bookmarks to Archetypes
A focused crawler designed to find web pages with content similar to a target page. Authors: Mohsen Jamali, Hassan Sayyadi, Babak Bagheri Hariri & Hassan Abolhassani:
A Method for Focused Crawling Using Combination of Link Structure and Content Similarity (PDF)
A focused crawler tailored to social media sites. Authors: Zhiyong Zhang and Olfa Nasraoui:
Profile-Based Focused Crawling for Social Media-Sharing Websites
Ignacio Dorado's master's thesis on focused crawling:
Focused Crawling: algorithm survey and new approaches with a manual analysis (PDF)
Using a focused crawler to locate missing documents. Authors: Ziming Zhuang, Rohit Wagle & C. Lee Giles:
What’s There and What’s Not? Focused Crawling for Missing Documents in Digital Libraries (PDF)
No list of citations would be complete without Google's seminal paper. Authors: Sergey Brin & Lawrence Page:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
MLBot looks for links to videos from sources like YouTube, CNN, Vimeo and ESPN as well as mp3 audio files. It directs the crawl to favor newer media over older media.
What are you doing with the data you collect?
Glad you asked! We are creating two products: Podly, a real-time media service and Buzz Cruncher, an online media analytics tool.
"Imagine a day when you would be in total control of creating your own TV channel lineup.
Instead of subscribing to a service from a cable, satellite or phone company that might offer you hundreds of channels you'll never watch, you would be able to select what you want and watch it on your own schedule."
That describes Podly pretty well. We believe the future of TV is on the internet and when it makes that transition it should become something more than just a copy of how we use TV today. We must abandon the old model of network broadcasting that offers limited choices and embrace what the internet has to offer. Podly is our dream of how to make TV better and more useful to everyone.
You should have your own channels customized to your interests. Would you like a channel about oil painting? Bicycle touring? How about cute puppies? Want something else? Just make a channel on any topic and Podly will fill it with interesting videos. It should be your choice, not some network's.
You shouldn't have to buy a digital video recorder to get control of your own schedule. We think that's crazy. The internet doesn't care about time slots. Live events like music concerts and sports will always be the exception but otherwise you should be able to watch at your own convenience.
Podly is currently in private beta and is expected to be available in April 2010. Podly will be free (ad-supported), with additional features available by subscription.
Buzz Cruncher is an analytics tool for media creators, hosts and web sites. It helps you find answers to these questions:
Creators: How popular am I? Who and where is my audience? How does my popularity change by geographic region and time? What other creators are popular with the same audience as mine?
Hosts: Who is linking to my media? Which titles are most popular?
Web sites: Is media helping me attract and retain visitors? How do I compare to other web sites?
Buzz Cruncher will be available in March 2010. Basic access will always be free. Upgraded reports will be available on a subscription basis. If there is enough interest, we'll make a public API available.
What IP addresses does MLBot crawl from?
MLBot crawls from the following IP addresses:
66.219.58.34
66.219.58.41
66.219.58.42
66.219.58.43
66.219.58.44
66.219.58.45
71.41.201.34
71.41.201.35
71.41.201.36
71.41.201.37
71.41.201.38
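If you want to confirm that a request in your logs really came from MLBot, one option is to compare the client address against the list above. The sketch below uses Python's standard `ipaddress` module. The function name `is_mlbot_address` is our own invention for illustration, and the set would need updating if the address list changes.

```python
import ipaddress

# Addresses copied from the list above.
MLBOT_ADDRESSES = {
    ipaddress.ip_address(addr)
    for addr in (
        "66.219.58.34", "66.219.58.41", "66.219.58.42", "66.219.58.43",
        "66.219.58.44", "66.219.58.45", "71.41.201.34", "71.41.201.35",
        "71.41.201.36", "71.41.201.37", "71.41.201.38",
    )
}


def is_mlbot_address(remote_addr):
    """Return True if remote_addr is one of the published MLBot addresses."""
    try:
        return ipaddress.ip_address(remote_addr) in MLBOT_ADDRESSES
    except ValueError:
        # Not a valid IP address at all (for example, a malformed log field).
        return False
```

For example, `is_mlbot_address("66.219.58.41")` is true, while `is_mlbot_address("66.219.58.35")` is false because that address is not in the list.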
We're four guys and a cat! David Stafford, Jim Mischel, Joe Langeway, Ron Murray and Socks (he's the cat).
Metadata Labs is a small startup in Austin, Texas and our dream is to organize all the world's media and make it a lot more useful. Media on the internet is a bit of a mess today. It's disorganized and there's a painful lack of standards. We're working to clean it up and make it an easier and more enjoyable experience. Any media, any device, anywhere, anytime! We hope you'll like what we're building!
Ack! We try hard to make MLBot reliable and we definitely want to hear about anything that's causing you a problem. Please check the Latest news section for updates on recent bug fixes.
The most common question we get is, "I'm seeing multiple HTTP requests for the same mp3 files in my logs. Why are you downloading them multiple times?"
We respect your bandwidth and we do not download entire media files from your server. We access only small file segments that are likely to contain metadata (at the beginning and end of the file). Each segment is fetched with a separate HTTP request, and each request appears in your server log. We're trying hard to keep bandwidth usage to a minimum, and several small requests cost you less than one big download.
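For the curious, partial downloads like this are normally done with the standard HTTP `Range` header. The sketch below shows what two such requests could look like. It only builds the requests and does not send them. The 4096-byte segment size, the example URL and the function name `build_segment_requests` are assumptions made for illustration, not MLBot's actual settings.

```python
import urllib.request


def build_segment_requests(url, segment_bytes=4096):
    """Build two requests: one for the first and one for the last segment."""
    ranges = [
        f"bytes=0-{segment_bytes - 1}",  # the first segment_bytes bytes
        f"bytes=-{segment_bytes}",       # suffix range: the last segment_bytes bytes
    ]
    return [urllib.request.Request(url, headers={"Range": r}) for r in ranges]


# Hypothetical URL; nothing is fetched here.
segment_requests = build_segment_requests("http://example.com/episode.mp3")
range_values = [req.get_header("Range") for req in segment_requests]
```

A server that supports range requests answers each one with status 206 (Partial Content), which is why a single mp3 file can show up more than once in your log.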
If you notice any unusual activity from MLBot you can reach us at:
We want to hear from you and we will respond promptly. Your feedback has helped us to improve MLBot.
If you have a problem or concern about MLBot, we would much prefer the chance to address it. If you do need to block MLBot, it respects the robots.txt exclusion standard. To block MLBot from some parts of your web site, you can use the following example:
User-agent: MLBot
Disallow: /upload_dir/
Disallow: /draft_podcasts/
In this example, /upload_dir/ and /draft_podcasts/ are directories that will be blocked to MLBot and won't be crawled. Other parts of your web site will still be crawled.
To block MLBot from your entire web site you can use this:
User-agent: MLBot
Disallow: /
Please note that our web crawler caches robots.txt files, so it can take 24 hours before any changes you make take effect.
More information on robots.txt can be found at http://www.robotstxt.org
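If you would like to check your rules before publishing them, Python's standard `urllib.robotparser` module can evaluate a robots.txt file locally. The sketch below tests the first example above. The `example.com` URLs and the `SomeOtherBot` name are placeholders, and the standard-library parser may differ from MLBot's own parser in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The partial-block example from above.
rules = """\
User-agent: MLBot
Disallow: /upload_dir/
Disallow: /draft_podcasts/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URLs: substitute paths from your own site.
upload_allowed = parser.can_fetch("MLBot", "http://example.com/upload_dir/show.mp3")
episode_allowed = parser.can_fetch("MLBot", "http://example.com/episodes/show.mp3")
other_bot_allowed = parser.can_fetch("SomeOtherBot", "http://example.com/upload_dir/show.mp3")
```

Here `upload_allowed` is false and `episode_allowed` is true, matching the description above. `other_bot_allowed` is also true, because the rules name only MLBot and do not affect other crawlers.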
Is MLBot available for licensing?
No. We're 100% focused on Podly and Buzz Cruncher and doing anything else would take our attention away from getting these products done.
2010/02/05 - Podly has been delayed until April. Our testers told us what we needed (if not wanted) to hear. The product is usable and functional, but the interface design is earning less-than-rave reviews like "boring", "feels too much like work", and "should be more like TV." We're grateful for your honest feedback. There is no quick fix for the current design, so we're scrapping it and starting over (just the interface design, not the entire product). It will set us back a few months. We're very encouraged by what we see in our first prototype of the next interface and we hope you'll agree it's worth the delay.
2009/11/19 - Socks the Cat has been relieved of all responsibility for carrier pigeon communications effective immediately. In hindsight, it should have been obvious that putting a cat in charge of carrier pigeons would eventually lead to disaster. We regret the mistake and offer our deepest apologies to Acme Carrier Pigeon Corp.
2009/09/21 - Fixed a bug where, under certain conditions on Apache web servers delivering an HTML directory listing, we would append an incorrect query string to the URL in our HTTP request. A recent crawler update attempted to minimize crawling of directory listings by not following the "Name", "Last Modified", "Size", and "Description" column links, which simply return the same directory listing in a different order. The crawler correctly removed these fields from the URL, but it then appended any query string left over from a previously crawled URL. The resulting URL was bizarre but still functional, as Apache ignores the query string. It's fixed and we appreciate the bug report!
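For readers wondering what those column links look like: Apache 2.x directory listings sort by appending a query string such as `?C=N;O=D` (column and order) to the directory URL. The sketch below shows one way a crawler could drop such sort queries while leaving other query strings alone. It is an illustration under that assumption, not MLBot's actual code, and the function name `strip_sort_query` is hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed Apache 2.x sort-link format: C is the column
# (N=Name, M=Last Modified, S=Size, D=Description), O is the order (A or D).
SORT_QUERY = re.compile(r"^C=[NMSD];O=[AD]$")


def strip_sort_query(url):
    """Remove a directory-listing sort query; leave every other URL unchanged."""
    parts = urlsplit(url)
    if parts.path.endswith("/") and SORT_QUERY.match(parts.query):
        # Replace only the query component so nothing stale can be appended.
        parts = parts._replace(query="")
    return urlunsplit(parts)
```

For example, `strip_sort_query("http://example.com/media/?C=M;O=A")` returns `"http://example.com/media/"`, while a URL such as `"http://example.com/media/?page=2"` comes back unchanged.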
Some of life's biggest questions have no answer but if you're in Austin drop by and we'll show you the best for both!
I have a question your FAQ doesn't answer.
We'd love to hear from you! Feel free to write to us here: