Welcome to the MLBot FAQ (Frequently Asked Questions) page!
If the FAQ doesn't answer your questions, feel free to email us at:
MLBot is a focused web crawler. Unlike traditional crawlers such as GoogleBot (Google), MSNBot (Microsoft) and Slurp (Yahoo) that crawl for breadth, a focused crawler is designed to be highly selective. Focused crawlers prioritize the links they crawl based on the probability the links will lead to content that interests the crawler. Focused crawlers are more complex to create but in exchange they make better use of bandwidth and server resources.
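As an illustration of that prioritization, here is a minimal sketch of a crawl frontier that always hands back the most promising link first. It is not MLBot's implementation: the `Frontier` class, the example URLs and the relevance scores are all hypothetical, and a real focused crawler would compute the scores with a trained classifier.

```python
import heapq
from itertools import count


class Frontier:
    """Hypothetical crawl frontier: URLs come out in order of relevance score."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker so equal scores pop in insertion order

    def add(self, url, score):
        """Queue a URL with an estimated probability that it leads to relevant content."""
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best link first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop(self):
        """Return the (url, score) pair with the highest score."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
# Scores below are made up for the example.
frontier.add("http://example.com/videos/new-clip", 0.9)
frontier.add("http://example.com/about", 0.1)
frontier.add("http://example.com/podcasts/feed", 0.7)

crawl_order = []
while frontier:
    url, score = frontier.pop()
    crawl_order.append(url)
```

A breadth-first crawler would visit these pages in the order they were discovered; the focused frontier visits the low-scoring "about" page last, or never if the crawl budget runs out first.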
Interested in learning more about focused web crawling? The following resources are a great starting point.
An early paper on focused crawling. Authors: Soumen Chakrabarti, Martin van den Berg & Byron Dom:
Focused crawling: a new approach to topic-specific Web resource discovery
The Nalanda iVia Focused Crawler is an open-source focused crawler created by Soumen Chakrabarti:
The Nalanda iVia Focused Crawler
The Combine open-source focused crawler:
Combine, main page
A brief introduction to focused crawling. Authors: Ah Chung Tsoi, Daniele Forsali, Marco Gori, Markus Hagenbuchner & Franco Scarselli:
A Simple Focused Crawler
A short description of BINGO! Authors: Sergej Sizov, Stefan Siersdorfer, Martin Theobald & Gerhard Weikum:
The BINGO! Focused Crawler: From Bookmarks to Archetypes
A focused crawler designed to find web pages with content similar to a target page. Authors: Mohsen Jamali, Hassan Sayyadi, Babak Bagheri Hariri & Hassan Abolhassani:
A Method for Focused Crawling Using Combination of Link Structure and Content Similarity (PDF)
A focused crawler tailored to social media sites. Authors: Zhiyong Zhang and Olfa Nasraoui:
Profile-Based Focused Crawling for Social Media-Sharing Websites
Ignacio Dorado's master's thesis on focused crawling:
Focused Crawling: algorithm survey and new approaches with a manual analysis (PDF)
Using a focused crawler to locate missing documents. Authors: Ziming Zhuang, Rohit Wagle & C. Lee Giles:
What’s There and What’s Not? Focused Crawling for Missing Documents in Digital Libraries (PDF)
No list of citations would be complete without Google's seminal paper. Authors: Sergey Brin & Lawrence Page:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
MLBot looks for links to videos from sources like YouTube, CNN, Vimeo and ESPN as well as mp3 audio files. It directs the crawl to favor newer media over older media.
"Imagine a day when you would be in total control of creating your own TV channel lineup.
Instead of subscribing to a service from a cable, satellite or phone company that might offer you hundreds of channels you'll never watch, you would be able to select what you want and watch it on your own schedule."
That describes Podly pretty well. We believe the future of TV is on the internet and when it makes that transition it should become something much more than just a copy of how we use TV today. We can abandon the old model of network broadcasting that offers limited choices and embrace what the internet has to offer. Podly is our dream of how to make TV better and more useful to everyone.
You should have your own channels customized to your interests. Would you like a channel about oil painting? Bicycle touring? How about cute puppies? Want something else? Just make a channel on any topic and Podly will fill it with interesting videos.
You shouldn't have to buy a digital video recorder to get control of your own schedule. We think that's crazy. The internet doesn't care about time slots. Live events like music concerts and sports will always be the exception but otherwise you should be able to watch at your own convenience.
While Podly is a general internet TV service, debsnews is focused purely on delivering up-to-the-minute news. It has fewer features than Podly but it has a simpler design and getting to news is faster.
MLBot crawls from the following IP addresses:
There are evil crawlers out there that disguise themselves as good crawlers. Blocking them in robots.txt doesn't work because they simply ignore it. The only way to stop them is to ban their IP address.
These are IP addresses of malevolent crawlers that have spoofed MLBot's user agent string. If you see a crawler with the MLBot user agent that doesn't come from one of the IP addresses listed under "What IP addresses does MLBot crawl from?", please report it to us and we'll add it to the Wall of Shame.
We're four guys and a cat! David Stafford, Jim Mischel, Joe Langeway, Ron Murray and Socks (he's the cat).
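If you want to automate that check against your own logs, something along these lines may help. It is only a sketch: the addresses in `KNOWN_MLBOT_IPS` are placeholders from the documentation-only TEST-NET range, not MLBot's real addresses, and the function name is made up for the example. Substitute the addresses published on this page.

```python
import ipaddress

# PLACEHOLDERS ONLY: documentation addresses (RFC 5737), not MLBot's real IPs.
KNOWN_MLBOT_IPS = {
    ipaddress.ip_address("192.0.2.10"),
    ipaddress.ip_address("192.0.2.11"),
}


def classify_request(user_agent, remote_ip):
    """Return 'not-mlbot', 'genuine' or 'spoofed' for one logged request."""
    if "mlbot" not in user_agent.lower():
        return "not-mlbot"
    if ipaddress.ip_address(remote_ip) in KNOWN_MLBOT_IPS:
        return "genuine"
    # Claims to be MLBot but comes from an address that is not on the list.
    return "spoofed"
```

A request classified as "spoofed" is a candidate for an IP ban and for a report to us.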
Metadata Labs is a small startup in Austin, Texas and our dream is to organize all the world's media and make it a lot more useful. Media on the internet is a bit of a mess today. It's disorganized and there's a painful lack of standards. We're working to clean it up and make it an easier and more enjoyable experience. Any media, any device, anywhere, anytime! We hope you'll like what we're building!
Ack! We try hard to make MLBot reliable and we definitely want to hear about anything that's causing you a problem. Please check the Latest news section for updates on recent bug fixes.
The most common question we get is, "I'm seeing multiple HTTP requests for the same mp3 files in my logs. Why are you downloading them multiple times?"
We respect your bandwidth and we do not download entire media files from your server. We access only the small file segments that are likely to contain metadata (at the beginning and end of the file). Each separate HTTP request will appear in your server log. We're trying hard to keep bandwidth usage to a minimum, and it's better to make multiple small accesses than one big download.
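One common way to fetch only the start and end of a file is the HTTP `Range` header. The sketch below shows what such requests could look like; whether MLBot works exactly this way is an assumption, and the 16 KB segment size, the function names and the example URL are invented for illustration. The code only builds the requests and does not send them.

```python
import urllib.request


def metadata_ranges(segment_bytes=16384):
    """Return Range header values for the first and last segment of a file."""
    head = f"bytes=0-{segment_bytes - 1}"  # first N bytes
    tail = f"bytes=-{segment_bytes}"       # suffix range: last N bytes
    return [head, tail]


def build_requests(url, segment_bytes=16384):
    """Build (but do not send) one partial-content request per segment."""
    return [
        urllib.request.Request(url, headers={"Range": value})
        for value in metadata_ranges(segment_bytes)
    ]


requests_for_file = build_requests("http://example.com/episode.mp3", 1024)
range_headers = [req.get_header("Range") for req in requests_for_file]
```

Two small requests like these show up as two log entries for the same mp3, which is the pattern described in the question above.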
If you notice any unusual activity from MLBot you can reach us at:
We want to hear from you and we will respond promptly. Your feedback has helped us to improve MLBot.
If you have a problem or concern about MLBot, we would much prefer the chance to address it, but if you need to block MLBot, we do respect the robots.txt exclusion standard. To block MLBot from some parts of your web site you can use the following example:
In this example, /upload_dir/ and /draft_podcasts/ are directories that will be blocked to MLBot and won't be crawled. Other parts of your web site will still be crawled.
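A robots.txt matching that description can be checked with Python's standard `urllib.robotparser` before you deploy it. The rules in `ROBOTS_TXT` below are a reconstruction from the description above, assuming the user agent token is `MLBot`; the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the description above; assumes the token "MLBot".
ROBOTS_TXT = """\
User-agent: MLBot
Disallow: /upload_dir/
Disallow: /draft_podcasts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Paths under the two listed directories are off limits to MLBot...
upload_ok = parser.can_fetch("MLBot", "http://example.com/upload_dir/show.mp3")
draft_ok = parser.can_fetch("MLBot", "http://example.com/draft_podcasts/ep1.mp3")
# ...while the rest of the site stays crawlable.
episodes_ok = parser.can_fetch("MLBot", "http://example.com/episodes/show.mp3")
```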
To block MLBot from your entire web site you can use this:
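For a site-wide block, the rule is a single `Disallow: /` under the MLBot user agent. The sketch below, again assuming the token `MLBot` and using placeholder URLs, confirms that such a file shuts out MLBot while leaving other crawlers unaffected.

```python
from urllib.robotparser import RobotFileParser

# Assumed user agent token "MLBot"; "Disallow: /" covers every path.
BLOCK_ALL_TXT = """\
User-agent: MLBot
Disallow: /
"""

block_parser = RobotFileParser()
block_parser.parse(BLOCK_ALL_TXT.splitlines())

mlbot_allowed = block_parser.can_fetch("MLBot", "http://example.com/any/page.html")
other_allowed = block_parser.can_fetch("OtherBot", "http://example.com/any/page.html")
```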
Please note that our web crawler caches robots.txt files, so it can take 24 hours for any changes you make to take effect.
More information on robots.txt can be found at http://www.robotstxt.org
You can contact us with licensing questions here:
2010/12/03 - We rolled out an update to the crawlers today. Details here: Improving our handling of robots.txt
2010/12/01 - Added another IP address to our Wall of Shame. 22.214.171.124 appears to be based in Roubaix, France. Shame on you. Thanks for the report, Scott.
2010/10/19 - Warning: Other crawlers have been caught spoofing our user agent of "MLBot". Blocking them in robots.txt doesn't work (these are bad guys and they just ignore it). The only thing you can do to stop them is to block their IP address. If you see a crawler with the MLBot user agent that doesn't come from one of the IP addresses listed under "What IP addresses does MLBot crawl from?", please report it to us. If it isn't MLBot, we'll add it to our Wall of Shame.
2009/11/19 - Socks the Cat has been relieved of all responsibility for carrier pigeon communications effective immediately. In hindsight, it should have been obvious that putting a cat in charge of carrier pigeons would eventually lead to disaster. We regret the mistake and offer our apologies to Acme Carrier Pigeon Corp.
2009/09/21 - Fixed a bug where, under certain conditions on Apache web servers delivering an HTML directory listing, we would append an incorrect query string to the URL in our HTTP request. A recent crawler update attempted to minimize crawling of directory listings by not following the "Name", "Last Modified", "Size", and "Description" column links, which simply return the same directory listing in a different order. The crawler correctly removed these fields from the URL, but any query string left over from a previously crawled URL would be appended. The resulting URL was bizarre but still functional, as Apache ignores the query string. It's fixed, and we appreciate the bug report!
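For the curious, the idea behind the fix can be sketched as follows. This is not the crawler's actual code: it assumes the Apache 2.x convention in which the column header links carry queries such as `?C=N;O=D` (column and sort order), and the function name and URLs are hypothetical.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumed Apache 2.x autoindex sort queries: C = column (Name, Modified,
# Size, Description) and O = order (Ascending, Descending).
SORT_QUERY = re.compile(r"^C=[NMSD];O=[AD]$")


def resolve_listing_link(base_url, href):
    """Resolve href against base_url, collapsing sort links to the plain listing."""
    absolute = urljoin(base_url, href)
    parts = urlsplit(absolute)
    if SORT_QUERY.match(parts.query):
        # Rebuild the URL from its parsed parts with an empty query, so no
        # query string from a previously crawled URL can be carried along.
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return absolute


sort_link = resolve_listing_link("http://example.com/media/?old=1", "?C=N;O=D")
file_link = resolve_listing_link("http://example.com/media/", "song.mp3?id=3")
```

All four sort links collapse to the same listing URL, so the directory is fetched once, while ordinary links keep their own query strings.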
Some of life's biggest questions have no answer but if you're in Austin drop by and we'll show you the best for both!
We'd love to hear from you! Feel free to write to us here: