Welcome to the MLBot FAQ (Frequently Asked Questions) page!

If the FAQ doesn't answer your questions, feel free to email us at:





FAQ Contents

  1. What is MLBot?

  2. What is MLBot looking for?

  3. What are you doing with the data you collect?

  4. What IP addresses does MLBot crawl from?

  5. The Wall of Shame

  6. Who are you guys?

  7. I found a bug in MLBot!

  8. How do I block MLBot?

  9. Is MLBot available for licensing?

  10. Latest news

  11. BBQ or Tex-Mex?

  12. I have a question your FAQ doesn't answer.



  1. What is MLBot?

    MLBot is a focused web crawler. Unlike traditional crawlers such as GoogleBot (Google), MSNBot (Microsoft) and Slurp (Yahoo), which crawl for breadth, a focused crawler is designed to be highly selective. Focused crawlers prioritize the links they crawl based on the estimated probability that each link will lead to content that interests the crawler. Focused crawlers are more complex to create, but in exchange they make better use of bandwidth and server resources.
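
    To make the idea of link prioritization concrete, here is a minimal sketch of a crawl frontier that always hands back the most promising link first. It is not MLBot's actual code: the Frontier class, the score_link function and its keyword list are all hypothetical, chosen only to illustrate the technique.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs; the highest-scoring URL is crawled first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: keeps insertion order

    def add(self, url, score):
        """Queue a URL once; repeated URLs are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        """Return the queued URL with the highest score."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def score_link(url, anchor_text, keywords=("video", "mp3", "podcast")):
    """Hypothetical relevance score: fraction of keywords found in the link."""
    text = (url + " " + anchor_text).lower()
    hits = sum(1 for keyword in keywords if keyword in text)
    return hits / len(keywords)
```

    A breadth-first crawler would use a plain first-in, first-out queue here; the only change a focused crawler needs at this level is to order the queue by a relevance estimate.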

    Interested in learning more about focused web crawling? The following resources are a great starting point.

  2. What is MLBot looking for?

    MLBot looks for links to videos from sources like YouTube, CNN, Vimeo and ESPN, as well as links to MP3 audio files. It steers the crawl to favor newer media over older media.

  3. What are you doing with the data you collect?

    Glad you asked! We are creating two products: Podly, an internet TV service, and debsnews, for up-to-the-minute video news.

    "Imagine a day when you would be in total control of creating your own TV channel lineup.

    Instead of subscribing to a service from a cable, satellite or phone company that might offer you hundreds of channels you'll never watch, you would be able to select what you want and watch it on your own schedule."

    -- CNET News - The Internet and the future of TV

    That describes Podly pretty well. We believe the future of TV is on the internet and when it makes that transition it should become something much more than just a copy of how we use TV today. We can abandon the old model of network broadcasting that offers limited choices and embrace what the internet has to offer. Podly is our dream of how to make TV better and more useful to everyone.

    You should have your own channels customized to your interests. Would you like a channel about oil painting? Bicycle touring? How about cute puppies? Want something else? Just make a channel on any topic and Podly will fill it with interesting videos.

    You shouldn't have to buy a digital video recorder to get control of your own schedule. We think that's crazy. The internet doesn't care about time slots. Live events like music concerts and sports will always be the exception but otherwise you should be able to watch at your own convenience.

    While Podly is a general internet TV service, debsnews is focused purely on delivering up-to-the-minute news. It has fewer features than Podly, but its design is simpler and it gets you to the news faster.

  4. What IP addresses does MLBot crawl from?

    MLBot crawls from the following IP addresses:

    66.219.58.34
    66.219.58.35
    66.219.58.36
    66.219.58.37
    66.219.58.38
    66.219.58.39
    66.219.58.40
    66.219.58.41
    66.219.58.42
    66.219.58.43
    66.219.58.44
    66.219.58.45

    71.41.201.34
    71.41.201.35
    71.41.201.36
    71.41.201.37
    71.41.201.38

  5. The Wall of Shame

    There are evil crawlers out there that disguise themselves as good crawlers. Blocking them in robots.txt doesn't work (they just ignore it). The only way to stop them is to ban their IP addresses.

    These are the IP addresses of malevolent crawlers that have spoofed MLBot's user agent string. If you see a crawler with the MLBot user agent that doesn't come from one of the IP addresses listed in section 4, "What IP addresses does MLBot crawl from?", please report it to us and we'll add it to the Wall of Shame.

    66.96.219.133
    69.175.22.106
    74.86.66.201
    94.23.58.72
    210.71.167.35
    216.18.196.2
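
    If you want to check your own logs for impostors, a minimal sketch along these lines may help. It assumes the crawler's user agent string contains "MLBot" and uses the addresses listed in section 4; the function name classify_request and its return labels are hypothetical.

```python
import ipaddress

# Addresses listed in "What IP addresses does MLBot crawl from?":
# 66.219.58.34 through 66.219.58.45 and 71.41.201.34 through 71.41.201.38.
MLBOT_IPS = {
    ipaddress.ip_address(f"66.219.58.{n}") for n in range(34, 46)
} | {
    ipaddress.ip_address(f"71.41.201.{n}") for n in range(34, 39)
}


def classify_request(remote_addr, user_agent):
    """Return "genuine", "spoofed" or "other" for one logged request.

    Assumption: any user agent containing "MLBot" is claiming to be MLBot.
    """
    if "mlbot" not in user_agent.lower():
        return "other"
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # An unparseable address cannot be one of the published ones.
        return "spoofed"
    return "genuine" if addr in MLBOT_IPS else "spoofed"
```

    For example, classify_request("66.219.58.40", "MLBot") yields "genuine", while the same user agent from 94.23.58.72 yields "spoofed".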

  6. Who are you guys?

    We're four guys and a cat! David Stafford, Jim Mischel, Joe Langeway, Ron Murray and Socks (he's the cat).

    Metadata Labs is a small startup in Austin, Texas and our dream is to organize all the world's media and make it a lot more useful. Media on the internet is a bit of a mess today. It's disorganized and there's a painful lack of standards. We're working to clean it up and make it an easier and more enjoyable experience. Any media, any device, anywhere, anytime! We hope you'll like what we're building!

  7. I found a bug in MLBot!

    Ack! We try hard to make MLBot reliable and we definitely want to hear about anything that's causing you a problem. Please check the Latest news section for updates on recent bug fixes.

    The most common question we get is, "I'm seeing multiple HTTP requests for the same mp3 files in my logs. Why are you downloading them multiple times?"

    We respect your bandwidth and we do not download entire media files from your server. We access only small file segments that are likely to contain metadata (at the beginning and end of the file). Each separate HTTP request will appear in your server log. We're trying hard to keep bandwidth usage to a minimum, and it's better to make multiple small accesses than one big download.
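
    One common way to fetch only the start and end of a file is the standard HTTP Range request header. The sketch below shows what such a pair of requests could look like; it is an illustration under that assumption, not a description of MLBot's internals. The function names and the 4096-byte segment sizes are hypothetical.

```python
def metadata_range_headers(head_bytes=4096, tail_bytes=4096):
    """Build Range headers for the first and last segments of a file.

    Assumption: segment sizes are illustrative, not MLBot's real values.
    """
    if head_bytes <= 0 or tail_bytes <= 0:
        raise ValueError("segment sizes must be positive")
    return [
        {"Range": f"bytes=0-{head_bytes - 1}"},  # first head_bytes bytes
        {"Range": f"bytes=-{tail_bytes}"},       # last tail_bytes bytes
    ]


def bytes_requested(file_size, head_bytes=4096, tail_bytes=4096):
    """Upper bound on bytes transferred by the two range requests."""
    return min(head_bytes, file_size) + min(tail_bytes, file_size)
```

    Under these assumed sizes, a 5,000,000-byte MP3 costs two log entries and at most 8,192 bytes of transfer instead of the whole file. In MP3 files, for instance, ID3v2 tags sit at the start of the file and the ID3v1 tag occupies the last 128 bytes.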

    If you notice any unusual activity from MLBot you can reach us at:

    We want to hear from you and we will respond promptly. Your feedback has helped us to improve MLBot.

  8. How do I block MLBot?

    If you have a problem with or a concern about MLBot, we would much prefer the chance to address it. If you still need to block MLBot, we do respect the robots.txt exclusion standard. To block MLBot from some parts of your web site, you can use the following example:

    User-agent: MLBot
    Disallow: /upload_dir/
    Disallow: /draft_podcasts/

    In this example, /upload_dir/ and /draft_podcasts/ are directories that will be blocked to MLBot and won't be crawled. Other parts of your web site will still be crawled.

    To block MLBot from your entire web site you can use this:

    User-agent: MLBot
    Disallow: /

    Please note that our web crawler caches robots.txt files, so it can take 24 hours before any changes you make take effect.

    More information on robots.txt can be found at http://www.robotstxt.org
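
    Before publishing new rules, you may want to confirm that they say what you intend. The sketch below uses Python's standard urllib.robotparser module to test a URL against a robots.txt body; the helper name mlbot_allowed and the example.com URLs are hypothetical, and the result reflects Python's parser rather than MLBot's own.

```python
from urllib.robotparser import RobotFileParser


def mlbot_allowed(robots_txt, url, user_agent="MLBot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


PARTIAL_BLOCK = """\
User-agent: MLBot
Disallow: /upload_dir/
Disallow: /draft_podcasts/
"""

FULL_BLOCK = """\
User-agent: MLBot
Disallow: /
"""
```

    With PARTIAL_BLOCK, a URL under /upload_dir/ is refused while the rest of the site stays crawlable; with FULL_BLOCK, every URL is refused for MLBot but other user agents are unaffected.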

  9. Is MLBot available for licensing?

    You can contact us with licensing questions here:

  10. Latest news

    • 2010/12/03 - We rolled out an update to the crawlers today. Details here: Improving our handling of robots.txt

    • 2010/12/01 - Added another IP address to our Wall of Shame. 94.23.58.72 appears to be based in Roubaix, France. Shame on you. Thanks for the report, Scott.

    • 2010/10/19 - Warning: Other crawlers have been caught spoofing our user agent of "MLBot". Blocking them in robots.txt doesn't work (these are bad guys and they just ignore it). The only thing you can do to stop them is to block their IP addresses. If you see a crawler with the MLBot user agent that doesn't come from one of the IP addresses listed in section 4, "What IP addresses does MLBot crawl from?", please report it to us. If it isn't MLBot, we'll add it to our Wall of Shame.

    • 2009/11/19 - Socks the Cat has been relieved of all responsibility for carrier pigeon communications effective immediately. In hindsight, it should have been obvious that putting a cat in charge of carrier pigeons would eventually lead to disaster. We regret the mistake and offer our apologies to Acme Carrier Pigeon Corp.

    • 2009/09/21 - Fixed a bug where, under certain conditions on Apache web servers delivering an HTML directory listing, we would append an incorrect query string to the URL in our HTTP request. A recent crawler update attempted to minimize crawling of directory listings by not following the "Name", "Last Modified", "Size", and "Description" column links, which simply return the same directory listing in a different order. The crawler correctly removed these sort parameters from the URL, but it then appended any query string left over from a previously crawled URL. The resulting URL was bizarre but still functional, as the query string is ignored by Apache. It's fixed and we appreciate the bug report!
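
    For readers curious about the directory-listing case, here is a minimal sketch of how a crawler might recognize and drop those column-sort links. It assumes the Apache 2.x mod_autoindex link format, such as "?C=N;O=D", and is not MLBot's actual code; the function names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumption: sort links carry a query of the form C=<column>;O=<order>,
# where the column is N (name), M (last modified), S (size) or
# D (description) and the order is A (ascending) or D (descending).
_SORT_QUERY = re.compile(r"^C=[NMSD];O=[AD]$")


def is_sort_link(url):
    """Return True if the URL only re-sorts an Apache directory listing."""
    return bool(_SORT_QUERY.match(urlsplit(url).query))


def canonical_listing_url(url):
    """Drop the sort query so every ordering maps to one listing URL."""
    if not is_sort_link(url):
        return url
    # Build the result from this URL alone, so no stale query string
    # from an earlier URL can leak in.
    return urlsplit(url)._replace(query="").geturl()
```

    Deriving the cleaned URL from the parsed parts of the current link, rather than from shared state, is one simple way to avoid the kind of leftover query string described above.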

  11. BBQ or Tex-Mex?

    Some of life's biggest questions have no answer, but if you're in Austin, drop by and we'll show you the best of both!

    Mesa Rosa - Austin's finest Tex-Mex
    Rudy's - Real Texas BBQ

  12. I have a question your FAQ doesn't answer.

    We'd love to hear from you! Feel free to write to us here: