Uncle, Mercy, Whatever: Please, Just Kill the Dupes

If I See This One More Time…

Originally uploaded by sogrady.

As I explained it to one audience yesterday, I consider services like PubSub and Technorati’s Watchlists to be the rough blogging equivalent of Google’s News Alerts. The former scan blogs, the latter scans news, but the principle is the same: they watch a channel for mentions of a keyword or link.

In concept, it’s a terrific idea. In execution, it’s mostly still a terrific idea. But lately, it’s less terrific. Much less.

PubSub, of the available services, has been for me the least prone to duplicate mentions – i.e. false-positive returns that come back again and again. But since the Adobe/Macromedia news, PubSub has taken to showing me the pictured Industry Standard story featuring a quote from my esteemed colleague several times per day, often with multiple mentions in the same return. Each and every day, I’m treated to that same article somewhere between 5 and 10 times. It’s driving me crazy, to the point that I’m considering dropping that particular watchlist.

So while I don’t know if Tim is right, and Atom’s the answer to this particular problem, I do know that it’s a problem that needs to be solved. Because I’m certainly not the only one with such a problem. Please guys, kill the dupes.


  1. When I subscribe to a feed in Bloglines, I start with the "Display updated entries" option. If the feed passes some dupe threshold of annoyance, I change that to "ignore updated entries." This seems to take care of 90% of this problem for me.

  2. i can't believe i never thought of that; i'm an idiot. either way, bless you sir.

  3. Of course, a few hours after I left that comment I got four bajillion dupes in several feeds that I've set to "ignore", so it's not the be-all, end-all solution by any means.

  4. the worst thing – i hate the quote in that article!!! i mean coldfusion is one thing i mentioned. now like ten times a day the story comes leaping out at me and i think why didn't i say something more insightful….

    and you begin to imagine that *everyone* keeps seeing the same story over and over, even though it's just hitting me because of my vanity feed…

  5. might this be something to do with the industry standard's feed? normally technorati does this, not pubsub so much

  6. The problem seems to be the way that ads are being inserted into the entry you are getting multiple copies of. We're working to see what we can do to work around this example of really bad ad insertion practice.

    It appears that the Industry Standard is serving ads from Doubleclick and the method they are using is simply wrong. What is happening is that the ad links are changing on a regular basis and thus our "duplicate detection" code thinks that the entry has changed. This will impact any post from an Industry Standard feed with ads in it. The fact that you're using PubSub is not relevant here. Just about any aggregator will get confused by posts that have changing content like this.

    Because of weaknesses in the RSS specification that are being addressed in the new Atom format, aggregators are currently forced to do textual analysis in order to detect changes to RSS posts. We can't simply rely on things like the optional RSS GUID when checking for duplicates. In the future, Atom will give us required unique IDs for entries and that will make it easier to detect duplicates… But, for now, we're stuck with a lot of old-style RSS feeds… Given this, it is vitally important that when someone inserts ads in RSS items, the text of the ad links *must not change* once the ad is inserted. The only alternative would be for us to build special code that recognizes the ads being inserted by each of the many advertisers and handles them specially. Personally, I don't think that is reasonable. However, if we only have a few "bad apples" (like DoubleClick), it might be workable for the short term until people start moving to Atom.

    I'll write more on this on my blog later today. See: http://bobwyman.pubsub.com/

    bob wyman
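
The duplicate-detection problem described in the last comment can be sketched in a few lines of Python. This is an illustration only, not PubSub's actual code: the function names `entry_key` and `is_duplicate`, the dict-shaped entries, and the link-stripping regex are all assumptions made for the example. The sketch prefers a GUID when the feed supplies one and otherwise falls back to hashing the entry text, which is the point where a rotating ad link makes an unchanged story look new.

```python
import hashlib
import re

# Hypothetical pattern: treats every <a ...>...</a> element as possibly
# volatile ad markup. A real aggregator would need something more targeted.
LINK_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)


def entry_key(entry, strip_links=False):
    """Return a string that identifies an entry for duplicate detection.

    `entry` is assumed to be a dict with optional "guid", "title" and
    "description" keys (an assumption for this sketch, not a real API).
    """
    guid = entry.get("guid")
    if guid:
        # The easy case: the feed gave the item an ID. In RSS this is optional.
        return "guid:" + guid

    # Fallback: textual analysis. Hash the title plus the body text.
    text = entry.get("title", "") + "\n" + entry.get("description", "")
    if strip_links:
        # Drop anchor elements so a changing ad URL does not alter the hash.
        text = LINK_RE.sub("", text)
    text = " ".join(text.split())  # collapse runs of whitespace
    return "sha1:" + hashlib.sha1(text.encode("utf-8")).hexdigest()


def is_duplicate(entry, seen, strip_links=False):
    """Return True if the entry's key is already in `seen`; else record it."""
    key = entry_key(entry, strip_links)
    if key in seen:
        return True
    seen.add(key)
    return False
```

Given two copies of a story that differ only in an ad URL, the plain text hash treats the second copy as a new post, while the link-stripping variant catches it. Stripping every link is a blunt instrument and is swapped in here purely for brevity; it stands in for the per-advertiser special-casing that the comment above considers unreasonable beyond a few bad apples.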

