Wednesday, September 02, 2015

Is Internet news inevitably concentrated and redundant?

I applied to register my blog on the Internet in Cuba with Google News, but was immediately (automatically?) rejected. My blog had 45,370 views last month and a friend told me a site had to have at least 150,000 views per month to be considered for Google News.

I don't know if that is the case, but I frequently post news before it turns up in Google News or as a Google Alert. (The two systems are separate.)

Here's an example:

On August 27th, at 2 AM PDT, Bloomberg News published an article by Indira Lakshmanan entitled "Cuba's Internet Dilemma: How to Emerge From the Web's Stone Age."

The article included a photo, a graph and two ads for IBM. In spite of being short, it had six bold-face section headings. I did not post anything about the article because it offered no news and had a significant misconception.

Later that day, I received a Google Alert for an article with the same title on the SunHerald site. The SunHerald version had a different photo and had cut the graph and sub-heads, but the body text was identical to the Bloomberg version. The byline credited "INDIRA A.R. LAKSHMANAN of Bloomberg News." Notice that she now had two middle initials and her name was all caps.

There was another credit at the end of the article: "Brian Womack contributed from San Francisco."

Brian did not change the body text, but he dropped the section headings and graph, changed the photo and gave Indira two middle initials and capitalized her name. Did his "contribution" take more than ten minutes?

The SunHerald version was preceded by a full-screen ad that I had to click to remove and, when the article was displayed, it had ads for liposuction, a bail bond service, a microwave oven (I had purchased one from Amazon earlier in the week), a smiling, blond investment adviser and two ads for the SunHerald publisher.

Well, this got my curiosity up, so I picked a random sentence from the middle of the article: "Last month, the state telecom monopoly ETECSA created 35 broadband Wi-Fi hotspots across the island, where the public can surf the Web, as Hernandez does" and Googled it. It turned up 18 full-text copies of the post.

Searching on the first sentence of the article: "Julio Hernandez is a telecommunications engineer, but like almost anyone else in Cuba who wants to get on the Internet, to do so he must crouch on a dusty street corner with his laptop, inhaling car exhaust and enduring sweltering heat", turned up many more hits, but a lot of those were snippets with links to a full-text version.

I searched Google Alerts on the key phrase "Cuba Internet" and turned up links to six full-text copies of the story.
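Checking body text for copies like these is not hard to automate. Here is a minimal sketch in Python that compares two texts by their overlapping five-word sequences ("shingles") and reports the Jaccard similarity of the two sets. I have no idea whether Google does anything like this; the function names, the shingle size of five and the test sentences other than the Bloomberg one are my own assumptions for illustration.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word sequences in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(text_a, text_b, k=5):
    """Jaccard similarity of the two shingle sets: 0.0 (disjoint) to 1.0 (identical)."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# A sentence from the Bloomberg article.
original = ("Last month, the state telecom monopoly ETECSA created 35 broadband "
            "Wi-Fi hotspots across the island, where the public can surf the Web, "
            "as Hernandez does")

# Hypothetical copy: same words, different capitalization and punctuation.
copy = ("LAST MONTH the state telecom monopoly ETECSA created 35 broadband "
        "Wi-Fi hotspots across the island, where the public can surf the Web, "
        "as Hernandez does.")

# Hypothetical unrelated sentence.
unrelated = "SpaceX failed to land its booster on a drone barge at sea on January 17"

print(similarity(original, copy))       # identical word sequence
print(similarity(original, unrelated))  # no five-word sequence in common
```

Because the comparison looks only at the words, changing the photo, the sub-heads or the capitalization of the byline would not hide a copy. A real system would need a threshold below 1.0 to catch lightly edited versions, and picking that threshold is a judgment call I have not tried to make here.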

I sent email queries asking about the criteria for inclusion in Google News to press@google.com and Stacie Chan, Media Outreach Manager, Google News, but neither replied. I also emailed the Bloomberg press office asking if they had licensed the article to the SunHerald and others, but received no reply.

This experience leads me to think that:
  • Google's News and Alerts algorithms find stories on a topic, not necessarily news or novel analysis.
  • Google's News and Alerts algorithms fail to detect redundancy, which may be intentional because it increases their ad revenue. (Might they discriminate in favor of sites with ads?)
  • Google's algorithms seem to pay attention to sub-headings and images, but not body text, in screening for redundancy.
  • Snippet posts often link to derivative copies rather than the original post (by Bloomberg in this case).
  • Relatively small, focused, long-tail blogs and news sites are not likely to be seen by Google News or Alerts.
Is it inevitable that advertising-based, algorithm-driven Internet news will be redundant and increasingly concentrated in high-volume sites?

I hope not. Perhaps Google (or Facebook) will be clever enough to automate discovery of worthwhile long-tail news sites or use human curators to find them.

Disclosure: I have given permission for copies of blog posts to be posted by others (at no cost).

-----
Update 9/5/2015

As mentioned above, I sent email queries to Google when I got the idea for this post, but did not hear from them before I published it. This morning I got an email from Stacie Chan. Here is what she said:
There aren't any minimum number of clicks needed to get accepted as a News site into Google News. We do, however, have strict quality and technical guidelines that sites must follow to get accepted and maintain their status in Google News. We accept smaller blogs/sites as well as larger ones.

Google Alerts, as you mentioned, is a separate product from Google News. But many of their sheets are triggered by a new article from publishers on our database.

Hope that helps!
Stacie
(I corrected what appeared to be two cell-phone typos).

-----
Update 10/3/2015

I follow the keyword "ETECSA" on Google Alerts and Google recently alerted me to a post on Cuban plans for home Internet connectivity at Frogoff.com. (ETECSA is Cuba's state-run Internet service provider.)

The post was an identical copy -- text, title and images -- to my earlier post.

I don't really care that the slimeballs running Frogoff.com copied my post, but I do care that Google Alerts linked to their copy, not mine. Google will not tell me how they decide which sites to include in News and Alerts, but, whatever the method is, it rewards parasites like Frogoff.com and overlooks relatively small, specialized long-tail sites that cover a given topic (in this case, the state of the Cuban Internet).

-----
Update 10/12/2015

I received the following Google Alert yesterday:


ETECSA is Cuba's government monopoly telecommunication company. The message alerts me to the fact that ETECSA upgraded their cell phone network in 1999 -- not exactly "news."

The alert links to ETECSA's Wikipedia page. Evidently, Google sends an alert for any change in a page that contains the alert keywords. The alert was not caused by the cell network upgrade, but that sentence contained the term "ETECSA." In fact, the most recent change to the ETECSA page in Wikipedia was made on August 12, so the non-news alert is two months old.

-----
Update 11/25/2015

Here is another type of Google Alert failure:


This appears to be a link to a post on the BN Americas news site about a service outage at Cuban email provider ETECSA, but it is not. Instead, it is a link to a post on a spam site called WN.com.

As you see below, the WN page has one sentence on the ETECSA outage, hidden in a sea of spam links and illustrated by a couple dancing on the beach. Can you find the link to the BN Americas article?


Can't Google come up with an algorithm or black list to filter out this sort of "news"?
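To show how little a black list would take, here is a minimal sketch in Python that drops alert links whose host is on a list of blocked domains. The list contents, the function names and the example URLs are all hypothetical, made up for illustration; this is not a description of anything Google actually does.

```python
from urllib.parse import urlparse

# Hypothetical black list of domains to suppress in alerts.
BLOCKED_DOMAINS = {"wn.com"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_alerts(urls, blocked=BLOCKED_DOMAINS):
    """Return only the URLs whose hosts are not on the black list."""
    return [u for u in urls if not is_blocked(u, blocked)]


# Hypothetical alert links, for illustration only.
alerts = [
    "http://article.wn.com/view/example-etecsa-outage",
    "http://www.bnamericas.com/example-etecsa-outage",
]

print(filter_alerts(alerts))  # only the bnamericas.com link remains
```

Matching on the host name rather than on a substring of the URL avoids blocking unrelated sites whose names merely contain "wn.com". The hard part, of course, is deciding who goes on the list, not applying it.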

-----
Update 1/31/2016

I have a longstanding interest in satellite Internet connectivity for developing nations, so have been watching SpaceX's attempts at recovering booster rockets. They tried for the third time to recover a rocket on a drone barge at sea on January 17, but failed. I watched the launch webcast and posted a note on the failure the following morning.

Since I am interested in the topic, I have subscribed to "Musk-Satellite" Google Alerts, and at 10:08 Sunday morning, I received one on a Washington Post article entitled "Elon Musk's SpaceX to attempt another rocket landing. This time with a twist."

Judging by the title, the article had been written before it was known that the attempt had failed, but it now says the landing failed and includes several tweets time-stamped after 10:08 AM. Evidently, the story was revised after it was first posted.

But that was just the first Google Alert -- I have received 158 more since then, many of which link to two or three articles on the failed recapture. I've looked at several of these articles -- they provide no additional analysis.

They are slowing down now -- I have only received two so far today. By the time these Alerts stop, I will have received links to hundreds of redundant articles, and thousands of ad impressions.

-----
Update 3/12/2016

A blog copied everything I posted on my blog on the Internet in Cuba for years. Both blogs are on Google's Blogger site. Can't Google detect that sort of thing? Are they incented to do so? I guess they make money on pirated click-bait ads.