In December 2016, Ben Halpern and I got into a discussion about the future of publishing.
Ben Halpern - I feel the publishing industry has created a model where they try to extract maximum value out of their human bloggers and treat them like assembly-line workers. There's not a lot of value in "aggregating" stories. It's just publishers trying to eke out a slice of the pie. A lot of this work will certainly be automated in the future.
I feel like the quantity over quality model that Internet publishers have fallen into is flawed for the business and the consumer because there is so much replicated work and it does not take advantage of economies of scale very well. In the future, there may be fewer overall writing jobs, but they will leverage human creativity much more than the current landscape.
Before content-farming blog jobs go away entirely, they will trend toward editorial oversight of an increasingly automated process. That part will probably vanish eventually as well.
Tariq Ali - I agree completely that the current ecosystem is perverse and unsustainable. To add a bit more context to this comment, though: you would probably be very interested in the Content Shock hypothesis, which argues that users' demand for content will ultimately plateau (meaning that incumbents who are 'first to market' with content end up dominating, while new contenders trying to push out content get ignored). In November 2015, the author claimed that Content Shock had arrived. Surprisingly, the author believes that one way for marketers to handle "Content Shock" is to mass-produce content, thereby drowning out the competitors, and that this mass production will likely be done by computers.
If this content ecosystem does collapse, then the amount of text generation (human or automated) will probably decrease. It's an open question whether the humans who remain in that line of work decide to embrace automation (and use the content wreckage as a corpus) or to shun it.
The current "quantity over quality model" that exists within content generation exist primarly due to Google's algorithms favoring more content:
While one high-quality article might drive a thousand shares, 10 articles that drive 120 shares each is more. Replace shares with traffic or conversions. It’s the same concept. In this way, Google is actually encouraging us to commoditize our content in lieu of creating great content, whether it’s purposeful or not.
Or, to put it more plainly:
Winning in digital media now boils down to a simple equation: figure out a way to produce the most content at as low a cost as possible.
This does not sound sustainable, to me or to Ben. The only reason consumers haven't yet given up on the Internet is our reliance on various curators (human aggregators, search engine bots, machine learning recommendation systems, etc.) to tell us what content to consume. But what about content producers? What happens to them if they are unable to produce enough content that appeals to the curators? It seems that they have to shut down. And when they do, the content ecosystem will implode.
There is, of course, an alternative viewpoint that contrasts with both Ben's and mine, one that argues that the content ecosystem is sustainable, that Content Shock is no problem at all, and that, far from collapsing, the ecosystem is bound to explode and proliferate.
This viewpoint argues that major content publishers will learn how to take advantage of "economies of scale" by reusing existing works. Content publishers will also learn how to 'personalize' content to match the whims and desires of the people who are reading it, thereby keeping them engaged. The content ecosystem will adapt, and the process of adaptation would lead to a new 'golden age' of human creativity.
That viewpoint is best expressed by the short story "The Discovery Engine".
"The Discovery Engine, (or, Algorithms Everywhere and Nowhere)" is one of two short stories written by Professor Christopher Ketly in the peer-reviewed article Two Fables. This short story imagines a world where the "myth of the Romantic Author" died. As a result, the US government reformed intellectual property laws, allowing for major corporations to remix and reuse content in brand new ways.
[A]fter all the big media companies had figured out how to Spotify everything, there was not much reason to hemorrhage money into controlling media, when the game was to extract value in other ways: discovering, sorting, searching, repackaging, and repurposing. Copyright enforcement fell away; with everything open, or so multiply licensed it was impossible to track, it became clear that real, new growth was in repackaging and reselling information rather than investing in the creation of new information. New information had margins way too low to invest in anymore, even if novelty still drove the market more than ever. Something about the old argument that supporters of copyright terms used to make—that copyright was an incentive for people to create—no longer seemed to even make sense. Innovation certainly had not died, nor had the music.
Innovations included, but were not limited to:
- Repackaging public domain and 'orphan works' for modern-day audiences, through clever marketing techniques (Example: "Who in their right mind would sit through an entire Cassavetes movie when it was possible to watch it unfold as the crazy mashed-up backstory of an X-Men film?")
- Using English-to-English machine-translation algorithms to change the readability of texts to match a user's desires
- Personalizing manuscripts to cater to an individual's personal biases and beliefs (for example, renaming characters and locations; see the sketch after this list)...but not too much, because you still want people to be able to discuss the same reading experience(s) with each other
- Programmatically adding 'dark patterns' to books as people read them (such as cliffhangers, targeted marketing emails, and "literary versions of clickbait")...with the express purpose of encouraging people to keep reading
- Reusing existing assets to generate new versions of the same story to be resold at different times (similar to the concept of reusable components in programming)
- Increasing discoverability of existing literature so that people can wind up finding assets that can be reused
- ...or skipping the discovery process and using RNNs to just generate the stories outright
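To make the "personalization" item above concrete, here is a minimal sketch of what renaming characters and locations per reader might look like. Everything in it (the names, the sample sentence, the reader profile) is my own invention for illustration, not anything taken from the story.

```python
# Toy sketch of per-reader personalization: swap character and place names
# while leaving the plot untouched. All names here are invented examples.
import re

def personalize(manuscript: str, substitutions: dict[str, str]) -> str:
    """Replace whole-word occurrences of each original name with the
    reader-specific name. Deliberately conservative: only names change."""
    for original, replacement in substitutions.items():
        manuscript = re.sub(rf"\b{re.escape(original)}\b", replacement, manuscript)
    return manuscript

original_text = "Elena left Riverton at dawn, hoping the road to Marwick was clear."

# One reader's profile might map the cast onto more familiar-sounding names and places.
reader_profile = {"Elena": "Aisha", "Riverton": "Karachi", "Marwick": "Lahore"}

print(personalize(original_text, reader_profile))
# -> "Aisha left Karachi at dawn, hoping the road to Lahore was clear."
```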
Human creativity has been leveraged, and probably even expanded, in this dystopian future. Instead of tediously trying to come up with new words to express what has already been expressed better before, you can simply reuse what already exists and then tweak it to match your specifications. Creativity simply moves to a higher level of abstraction, similar to how programmers moved from low-level languages (like Assembly) to higher-level languages (like COBOL). 'DRYness' has been applied to literature.
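If you squint, "DRYness applied to literature" looks a lot like assembling a story out of pre-written, parameterized scenes instead of writing prose word by word. The toy sketch below is only meant to illustrate that shift in abstraction; the scenes and the cast are made up.

```python
# Toy sketch of "creativity at a higher level of abstraction": a story is
# assembled from reusable, parameterized scene components instead of being
# written word by word. Scenes and characters are invented for this example.

def meet_cute(a: str, b: str) -> str:
    return f"{a} and {b} reach for the same library book and neither lets go."

def falling_out(a: str, b: str) -> str:
    return f"A misread letter convinces {a} that {b} has been lying all along."

def reconciliation(a: str, b: str) -> str:
    return f"{a} finds {b} waiting at the station, the letter still unopened."

def compose(cast: tuple[str, str], beats) -> str:
    """'Writing' becomes composition: pick components, order them, bind the cast."""
    return " ".join(beat(*cast) for beat in beats)

print(compose(("June", "Omar"), [meet_cute, falling_out, reconciliation]))
```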
And there's no sign of technological unemployment. Instead, the content economy is itching to hire more people...
- "Cool hunters" find the works to be repackaged and repurposed
- "Lit geeks" identify current fads and trends in literature, enabling corporations to tailor their generated books to a mass audience
- Privacy violators spy on people and gather data about their reading habits
- Data scientists analyze the data and design the algorithms necessary to generate the stories
- And, of course, marketers develop the initial ad campaigns to acquire potential readers
However, writers are nowhere to be found in this golden age of human creativity except in an aside describing the creation of a new TV series ("The Game of Thrones reboot"):
Writers were employed by the system to check these texts, but not to write them, just to tweak and improve them, insert clever jokes and Easter eggs, to look for the giveaways that the algorithms could not spot.
I find "The Discovery Engine" to be (a) a pleasant read and (b) a plausible prediction for the fate of all content generation (both human-made and machine-made). I do quibble with two of the author's points though.
First, while RNNs certainly have their place as part of a "toolkit" for text generation, I highly doubt that they can generate coherent stories by themselves. I could imagine a world where RNNs generate hundreds of different stories and a secondary algorithm then filters through the resulting junk to find one or two stories that may be worthwhile. Even in this imagined scenario, though, I'd still see a fleet of data scientists behind the scenes, trying their best to massage the data and the algorithms to produce the best possible outcome.
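Here is a rough sketch of the generate-then-filter pipeline I have in mind. The "generator" below is a random stand-in for an RNN, and the coherence score is a toy heuristic I made up; a real system would plug in a trained language model and a far more serious filter (or a human reviewer).

```python
# Sketch of generate-then-filter: produce many candidate stories, score them,
# and keep only the handful that look coherent enough to show a human.
import random

VOCAB = ["the", "storm", "lighthouse", "keeper", "waited", "and", "fell",
         "silent", "sea", "returned", "alone", "light", "went", "out"]

def generate_candidate(length: int = 12) -> str:
    """Stand-in for an RNN sampler: emits a random word sequence."""
    return " ".join(random.choice(VOCAB) for _ in range(length))

def coherence_score(text: str) -> float:
    """Toy filter: penalize immediate word repeats, reward variety.
    A real filter would be a trained classifier or a human-in-the-loop pass."""
    words = text.split()
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return len(set(words)) / len(words) - repeats

def best_stories(n_candidates: int = 500, keep: int = 2) -> list[str]:
    candidates = (generate_candidate() for _ in range(n_candidates))
    return sorted(candidates, key=coherence_score, reverse=True)[:keep]

for story in best_stories():
    print(story)
```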
The second point I quibble with is more philosophical. One subplot in this sci-fi story involves society's reaction to multiple versions of the same public domain work being published by major corporations, leading to a reactionary (and doomed) impulse by a few scholars to recover and preserve the "original" works.
But in the real world, I don't think anybody would care. After all, the story of Cinderella has been rewritten countless times, with different versions personalized to different regions of the world. And we can sensibly discuss the different versions of Cinderella (for example, arguing whether the Brothers Grimm's version is superior to Walt Disney's version) without getting confused. Ideas have always been cheap; it's the execution of those ideas that matters.
The New Work departments in the publishers and movie studios were eventually dwarfed by the Repurposing departments; it became increasingly impossible to sell a book or script just because no one had ever written something like it before. It turned out, actually, that this was always false, and it was far easier and cheaper to update Trollope than to pay Franzen for his novel on the same topic.
Backlink: This article was originally published on May 28th, 2017 on my personal blog.