A Lot of Knowledge is a Dangerous Thing

Fighting Information Overload With AI Models

Cogitech
10 min read · Jun 24, 2023

A Deluge of Nothing in Particular

Navigating an endless ocean of information has become an everyday occurrence for any user of social media. We are bombarded with a volume of articles, posts, data and opinions that makes even robust AI systems shudder. With everything from the latest news from across the world, to your sister’s latest pet pictures, to recipes with truly dubious amounts of sugar in them, it can be hard to even begin separating the wheat from the chaff. Getting to any sort of meaningful grain can seem an insurmountable task. But our ability to keep a handle on the world’s progress nonetheless depends heavily on learning what we can from every stream of information we can get our screens on.

How readers have reacted to this issue has had its ups and downs. On the one hand, we have never been so informed about current events, never so on top of the latest buzz and never so well stocked with recipes as we are in today’s world of the information superhighway. But the costs have likewise been substantial. Since we cannot fully digest everything on offer, the offering itself has shrunk to fit our attention spans. We simply can’t afford the time to read an entire blog post on every single event. So articles are digested down to headlines, interviews down to soundbites, and your sister’s pet pictures down to a single selfie that she probably stole off an Instagram influencer’s how-to post on poses.

And all of that matters. The pressure to minimize content and maximize effect takes its toll on our social feeds regardless of who the author is. It is that systemic drive towards fitting a particular mold, to squeeze as much engagement out of viewers as possible, that leaves fewer and fewer original ideas and unique voices in our lives. Color and creativity are washed out to make way for business-proven strategies. Nuance and careful correctness are replaced by sensational headlines that distort truth and neglect fact. Instead of relying on our news feeds to inform us, we let them distract us for the fleeting moment it takes to collect our likes.

Readers cope with the ever-increasing deluge of information by narrowing the scope of what they consume. It became impossible to read every newspaper, so you just read one or two. It became impossible to follow all the important voices in a discussion, so you just follow one or two. As the breadth of incoming information widened, people’s attention narrowed down to the few sources they felt most inclined to follow. These turned out to be outlets and voices that tended to agree with them. It became a self-perpetuating cycle: more information was pumped out, readers narrowed their reading even further, and the discourse on social media ended up siloed into echo chambers of homogeneous opinions [1].

The danger with this style of media consumption is that it leaves central preconceptions and prior biases unchallenged. No matter what one’s ideology or beliefs are, a person is far less likely to be exposed to critical and thought-provoking media if their information input consists only of media that already supports the same conclusions. And therein lies the great tragedy of social media: in our quest to reach out for information, we are instead cut off from it, often not even aware that we’re being shown only one side of the coin [2].

A Way Out of Here

It is in these sad circumstances that we see many attempts to counter the rising sensationalism and bias take shape. A lot of research has gone into the exact mechanism that causes the ongoing degeneration of media content, but solutions remain elusive. One of the main issues with simply fixing the current state is that it arose organically, based on user preferences and how those fed into incentives for content creators. People want to be informed, so they demand more information, but they often overestimate their own ability to process the sheer volume on offer. So they run to sources that condense it. Loss aversion is one of our most natural human traits, and here it manifests as trying to hold onto too much, compromising the integrity of everything we consume in the process.

Trying to reverse such self-reinforcing organic trends is extraordinarily hard. After all, you’re trying to help the very people who are running in the opposite direction. The situation is aggravated by the fact that we don’t really have a singular solution ready that would tackle the issue from every angle. You’re restricted to making trade-offs between offering even more information to a reader, or cutting away the lower-quality information. Both of these are unappealing options: the former because it only worsens the information fatigue, the latter because of the same loss aversion that makes people hold on to all their media sources in the first place.

As a compromise, current efforts normally focus on solutions that add content metadata. Instead of trying to interfere in media consumption directly, extra information is included alongside the content, informing the reader about its quality. This is an inherently dangerous proposition, as you’re essentially forcing the values of the solution on the consumer. There is no objective standard of bias or sensationalism that you can apply in real time to every piece of media, so these solutions tend to be little more than an unknown third party telling you not to read something because it’s “bad”, for their particular, often opaque and always subjective, definition of “bad”.

In hindsight, it turned out that trying to force absolute labeling on particular sources of information doesn’t help media literacy in general. People who agree with the labels only double down on their choices, and people who disagree simply ignore them. This is the case with a lot of current solutions that try to label sources as “biased” or “political”; they end up reinforcing the very echo chambers they try to work against [3].

Enter the Dragon, Kind of

We decided to take it in a somewhat different direction. Instead of trying to work against the grain when it comes to people’s reading preferences, a helpful tool should seek to inform and expand the horizon. Instead of forcing a reader down a particular path, it should try to improve the road signs on the way and provide a better map of the information available out there. From that set of expectations, we created a system that lets us reach across the Internet and show relevant media that a user should be aware of alongside whatever content they are currently reading.

It was in this landscape that we decided on a set of principles to design our solution around. First of all, we had to be able to process and provide metadata about a massive amount of media. Narrowing our focus or only admitting certain articles would mean we weren’t actually tackling the real issue. After all, most people know that the New York Times and the Wall Street Journal are relatively reputable sources, whether or not you agree with their positions on everything. The real problem is in the long tail of half-known sources and fringe voices that should sometimes be elevated and sometimes be taken with plenty of grains of salt. All of these inputs are something we need to consider and offer up data to the reader about.

Further, we eschewed static labels. Whatever form they take, labels seem to be ineffective at providing the sort of information that readers need. So we looked for a suitable alternative: something we could apply to basically everything, while committing to basically nothing predefined. What we came up with was similarity scoring. Instead of treating every piece of media as a separate issue to consider, we treat the entire information landscape as a single field of data that needs to be processed and presented as a whole.

Similarity connections between documents, colors representing different publications they’re sourced from

To do that, we would end up using a whole lot of math.

Artificial Mathtelligence

Okay, that one was a bit of a stretch. All Artificial Intelligence and Machine Learning is basically just building statistical math models, so we’re not breaking new ground on that front. What we are doing is using a Natural Language Processing technique called Latent Semantic Indexing (LSI) to process our data, and in particular to compute our similarity scores. LSI applies a linear-algebra technique called singular value decomposition to a matrix of term counts per document, uncovering latent “topics”, and then compares documents (i.e. any given pieces of text media) by the cosine similarity of their positions in that reduced topic space [4].
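Under the hood, LSI builds a matrix counting which terms appear in which documents, factorizes it with a singular value decomposition (SVD), and keeps only the strongest latent dimensions before comparing documents. Here is a minimal numpy sketch on a toy matrix; the words, counts and choice of k = 2 are made up for illustration, and this is the textbook recipe rather than our production pipeline:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# Entry (i, j) counts how often term i appears in document j.
#                 d0  d1  d2
A = np.array([[2, 2, 0],   # "election"
              [1, 1, 0],   # "senate"
              [0, 0, 3],   # "recipe"
              [0, 1, 2]])  # "sugar"

# Truncated SVD: keep only the k strongest latent dimensions ("topics").
k = 2
U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vecs[0], doc_vecs[1]))  # the two politics documents score high
print(cosine(doc_vecs[0], doc_vecs[2]))  # politics vs. cooking scores near zero
```

The truncation is the whole point: by throwing away the weakest dimensions, documents that use different but related words are pulled closer together than raw word counts alone would suggest.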

Now that’s a lot of math terms to take in all at once, but don’t worry, they won’t all be in the quiz at the end. The goal is simply to explain in rough terms what the idea behind the algorithm is, so how exactly we’re using linear algebra isn’t really that important. Also, there is no quiz at the end. Unless you’re weirdly into self-testing, in which case, shine on, you crazy diamond.

It all boils down to the semantic idea that words which appear in similar contexts carry similar meanings. If that is true, then we can understand a lot about a particular word by looking at the context around it. We start by breaking every document into words, removing “useless” words such as those that are too common and those that are too rare. This gives us a set of words we can reasonably expect to gain some useful information from. That information comes from cosine similarity. It’s a fancy name for turning each document into a vector of word counts, a point in a coordinate grid, and calculating the angle between any two documents in that grid.
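In plain Python, that pipeline (tokenize, drop uninformative words, count, compare) might be sketched like this. The document-frequency cutoffs and the toy sentences are illustrative assumptions, not our actual parameters:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    # Keep words appearing in at least min_df documents (not too rare)
    # but not in nearly every document (not too common to be informative).
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    n = len(docs)
    return {w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio}

def vector(text, vocab):
    # Bag-of-words vector: word -> count, restricted to the vocabulary.
    return Counter(w for w in tokenize(text) if w in vocab)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "The senate passed the election reform bill today",
    "Election officials say the senate bill changes voting rules",
    "This cake recipe needs two cups of sugar and butter",
    "A sugar-free cake recipe with butter substitutes",
]
vocab = build_vocabulary(docs)
vecs = [vector(d, vocab) for d in docs]
print(round(cosine(vecs[0], vecs[1]), 3))  # high: both about the election bill
print(round(cosine(vecs[0], vecs[2]), 3))  # 0.0: politics and baking share no kept words
```

Words like “passed” or “substitutes” appear in only one document, so they are filtered out before any similarity is computed.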

Imagine a neighborhood in a city. Each document is a building, placed according to the words it’s made from. Some are studio apartments, some are semi-detached duplexes and there are a few garages thrown in, based on what words best describe them. When you stand at the corner of the neighborhood, the document-buildings form an arc in front of you. If you point at any two buildings, the angle between your arms tells you the similarity between them. More accurately, the similarity is the cosine of that angle, but that’s just a mathematical transformation.

Neighborhood example, with β denoting the angle between the two buildings

What’s interesting about this method is that it doesn’t depend on where exactly those buildings are. You could have one building right next to you and the other on the far end of the neighborhood, but the angle doesn’t change with distance. We intentionally ignore the real make-up of words in the documents and only focus on how similar to each other they are via the angle calculation. After all, that’s what we’re really after and that’s the relationship that will let us show relevant content to the reader.
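That scale-invariance is easy to check directly: doubling every word count in a document (as if the text were pasted in twice) moves its point further out but does not change its direction, so the cosine stays at 1. A quick sanity check in plain Python:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

doc = [3.0, 1.0, 0.0, 2.0]       # term counts for one document
doubled = [2 * x for x in doc]   # the same document, pasted in twice
print(cosine(doc, doubled))      # ~1.0: the angle is unchanged
```

This is exactly why a short blog post and a long feature article on the same topic can still score as highly similar.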

All Together Now

Using this measurement we can build groups of documents that look similar to each other and organize them into metadata. That data can then be attached to any content that’s being shown in the browser. Having such a wide overview of any particular topic is an effective means of exposing more context and a breadth of topic coverage from outside the social media organized echo chambers.
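The grouping step itself can be done in many ways; one minimal sketch, assuming a simple fixed similarity cutoff (the 0.5 threshold and the scores below are illustrative, not our production values), is to link any two documents whose cosine score clears the cutoff and take the connected components:

```python
def group_by_similarity(scores, n, threshold=0.5):
    """scores: dict mapping (i, j) pairs (i < j) to a similarity in [0, 1].
    Returns a sorted list of groups (lists of document indices), linking any
    two documents whose similarity reaches the threshold (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for (i, j), s in scores.items():
        if s >= threshold:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy scores: documents 0 and 1 are near-duplicates, 2 and 3 share a topic.
scores = {(0, 1): 0.92, (0, 2): 0.08, (1, 3): 0.12, (2, 3): 0.67}
print(group_by_similarity(scores, 4))  # [[0, 1], [2, 3]]
```

Each resulting group can then be attached as metadata to every document it contains, giving the reader one-click access to related coverage.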

Piercing these boundaries between isolated groups is not a simple task, and our solution is one tool in a greater drive towards improving news literacy. We are unlikely to achieve results by directly fighting the systems of incentives that brought about the current era of clickbait. After all, those incentives exist for a reason: they are a coping mechanism for readers. Instead, we propose that our approach, based on accepting user preferences and offering relevant additional context, is a far more productive method than trying to forcibly refocus a reader’s attention or cram more information into their attention span.

It is our hope that in these times of rising anxiety about AI, chatbots and other computer-generated content, we can show how these methods can be a force for good: a useful tool in the arsenal of news literacy. As we face an increasing volume of shared content across the internet and social media, this kind of literacy will be critical in teaching readers to separate fact from fiction. And with our help, we hope, that will be a slightly more approachable task.

Hey, there is a quiz at the end after all! And the question is — which is the best browser extension you can use right now? Well, probably one of the ad blockers or a password manager. But we come right after them. Try our current solution at infold.ai and give us a follow on Twitter for future updates.

by Primož Jeras

[1]: The echo chamber effect on social media, Cinelli et al, 2021

[2]: Americans Who Mainly Get Their News on Social Media Are Less Engaged, Less Knowledgeable, Pew Research Center, 2020

[3]: Tactics of news literacy: How young people access, evaluate, and engage with news on social media, Joëlle Swart, 2023

[4]: Introduction to Information Retrieval, section 18.4, “Latent semantic indexing”, Manning et al, 2008, Cambridge University Press
