Content filtering refers to an automatic system that processes large volumes of data and takes action on any content that meets certain criteria. Publishers often use text- and media-filtering solutions to handle the bulk of the user-generated content on their sites. These systems typically filter content such as adult material and illegal filesharing, as well as the sale of firearms, drugs, alcohol and tobacco.
Developing an in-house solution
Many publishers choose to develop their own filtering system. This decision can have the following benefits:
- Text-based filtering can be relatively easy to code
- It is often significantly cheaper than commercial solutions
- The publisher knows their site and users best and can anticipate policy issues better than anyone else
Creating a list of keywords
- Compile your own list of words and phrases that you wish to filter. You can use your own intuition or get some help:
- Ask your employees to contribute
- Reach out to your users for help
- Use the Google Ads Keyword Planner
- For additional inspiration take a look at websites that host undesirable content (adult and/or filesharing sites for example), and find out which keywords show up frequently on these.
- Code your own automatic keyword scraping tool:
- Use search engine data to go through all pages on a site
- Retrieve a list of unique words and word combinations on it
- Keep the most commonly used keywords and discard the rest. Don’t forget to eliminate stop words: common articles and conjunctions such as ‘a’, ‘and’ or ‘the’.
- Output as a text file
- Repeat the above for any number of sites until you are satisfied with your list, and you’re done.
- Important: Scraping other sites and using their content as your own is against the Google Publisher Policies and the Spam policies for Google web search and might also be illegal and/or unethical.
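The keyword-extraction steps above can be sketched in a few lines of Python. This is a minimal illustration that operates on text you already have the right to process (per the policy warning above); the stop-word list and the `top_n` cutoff are assumptions you would tune for your own site.

```python
import re
from collections import Counter

# A small illustrative stop-word list; a real filter would use a longer one.
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "on", "is", "it", "for"}

def extract_keywords(text, top_n=50):
    """Return up to top_n of the most frequent words in text, minus stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 1)
    return counts.most_common(top_n)

def save_keyword_list(texts, path, top_n=50):
    """Aggregate counts across several documents and write one keyword per line."""
    total = Counter()
    for text in texts:
        total.update(dict(extract_keywords(text, top_n)))
    with open(path, "w") as f:
        for word, count in total.most_common(top_n):
            f.write(f"{word}\t{count}\n")
```

Repeating `save_keyword_list` over several corpora and merging the output files corresponds to the final step of the list above.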
Not all words are created equal, and some keywords are worse than others. You should therefore consider assigning different weights to different terms.
For example, an English-language adult filter should weight the word ‘porno’ more heavily than ‘sex’. While ‘porno’ is almost exclusively related to content that is not family-safe, ‘sex’ may also mean ‘gender’, depending on the context it is used in.
Also consider words that are safe on their own but, combined with another word, indicate something else entirely. The word ‘pictures’, for example, is innocent enough, but ‘teen pictures’ would often refer to pornography.
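A weighted filter along these lines can be sketched as follows. The weight values and the threshold here are purely illustrative; multi-word phrases sit in the same table so that combinations like ‘teen pictures’ carry their own score.

```python
import re

# Hypothetical weights: higher means a stronger signal.
KEYWORD_WEIGHTS = {
    "porno": 10,
    "sex": 3,            # ambiguous on its own: may mean 'gender'
    "teen pictures": 8,  # two safe words whose combination is not
}

def score_text(text, weights=KEYWORD_WEIGHTS):
    """Sum the weights of every keyword or phrase found in the text."""
    lowered = text.lower()
    return sum(
        w for kw, w in weights.items()
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    )

def is_flagged(text, threshold=8):
    """Flag content whose total score meets or exceeds the threshold."""
    return score_text(text) >= threshold
```

Matching on word boundaries (`\b`) avoids false hits on substrings, so ‘sussex’ does not trigger the ‘sex’ entry.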
Method 1 - User-generated content is scanned after it is displayed on a page:
- Scan the content
- Flag if it meets filtering criteria
- Disable ad serving on the page hosting said content
- Manually review content:
- If it is safe, enable ad serving and adjust filters
- If it is not, make sure the content is not displayed on pages that include ad code
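The Method 1 flow might look like this in outline. The page structure, the review queue, and the ad-toggling functions are all placeholders for whatever your CMS and ad server actually expose; `is_flagged` is any filter function such as the weighted one above.

```python
# Hypothetical in-memory review queue; a real system would persist this.
review_queue = []

def disable_ads(page):
    page["ads_enabled"] = False

def enable_ads(page):
    page["ads_enabled"] = True

def scan_live_page(page, is_flagged):
    """Method 1: scan content already live; pull ads pending manual review."""
    if is_flagged(page["content"]):
        disable_ads(page)          # stop serving ads on the page immediately
        review_queue.append(page)  # hold the page for a human moderator
```

After manual review, a moderator would call `enable_ads` on safe pages (and adjust the filters), or keep the content off ad-serving pages otherwise.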
Method 2 - User-generated content is scanned before it is made available to users:
- Scan the content
- Flag if it meets filtering criteria
- Queue it for review or reject it outright
- Manually review content:
- If it is safe, publish it on pages that serve ads and adjust filters
- If it is not, either display it only on pages without ad code or reject it
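Method 2 reduces to a single decision made before anything goes live. A sketch, with hypothetical status strings and `is_flagged` standing in for your filter:

```python
def moderate_submission(content, is_flagged, reject_outright=False):
    """Method 2: decide a submission's fate before it is shown to users."""
    if not is_flagged(content):
        return "published"        # safe: goes live on ad-serving pages
    if reject_outright:
        return "rejected"         # fails filters and policy allows no appeal
    return "pending_review"       # hold for a human moderator
```

Content left in `pending_review` then follows the manual-review branch above: publish it with ads if safe, otherwise serve it without ad code or reject it.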
Commercial solutions in a nutshell
A number of services provide content filtering, and a few specialize in specific types of content, such as adult material or copyrighted works. There are also crowdsourcing platforms that connect publishers with users looking to make money online by reviewing content. The best approach is to do some market research: look for sites that review software and see which user-generated-content filtering systems they recommend. With that information at hand, choose the solution that best fits your service based on each product’s score, its unique features and its pricing model.