Tagging 101: What is Auto Classification?
Auto classification is a methodology for scanning the contents of a document and automatically assigning tags, or descriptive labels and keywords, to the document. With the support of taxonomies and knowledge graphs, it indexes content into appropriate categories and classes to provide rich metadata that eases the content management process.
TABLE OF CONTENTS
The common challenges users experience in their CMS
What’s the solution? Auto classification
Auto classification with knowledge graphs
The benefits of implementing auto classification
Adobe Experience Manager
Quick rundown of the benefits of auto classification:
- faster to find relevant documents and content
- more accurate filtering of non-relevant information
- consistent tagging based on controlled enterprise vocabularies
- automatic or semi-automatic content tagging based on text mining
- consistent tagging of multilingual content thanks to controlled vocabularies and business glossaries
- standards-based metadata of high quality
- tools that can interlink structured and unstructured information across data silos
While these CMS and DXP have reliably helped the organizations that use them, they are not always the easiest to maintain, simply because they cannot keep up with the volume of different material that an organization produces. This is where auto classification (or semantic tagging) comes in: it allows users to automatically categorize and tag content with rich descriptors that improve the findability of content in search engines or recommender systems.
The common challenges users experience in their CMS.
Too much unstructured content to organize
The same can be said for multilingual companies where documents are in different languages and even numerical spreadsheets rely on linguistic labels. In many cases, organizations with offices across the world may need to pour resources into translating the material needed for company-wide reports or projects.
Content in CMS is hard to find
What’s the solution? Auto Classification.
Metadata is data that describes other data. For example, a document will have metadata describing its file type and size, date of document creation, author(s), the dates of any changes, etc. By defining these specific elements of a document, it is easier to find, use, and manage the document. For more information about metadata, check out our complete guide here >
More information about the auto classification workflow can be found by jumping ahead to the PoolParty infrastructure section >
Altogether, auto classification serves as a way to annotate unstructured text and data so that it is organized and machine readable. The annotated documents are synced back to the CMS together with their semantic metadata. Once the content is machine readable, users can build robust knowledge hubs, search engines, recommender systems, etc. on top of it.
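To make the annotation step concrete, here is a minimal sketch of taxonomy-based tagging in Python. It is not PoolParty's implementation: the `Concept` class, the `TAXONOMY` entries, and the `tag_document` function are all hypothetical names chosen for illustration, and plain label matching stands in for real text mining.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A taxonomy concept with one preferred label and optional synonyms."""
    pref_label: str
    alt_labels: list[str] = field(default_factory=list)


# Hypothetical mini-taxonomy, for illustration only.
TAXONOMY = [
    Concept("renewable energy", ["green energy", "clean energy"]),
    Concept("solar power", ["photovoltaics", "solar energy"]),
    Concept("wind power", ["wind energy"]),
]


def tag_document(text: str, taxonomy: list[Concept]) -> list[str]:
    """Return the preferred labels of all concepts mentioned in the text."""
    lowered = text.lower()
    tags = []
    for concept in taxonomy:
        labels = [concept.pref_label, *concept.alt_labels]
        # Whole-word, case-insensitive match on any label of the concept.
        if any(re.search(rf"\b{re.escape(label.lower())}\b", lowered) for label in labels):
            tags.append(concept.pref_label)
    return tags


doc = "Investment in green energy is rising, led by photovoltaics."
print(tag_document(doc, TAXONOMY))  # ['renewable energy', 'solar power']
```

Because every synonym resolves to the same preferred label, two documents that use different wording still end up with the same tag.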
Tags and semantic metadata explained.
Auto Classification with Knowledge Graphs.
Much like boxes that identify how an item relates to a category, and a category to a room in the house, semantic tags that are mapped in a knowledge graph identify relationships between concepts, terms, and documents, as well as the contents within those documents. With semantic tags, you can bundle these relationships together by adding labels of synonymous terms that make search platforms function smarter. When the semantic metadata is stored in a knowledge graph, documents can be indexed and queried better, allowing for precise user search.
In a CMS, documents can be tagged with authors, topics, authoring dates, etc. If a user is looking for a document by one particular author, all those documents tagged with the same author will be retrieved so that the user does not have to sift through the whole database. In addition, users can more easily locate documents based on the extracted topics and classification, e.g., when searching for news about ‘renewable energy’ vs. event articles about it.
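As a rough illustration of this kind of metadata-driven retrieval, the following Python sketch filters a handful of tagged documents. The document records, field names, and the `find` helper are made up for the example and do not reflect any particular CMS.

```python
# Hypothetical tagged documents; all values are illustrative.
DOCS = [
    {"title": "Solar subsidies announced", "author": "A. Meyer",
     "topic": "renewable energy", "type": "news"},
    {"title": "Renewables summit", "author": "B. Khan",
     "topic": "renewable energy", "type": "event"},
    {"title": "Quarterly HR update", "author": "A. Meyer",
     "topic": "human resources", "type": "news"},
]


def find(docs: list[dict], **criteria: str) -> list[str]:
    """Return titles of documents whose metadata matches every criterion."""
    return [
        d["title"]
        for d in docs
        if all(d.get(key) == value for key, value in criteria.items())
    ]


# All documents by one author.
print(find(DOCS, author="A. Meyer"))
# News about renewable energy, excluding event articles on the same topic.
print(find(DOCS, topic="renewable energy", type="news"))
```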
All this metadata helps the user become better oriented to their CMS so they can use it more efficiently.
The benefits of implementing auto classification.
Auto classification breeds accuracy and efficiency
Furthermore, manually tagging entire databases, file by file, often involves a lot of people, which may lead to many errors and inaccuracies. In many cases, two individuals categorize the same information in two different ways or miss data simply because they could not cover it all by hand. A leading company in consulting and management experienced this exact problem and sought out PoolParty’s PowerTagging to fix it. Read more about this success story here >
Auto classification not only improves knowledge management workflows by reducing the amount of time spent curating metadata by hand, it also improves the quality of the metadata that is produced with it. The tags are made against the taxonomy to define clear rules for future classification so that the same rules can be used consistently for every piece of content. Since the behavior is pre-configured and approved ahead of time, there is less room for error or inaccuracy.
Rich metadata and data governance
Active semantic metadata highlights missing and incorrect data, but also helps improve the quality of analytics by automatically correcting and enriching the data. In an automated tagging workflow, the pre-configured rules defined for the tagging behavior ensure that data is compliant and any errors are obvious. Besides helping comply with regulatory and business requirements, metadata management helps assess the impact of a change within a data source. It also supports accountability for the terms and definitions of a business glossary to lead organizations towards the development of a standardized data model.
These are all components of data governance, which Gartner defines as “the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption and control of data and analytics.” Organizations should strive to achieve data governance in order to comply with legal regulations as well as internal policies and procedures. Metadata management (and thus data governance) helps security and risk professionals be ahead of problematic scenarios by classifying data according to risk and security needs. It facilitates data lineage, impact analysis, and data management to reinforce privacy requirements.
Intelligent search and personalized content
On an ecommerce website, for example, a user who searches for “sweater” could also be recommended products that contain the keywords “hoodie” or “cardigan” because they have been bundled together with tags.
The categorization aspect of auto classification takes it a step further by putting these terms relating to sweater in appropriate categories or classes. If we consider a user’s filtering path on a clothing website, they might first toggle through the different product categories: Shirts, Coats, Dresses, Pants, Shoes > from Shirts they can select Graphic Tees, Sweaters, and Blouses > from Sweaters, they can toggle different options such as size or color. These products have all been tagged with appropriate keywords so that the correct products show up with each narrower category. This is precisely the behavior of a taxonomy that is organized on the backend which couples the bundled tags and classes together so that the user has a seamless experience using keyword search in text fields, or filtered searches from the product menus.
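The clothing example can be sketched in a few lines of Python. The category tree, synonym bundle, and product list below are invented for illustration, and the functions are a simplified stand-in for what a taxonomy-backed search would do.

```python
# Hypothetical product taxonomy: each category maps to its narrower categories.
NARROWER = {
    "Clothing": ["Shirts", "Coats", "Dresses", "Pants", "Shoes"],
    "Shirts": ["Graphic Tees", "Sweaters", "Blouses"],
}

# Synonymous terms bundled under one category (illustrative values).
SYNONYMS = {"Sweaters": {"sweater", "hoodie", "cardigan"}}

PRODUCTS = [
    {"name": "Grey zip hoodie", "category": "Sweaters"},
    {"name": "Wool cardigan", "category": "Sweaters"},
    {"name": "Band logo tee", "category": "Graphic Tees"},
    {"name": "Rain coat", "category": "Coats"},
]


def descendants(category: str) -> set[str]:
    """Return the category plus every narrower category below it."""
    found = {category}
    for child in NARROWER.get(category, []):
        found |= descendants(child)
    return found


def browse(category: str) -> list[str]:
    """Filtered search: products in the category or any narrower one."""
    scope = descendants(category)
    return [p["name"] for p in PRODUCTS if p["category"] in scope]


def keyword_search(query: str) -> list[str]:
    """Keyword search: map the query to a category via its synonym bundle."""
    term = query.lower()
    for category, terms in SYNONYMS.items():
        if term in terms:
            return browse(category)
    return []


print(keyword_search("sweater"))  # ['Grey zip hoodie', 'Wool cardigan']
print(browse("Shirts"))  # ['Grey zip hoodie', 'Wool cardigan', 'Band logo tee']
```

Both entry points, the text field and the product menu, resolve to the same category structure, which is why the two paths return consistent results.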
In another scenario, picture a company that is responsible for the research and development of a new drug. This project requires extensive reading and writing about the Delta Variant using material from their CMS, which contains thousands of research documents. A simple search for “Covid 19 mutation” returns all the literature in the database containing those keywords, including news articles, internal HR updates for home office rules, etc.
The obvious problem here is that most of this information is of little value to the research, either because it has nothing to do with scientific information or because it is outdated, as when the search retrieves material about the Beta Variant from the early stages of the virus. A CMS search using semantic tools could infer the user’s intent because the content has been tagged and classified. Recommender systems or semantic search platforms (which are built on auto classification) are powerful in this case, because they can retrieve relevant and precise information based on filtering and confidence scores. In other words, documents which have been classified under the Delta Variant and tagged with recent publishing dates will get higher confidence scores and be shown at the top of the list of results.
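One way to picture filtering and confidence scores is the Python sketch below. The scoring formula, the 0.7/0.3 weights, the 0.5 threshold, and the sample documents are all assumptions made for the example; a real semantic search platform would compute its scores differently.

```python
from datetime import date

# Hypothetical, already-tagged documents (values are illustrative).
DOCS = [
    {"title": "Delta variant spike protein study",
     "class": "Delta Variant", "published": date(2021, 8, 1)},
    {"title": "Beta variant early findings",
     "class": "Beta Variant", "published": date(2020, 12, 1)},
    {"title": "Home office rules update",
     "class": "HR", "published": date(2021, 9, 1)},
]


def score(doc: dict, wanted_class: str, today: date) -> float:
    """Combine class match and recency into a score between 0 and 1."""
    class_match = 1.0 if doc["class"] == wanted_class else 0.0
    age_days = max((today - doc["published"]).days, 0)
    recency = 1.0 / (1.0 + age_days / 365)  # decays as the document ages
    return 0.7 * class_match + 0.3 * recency  # weights are arbitrary assumptions


def rank(docs: list[dict], wanted_class: str, today: date,
         min_score: float = 0.5) -> list[str]:
    """Drop low-scoring documents, then order the rest by descending score."""
    scored = [(score(d, wanted_class, today), d["title"]) for d in docs]
    kept = [(s, title) for s, title in scored if s >= min_score]
    return [title for s, title in sorted(kept, key=lambda pair: pair[0], reverse=True)]


print(rank(DOCS, "Delta Variant", today=date(2021, 10, 1)))
# ['Delta variant spike protein study']
```

With the threshold lowered to zero, the HR update and the older Beta Variant document reappear, but below the Delta Variant study.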
Auto classification helps the user find precise information in their CMS more easily so that tedious workflow is greatly reduced.
Setup and infrastructure:
How to integrate PoolParty into your CMS.
PoolParty PowerTagging, the comprehensive solution to this wish list, comprises important semantic features:
Classification of documents into domains
The Semantic Classifier uses knowledge graph and machine learning to precisely classify documents into their respective domains. This product only requires a small training dataset to understand how to classify your documents automatically. The classified documents will be sent to their correct domains in order for the tags to be extracted correctly.
Text mining and natural language processing
Extract terms, phrases, and entities from unstructured documents. Use algorithms and a rule engine to resolve ambiguities in everyday language or unstructured text, for example so that the machine understands that “Apple” refers to the tech company and not the fruit. The extractors can be set up according to specific knowledge domains. Read more about NLP in our white paper >
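The Apple example can be approximated with a very small rule-based sketch in Python. The cue-word sets and the `disambiguate` function are hypothetical and far simpler than a production extractor, which would draw its context from the knowledge graph rather than from a hard-coded dictionary.

```python
from typing import Optional

# Hypothetical context cues per sense; values are illustrative only.
SENSES = {
    "Apple (company)": {"iphone", "mac", "software", "shares", "ceo"},
    "apple (fruit)": {"pie", "orchard", "juice", "tree", "eat"},
}


def disambiguate(sentence: str) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the sentence."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    best_sense, best_overlap = None, 0
    for sense, cues in SENSES.items():
        overlap = len(words & cues)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense  # None when no cue word appears


print(disambiguate("Apple shares rose after the new iPhone launch."))  # Apple (company)
print(disambiguate("She baked an apple pie."))  # apple (fruit)
```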
Categorization of tagged documents into the taxonomy
Tagged documents can be sorted into the different concept schemes of the taxonomy. With the support of knowledge graphs, smart search applications and recommender systems can be built by recalling the concepts defined from the annotated documents.
A simple breakdown of the PoolParty setup is as follows:
PoolParty integration with SharePoint.
The following are critical drawbacks of SharePoint’s native search functionalities:
- User search experience depends heavily on how the feature was set up by administrators.
- Search is limited to the database the user is working with.
- Without additional customization, search results cannot be filtered by any category other than the age of the document.
- Unfortunately, SharePoint’s out-of-the-box search functionality is not capable of returning results that are timely, comprehensive and relevant.
Backed by automated tags in PoolParty PowerTagging, SharePoint’s organization and search retrieval is vastly improved. Read more about PoolParty’s ready-made integration with SharePoint here >
PoolParty integration with Tridion.
PoolParty has partnered with RWS, the creators of Tridion, to enhance Tridion’s tagging capabilities and deliver an intelligent solution.
PoolParty integration with Adobe Experience Manager.
AEM Tags have the following limitations:
- Synonyms and alternative labels and terms are not supported by default
- Polyhierarchy is not available
- Tags do not have their own unique ID, meaning if you move tags around or change their details, the system could not identify their relations or redirect them correctly
The integration with PoolParty works as follows:
- In PoolParty Thesaurus Server, build and maintain a taxonomy which allows you to easily tag and control your content vocabularies
- Using Mekon’s Semantic Booster, sync the taxonomies you built to AEM Tags
- The tagging process is made much easier and smarter using PoolParty’s sophisticated classification algorithm
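To show why unique IDs, alternative labels, and polyhierarchy matter, here is a small Python sketch of a concept record. It is an assumption about how such data could be modeled, not a description of PoolParty, Semantic Booster, or AEM internals; the `ex:` identifiers, the page path, and the field names (loosely borrowed from SKOS terminology) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    uri: str         # stable identifier; never changes
    pref_label: str  # display name; may be edited freely
    alt_labels: set[str] = field(default_factory=set)  # synonyms
    broader: set[str] = field(default_factory=set)     # URIs of parent concepts


concepts = {
    "ex:clothing": Concept("ex:clothing", "Clothing"),
    "ex:sportswear": Concept("ex:sportswear", "Sportswear"),
    # Polyhierarchy: one concept sits under two parents at once.
    "ex:hoodie": Concept(
        "ex:hoodie",
        "Hoodie",
        alt_labels={"hooded sweatshirt"},
        broader={"ex:clothing", "ex:sportswear"},
    ),
}

# Content references the concept by URI, not by label.
page_tags = {"/products/123": ["ex:hoodie"]}

# Renaming the concept does not break the tag on the page.
concepts["ex:hoodie"].pref_label = "Hoodies"
labels = [concepts[uri].pref_label for uri in page_tags["/products/123"]]
print(labels)  # ['Hoodies']
```

Since the page stores only the identifier, label edits and moves within the hierarchy leave existing tags intact.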