
In today’s data-driven world, harnessing the power of Artificial Intelligence (AI) is no longer a luxury but a necessity. For companies with new or existing content services implementations, the prospect of integrating AI may seem daunting, with concerns about cost and complexity. With Generative AI near the top of the hype cycle, it’s also difficult to keep up with what could be revolutionary vs. what is just hype. In this post, we will look at where organizations can start with AI, specifically Generative AI, in a cost-effective manner that does not require a large investment.
To make sure we are all starting with a common understanding of the key terms, below are the basics.
- Generative AI technology accepts inputs and creates new text, images, audio, code or other generated output.
- Foundational Models are “the brains” of these Generative AI solutions. They are pre-trained on vast amounts of data and designed to produce a wide variety of outputs.
- Large Language Models (LLMs) are the most common type of foundational model. They accept natural language inputs, interpret the text, and generate original content in a human-like fashion.
LLMs have exploded onto the scene in the past year, with OpenAI’s ChatGPT and Google’s Bard dominating headlines. The power and versatility of the LLM chatbots has been well documented, but it’s hard to know how and where organizations should get started.
Concerns Over Data Protection
Content Services platforms typically house high value content that should not be exposed outside the organization. There are real concerns about providing sensitive information to OpenAI, Google or any other AI platform. In the free tier of ChatGPT, any data you submit can be used by OpenAI to monitor and train the underlying model. However, per OpenAI’s privacy policy, data submitted via the paid API is not used for model training by default.
Organizations may still be hesitant to utilize ChatGPT’s API within their content services applications. In this case, we would recommend exploring ChatGPT Enterprise which offers enterprise-grade security, improved speed, and more. For customers on the Azure Cloud, Azure OpenAI Service offers similar features. The key to both of these offerings is that your data is sandboxed to your organization.
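As a minimal sketch of what an API-based integration looks like, the snippet below assembles a Chat Completions request that asks a question in the context of a single document pulled from a content repository. The prompt wording, model name, and sample document are illustrative assumptions; the actual call would be made through the official `openai` Python client (or the Azure OpenAI equivalent) with `client.chat.completions.create(**request)`, and per OpenAI's policy, data sent through the paid API is not used for model training by default.

```python
def build_chat_request(document_text: str, question: str,
                       model: str = "gpt-3.5-turbo") -> dict:
    """Assemble a Chat Completions payload that answers a question
    using only the supplied document as context. (Sketch: the request
    would be sent via the official openai client, not shown here.)"""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer using only the document provided."},
            {"role": "user",
             "content": f"Document:\n{document_text}\n\nQuestion: {question}"},
        ],
    }

# Hypothetical document text from a content repository
request = build_chat_request("Invoice #123 was paid on 2023-08-01.",
                             "When was invoice #123 paid?")
```

Keeping the prompt-assembly logic separate from the client call makes it easy to swap between OpenAI and Azure OpenAI endpoints later.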
We had the opportunity to explore this topic with Scorpius Cybersecurity, a cybersecurity and automation managed service provider, and their thoughts are:
Ensuring that your company’s private data is secure while using AI is a necessity. Since so many technology companies experience a data breach at some point, ensuring that only your organization has access to your private data is crucial. Utilizing the enterprise version of ChatGPT or any other AI service which does not store your data outside of your organization’s control ensures that your data is just as secure as it would have been without the usage of AI services.
Scorpius Cybersecurity
Overall, the concerns here are very reminiscent of the concerns over cloud computing 10 years ago. When working with clients then, we heard many worries about running sensitive systems in the cloud, and many organizations flatly said they would never move to the cloud. Now, in 2023, it’s hard to find an organization that doesn’t utilize the cloud in some fashion. We predict a similar shift will happen with the adoption of AI tools as vendors provide more and more options to address data privacy concerns. The primary difference is that the shift to cloud-based AI providers will happen much more quickly: the tech community is pouring interest and investment into AI, and companies are likewise under a lot of pressure to leverage AI to improve the speed of business.
Things to Think About… Current Limitations
Once an organization gets past the data privacy concerns mentioned above, the next logical step is figuring out how to get the AI engine to respond with useful information using the context of one or more of your documents. When thinking about this topic, it’s important to understand the concept of a Context Window. Simply put, the context window is the amount of text that the AI will “remember” over the course of a conversation. The context window is typically expressed as a number of tokens, with a token being approximately 4 English characters. See OpenAI’s description of tokens here. Using OpenAI’s assumption that 100 tokens is approximately 75 words, and that a page of text contains approximately 500 words, we can estimate the context window in pages as shown in the table below. Keep in mind that these are conservative estimates, as most documents are not wall-to-wall text.
| Model | Max Context Tokens | Approximate # of Pages |
| --- | --- | --- |
| OpenAI GPT-3.5 Turbo | 16k | 24 |
| OpenAI GPT-4 | 32k | 48 |
| Anthropic Claude | 100k | 150 |
Based on the table above, larger documents and document sets still pose a problem, but we predict that context limits will continue to grow rapidly. Rumors have also surfaced that OpenAI will soon provide a stateful API that remembers past conversations, as well as a 1 million token context window.
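The page estimates in the table can be reproduced with a couple of lines of arithmetic, using OpenAI's rule of thumb of 100 tokens per 75 words and the assumed 500 words per page:

```python
def tokens_to_pages(max_context_tokens: int) -> int:
    """Estimate how many pages of plain text fit in a context window,
    using 100 tokens ~= 75 words and ~500 words per page."""
    words = max_context_tokens * 75 / 100
    return round(words / 500)

# Reproduce the table above
for model, tokens in [("GPT-3.5 Turbo", 16_000),
                      ("GPT-4", 32_000),
                      ("Claude", 100_000)]:
    print(f"{model}: ~{tokens_to_pages(tokens)} pages")
```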
What About Larger Documents or Many Documents?
What if we want to have Generative AI work with a very large document, a subset of documents, or perhaps thousands or millions of documents? Imagine a chatbot that can respond based on the context of all of an organization’s documents and point users to specific documents in its responses, essentially replacing the search features of a typical Content Services implementation. This is a very appealing use case for improving knowledge workers’ productivity, but the LLM’s context window makes it impossible today.
At first glance, it may seem like the recently announced fine tuning API from OpenAI is the best way to train an LLM like ChatGPT on a set of documents. However, looking more closely at the documentation, fine tuning allows administrators to improve how ChatGPT responds and to enforce consistency and tone, but the model will not learn a large set of documents in order to respond intelligently in context. So, while fine tuning ChatGPT will not enable organizations to replace content search features, it can help in certain scenarios. We could imagine a helpdesk chatbot being fine tuned to respond to certain requests with references to specific document or work instruction names. For example, fine tuning may enable ChatGPT to point users to onboarding documentation when an employee asks a basic FAQ question about payroll or benefits.
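To make the helpdesk example concrete, OpenAI's fine-tuning API for chat models expects training data as JSONL, one example per line, each containing a `messages` array. The document names and phrasing below are hypothetical; the point is that fine tuning teaches the model a response style and routing behavior, not the contents of the documents themselves:

```python
import json

# Hypothetical training examples that teach the model to route FAQ
# questions to named onboarding documents (style/routing, not content).
examples = [
    {"messages": [
        {"role": "system", "content": "You are the HR helpdesk bot."},
        {"role": "user", "content": "When do I get my first paycheck?"},
        {"role": "assistant",
         "content": "Please see 'Onboarding Guide - Payroll' in the HR library."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are the HR helpdesk bot."},
        {"role": "user", "content": "How do I enroll in health benefits?"},
        {"role": "assistant",
         "content": "Please see 'Onboarding Guide - Benefits' in the HR library."},
    ]},
]

# Fine-tuning files are uploaded as JSONL: one JSON object per line.
jsonl = "\n".join(json.dumps(e) for e in examples)
```

The resulting file would be uploaded through the fine-tuning API; after training, the model tends to answer in the same "point to this document" style without ever having read the documents.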
What if we want to go beyond fine tuning? Having an AI that has full context over thousands or millions of documents could be a revolutionary disruption to many traditional content services systems. Barring any features that OpenAI, Microsoft, Google, or Anthropic release in the future, which given the pace of AI is certainly possible, there are a few tools currently available that warrant future research in this area:
- Unstructured.io – Unstructured is an Extract/Transform/Load (ETL) tool that takes unstructured data from files in a variety of formats, transforms it into AI-ready data, and then loads it into an LLM. This tool was mentioned in Deep Analysis’ article How to Train your LLM, and we agree that it’s a promising tool that warrants more research. A proof of concept connecting a Content Services platform like Alfresco or Documentum to Unstructured would be a very interesting project.
- LangChain – LangChain is a development toolkit for Python or JavaScript that allows for building context-aware AI applications. Per the documentation, LangChain seems to be a very powerful tool that would also be interesting to proof-of-concept with a content services tool. Our first impression is that LangChain would be a more DIY approach vs. using Unstructured.
- Something else? Given the pace of AI innovation in 2023 and beyond, we would not be surprised if additional tools pop up in this space.
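The common pattern behind tools like Unstructured and LangChain is retrieval-augmented generation: split documents into chunks, find the chunks most relevant to the user's question, and send only those chunks to the LLM, sidestepping the context-window limit. The sketch below uses naive keyword-overlap scoring in place of the embedding similarity search a real pipeline would use; the chunk texts and function names are illustrative:

```python
def score(chunk: str, question: str) -> int:
    """Count question words appearing in the chunk (a crude stand-in
    for the embedding similarity a real RAG pipeline would compute)."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for w in question.lower().split() if w in chunk_words)

def retrieve(chunks: list[str], question: str, top_k: int = 2) -> list[str]:
    """Return the top_k most relevant chunks to include in the prompt."""
    return sorted(chunks, key=lambda c: score(c, question), reverse=True)[:top_k]

# Hypothetical chunks extracted from a content repository
chunks = [
    "Vacation requests must be submitted two weeks in advance.",
    "Invoices are paid on net-30 terms by the finance team.",
    "The payroll schedule is biweekly, on alternating Fridays.",
]
relevant = retrieve(chunks, "When does payroll run?", top_k=1)
# 'relevant' would then be placed in the prompt ahead of the question.
```

Because only the retrieved chunks travel to the LLM, this pattern scales to millions of documents while staying inside the context windows tabulated earlier.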
Using AI in Content Services Without a Big Investment
Based on what we know today (September 2023), can your organization get started with AI without a large investment in money, time, and other resources? We think the answer is a resounding yes. Here’s what we would suggest:
- Start building integrations between your content services applications and ChatGPT using the API, which is fairly inexpensive. Do not encourage users to paste content into the public ChatGPT interface provided by OpenAI, due to privacy and data collection concerns.
- Utilize ChatGPT Enterprise or the Azure OpenAI Service if sensitive data is involved.
- Build AI functionality into your content services applications! Limiting AI interactions based on the maximum context window for your chosen model will allow you to keep costs as low as possible. Contact us to get started; we’d love to help you out.
Stay tuned to the Docuvela blog, LinkedIn, and Twitter/X as we dive into specific use cases that can be integrated easily into your Content Services applications without a large AI investment!