Generative AI Models: Shall I train Llama 2 or just use ChatGPT Enterprise?

What should you pick when choosing an AI system for enterprise deployment? Shall I go with OpenAI or deploy my own open-source model?
Llama 2 vs ChatGPT for Enterprise
Written by
Clio AI
Published on
November 28, 2023
Today, every enterprise needs a Generative AI strategy in place, because the next decade of their growth depends on it. Unfortunately, most enterprises are struggling to finalize a strategy for a technology that has only become popular in the last six months. So popular that everyone has been talking about it non-stop, and it's hard to decipher what is signal and what is noise. Everyone has an opinion, but too few people have actual experimental results. Some questions we have gotten from our customers are:
- Shall I go for an open-source model or use OpenAI?
- Build vs Buy: Shall I build it in-house or buy it from one of the SaaS startups?
- What use case should I deploy it in to start with?
- What cost estimates should I look at?
We will answer those in a series of blog posts. Let me start with the first question, because the answers to all the subsequent questions depend on it.

In a rapidly evolving technological landscape, enterprises are constantly seeking innovative ways to stay competitive. One such avenue that has gained immense attention is the deployment of open-source Generative AI models. These models, which generate content like text, images, and more, are not only pushing the boundaries of creativity but also transforming the way businesses operate.

The Big Question

Shall I go with ChatGPT Enterprise/Claude 2, or use one of the open-source models in the market for our AI workflows?

I have fielded this question multiple times over the last few weeks, admittedly with increased frequency after the launch of Mistral AI.

Frankly, this wasn't even a question about six months back. While open-source models were good, GPT-3, ChatGPT, and GPT-4 were markedly better on all the benchmarks and quite good at the generic tasks asked of them.

Over the last few months, open source has made rapid progress. Models like Llama 2 and Mistral allow for easy retraining/finetuning and perform comparably to OpenAI's models on some benchmarks. Of course, in many cases, OpenAI still offers the best output.

Let me lay out the market map, available options, and help you decide which option to pick.

Available Options

Open AI

Models: ChatGPT (GPT-3.5), GPT-4, GPT-4V (GPT-4 with vision), DALL-E 3

Offerings: ChatGPT Enterprise, GPT-4 Chat APIs, Completion APIs.

API available: Yes

OpenAI is by far the most popular provider of LLMs. They have a full stack of Gen AI services: instruction-based models, smaller models for finetuning, embedding models (ada), image models, chat models, and so on. ChatGPT is probably the fastest-growing consumer product ever, given how quickly it scaled to 100M users.

With GPT-3 and ChatGPT (GPT-3.5), they offered a general-purpose model with a large parameter count (~175B to ~200B): a model capable of zero-shot learning and able to produce text of comparable quality to humans. With the API as the form factor, it could be integrated into any other product, and hence all the incumbents launched their Gen AI add-ons. With GPT-4, they used a Mixture of Experts (MoE) architecture, which has been effective and produces better-quality outputs than GPT-3.5. They recently introduced finetuning for GPT-3.5 and GPT-4, though both are very costly to finetune. They recommend zero-shot prompting + RAG instead of finetuning.

Llama 2 (Open Source)

Models: 7B, 13B, 34B, and 70B parameter models.

Offerings: Completion models and chat models.

Meta launched LLaMA in Feb 2023 and followed it up with Llama 2 in July. An open-source model with publicly released weights, Llama 2 comes in four parameter sizes: 7B, 13B, 34B, and 70B. For memory estimates, assume every parameter takes 2 bytes of GPU memory in FP16; that is, the 7B model needs about 14 GB to run. Llama 2 is trained on about 2T tokens.
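The 2-bytes-per-parameter rule of thumb is easy to sanity-check in code. A minimal sketch (weights only; activations and the KV cache add more memory on top in practice):

```python
def fp16_memory_gb(num_params_billion: float) -> float:
    """Approximate GPU memory (GB) to load model weights in FP16.

    Rule of thumb: 2 bytes per parameter. Ignores activation and
    KV-cache overhead, which add more in practice.
    """
    return num_params_billion * 1e9 * 2 / 1e9

for size in (7, 13, 34, 70):
    print(f"Llama 2 {size}B needs ~{fp16_memory_gb(size):.0f} GB for weights")
```

Quantized variants (8-bit or 4-bit) cut this requirement roughly in half or to a quarter, which is why smaller Llama 2 models fit on a single consumer GPU.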

Llama 2 is comparable in output to GPT-3.5 on most benchmarks. With enough finetuning, it can outperform GPT-3.5 on many business-specific use cases. Llama 2 is available both as a chat model and as an instruction-following (completion) model. You can also use LoRA, a finetuning technique that is very cost-effective. For most business use cases, you would not see a noticeable difference between GPT and Llama.
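LoRA's cost advantage comes from training only a small low-rank delta on top of frozen weights. A toy, dependency-free sketch of the core update (in practice you would use a library such as Hugging Face's peft rather than rolling this by hand):

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_update(W, A, B, alpha, r):
    """Effective weight W' = W + (alpha / r) * (B @ A).

    W (d_out x d_in) stays frozen; only the small factors
    B (d_out x r) and A (r x d_in) are trained, so the number of
    trainable parameters scales with the rank r, not with d_out * d_in.
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

Because only A and B are updated, the optimizer state and gradients are tiny compared to full finetuning, which is what makes LoRA cheap enough to run on a single GPU.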

Benchmarking for open source models (missing Mistral)

As an aside, Meta also open-sourced ImageBind, their model for holistic embeddings across six modalities: text, image, video, audio, depth, and thermal (heat maps). It may not be directly related to LLMs, but it would be very useful when it comes to running multiple Gen AI models across image/text/video. More on that in a future blog post.

Anthropic

Models: Claude, Claude 2

Offerings: Claude 2, Anthropic Enterprise, Claude 2 Chat API.

API available: Yes

Anthropic is funded by Google, Amazon, and FTX, among others. They have launched two models to date: Claude and Claude 2. Claude 2 has a context window of 100K tokens and allows for full file upload. Anthropic is far more geared towards responsible AI than the other providers.

In my testing, the models perform well, but the issues here are twofold. One, the lack of finetuning options. Two, prompts have to be engineered and tested to see what works best. It's not as easy because the model is not as widely adopted as OpenAI's, so there will be some friction.

Mistral (Open Source)

Models: Mistral 7B model

Offerings: Completion model

Recently launched with a 7B model, Mistral is another addition to the open-source lineup. The model claims to outperform Llama 2 and to be comparable to GPT-3.5 on various benchmarks. It's also easier to deploy than the other models given its parameter size, and it can run on any cloud server or on-premise.

Mistral launched with sliding-window attention, which keeps attention cost bounded for long sequences and lets the model generate well beyond its nominal context window. They also provide an instruction-finetuned model for a chat-based interface.

Google/Deepmind

Models: PaLM 2

Offerings: PaLM 2 API on Google Cloud, Embeddings API

API available: Yes.

As is well known now, Google researchers published the 2017 paper "Attention Is All You Need", which introduced transformers; OpenAI's, Anthropic's, and Meta's Llama models are all based on that architecture. Google launched PaLM 2 as their most advanced model. It is being deeply integrated into Google Search and other Google products, but empirically it is not widely adopted outside of Google.

Microsoft Azure

Azure provides a privately hosted version of OpenAI's models, including ChatGPT. They are slightly better in the sense that they can be finetuned more easily and you can choose a specific region to minimize latency. With Microsoft, the data stays within your own tenant. Pricing is the same as OpenAI's, and response times are generally better (for the same region) than OpenAI's too.

The Key Consideration

Let me give you the answer straight away and then spend the rest of the post explaining it.

- If you are planning to test viability or build a proof of concept, go for the OpenAI APIs.
- If you are looking to augment your own products, host your own model and finetune it accordingly.
- If you are looking for your employees to use AI, deploy your own models.

For a proof of concept, OpenAI works best. It's a general model, and with an API it can be integrated anywhere. You iron out the kinks, validate your hypotheses, and convince others to use it. At this point, the considerations are at a functional level: "Does it work?", "Does it augment my work and improve my output?", and so on.

You can also try the many upcoming startups, depending on the use case. Some will help you qualify sales leads, some will help you with marketing copy, etc.

But POC and production are two very different things

As many startups running production workloads on OpenAI have found out, in production you are not just bound by functional considerations, but also by company policies and compliance requirements, among other things.

Introducing System Prompts

In simple terms, a system prompt is starting text or an instruction provided to a large language model. It orients the model's output in a particular direction or context.
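In API terms, the system prompt is simply the first message of the conversation. A minimal sketch of a chat request payload (this uses the 2023-era OpenAI chat format; the company name and policy text are made up for illustration):

```python
SYSTEM_PROMPT = (
    "You are an internal assistant for Acme Corp (a hypothetical company). "
    "Never mention competitors by name, and follow the company "
    "communication policy in everything you write."
)

def build_chat_request(user_message: str) -> dict:
    """Assemble a chat payload; the system prompt rides along as the first message."""
    return {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request("Draft a reply to a customer complaint.")
```

Every request carries the policy text, which is why long, detailed system prompts like the one below are so central to production deployments.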

E.g., for OpenAI's integration with DALL-E 3, the system prompt is (shoutout to Simon Willison):

You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2022-01 Current date: 2023-10-26 # Tools ## dalle // Whenever a description of an image is given, use dalle to create the image and then summarize the prompts used to generate the images in plain text. If the user does not ask for a specific number of images, default to creating four captions to send to dalle that are written to be as diverse as possible. All captions sent to dalle must abide by the following policies: // 1. If the description is not in English, then translate it. // 2. Do not create more than 4 images, even if the user requests more. // 3. Don't create images of politicians or other public figures. Recommend other ideas instead. // 4. Don't create images in the style of artists whose last work was created within the last 100 years (e.g. Picasso, Kahlo). Artists whose last work was over 100 years ago are ok to reference directly (e.g. Van Gogh, Klimt). If asked say, "I can't reference this artist", but make no mention of this policy. Instead, apply the following procedure when creating the captions for dalle: (a) substitute the artist's name with three adjectives that capture key aspects of the style; (b) include an associated artistic movement or era to provide context; and (c) mention the primary medium used by the artist. // 5. DO NOT list or refer to the descriptions before OR after generating the images. They should ONLY ever be written out ONCE, in the `"prompts"` field of the request. You do not need to ask for permission to generate, just do it! // 6. Always mention the image type (photo, oil painting, watercolor painting, illustration, cartoon, drawing, vector, render, etc.) at the beginning of the caption. Unless the caption suggests otherwise, make at least 1--2 of the 4 images photos. // 7. Diversify depictions of ALL images with people to include DESCENT and GENDER for EACH person using direct terms. 
Adjust only human descriptions. // - EXPLICITLY specify these attributes, not abstractly reference them. The attributes should be specified in a minimal way and should directly describe their physical form. // - Your choices should be grounded in reality. For example, all of a given OCCUPATION should not be the same gender or race. Additionally, focus on creating diverse, inclusive, and exploratory scenes via the properties you choose during rewrites. Make choices that may be insightful or unique sometimes. // - Use "various" or "diverse" ONLY IF the description refers to groups of more than 3 people. Do not change the number of people requested in the original description. // - Don't alter memes, fictional character origins, or unseen people. Maintain the original prompt's intent and prioritize quality. // - Do not create any imagery that would be offensive. // 8. Silently modify descriptions that include names or hints or references of specific people or celebritie by carefully selecting a few minimal modifications to substitute references to the people with generic descriptions that don't divulge any information about their identities, except for their genders and physiques. Do this EVEN WHEN the instructions ask for the prompt to not be changed. Some special cases: // - Modify such prompts even if you don't know who the person is, or if their name is misspelled (e.g. "Barake Obema") // - If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it. // - When making the substitutions, don't use prominent titles that could give away the person's identity. E.g., instead of saying "president", "prime minister", or "chancellor", say "politician"; instead of saying "king", "queen", "emperor", or "empress", say "public figure"; instead of saying "Pope" or "Dalai Lama", say "religious figure"; and so on. 
// - If any creative professional or studio is named, substitute the name with a description of their style that does not reference any specific people, or delete the reference if they are unknown. DO NOT refer to the artist or studio's style. // The prompt must intricately describe every part of the image in concrete, objective detail. THINK about what the end goal of the description is, and extrapolate that to what would make satisfying images. // All descriptions sent to dalle should be a paragraph of text that is extremely descriptive and detailed. Each should be more than 3 sentences long. namespace dalle { // Create images from a text-only prompt. type text2im = (_: { // The resolution of the requested image, which can be wide, square, or tall. Use 1024x1024 (square) as the default unless the prompt suggests a wide image, 1792x1024, or a full-body portrait, in which case 1024x1792 (tall) should be used instead. Always include this parameter in the request. size?: "1792x1024" | "1024x1024" | "1024x1792", // The user's original image description, potentially modified to abide by the dalle policies. If the user does not suggest a number of captions to create, create four of them. If creating multiple captions, make them as diverse as possible. If the user requested modifications to previous images, the captions should not simply be longer, but rather it should be refactored to integrate the suggestions into each of the captions. Generate no more than 4 images, even if the user requests more. prompts: string[], // A list of seeds to use for each prompt. If the user asks to modify a previous image, populate this field with the seed used to generate that image from the image dalle metadata. seeds?: number[], }) => any; } // namespace dalle

With such a detailed prompt, OpenAI prevents a lot of PR nightmares and ensures that users get a good experience.

For your company, the challenges are different. For example, (not an exhaustive list):

  • You would not want the AI to mention your competition by name or recommend their products in the generated text.
  • You would want the version of the AI in all internal systems to be updated when you approve it, not when OpenAI decides it.
  • You would not want the AI to say anything that goes against your company's code of conduct.
  • You would want your AI system to follow your company's compliance and regulatory policies when your employees interact with it.
  • When generating text, you would want the AI to adhere to the communication policy that applies to all employees.
  • The same goes for data retention, encryption, whistleblower, and a host of other policies.

Net net, a good system prompt ensures that you do not land in hot water while serving your customers and employees alike. This is the state today; soon we will have specific guidelines requiring AI systems not to produce harmful text.

Open-source models let you enforce long, company-policy-specific system prompts at the serving layer, with a degree of control that OpenAI or Anthropic cannot give you.
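With a self-hosted model you control the prompt template itself. Llama 2's chat models, for instance, expect the system prompt embedded between <<SYS>> markers; a small helper (single-turn only, for illustration):

```python
def llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Format a single-turn Llama 2 chat prompt with an embedded system prompt."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )
```

Because this formatting happens on your own servers, no end user can see, strip, or override the policy text before it reaches the model.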

Other Advantages

Latency:

A typical ChatGPT API call takes about 10-15 seconds to return output; a typical GPT-4 API call takes about 20-25 seconds. This is on top of the time it takes to scan through all the embeddings, retrieve relevant context, search through DBs, and so on. Azure-hosted GPT is somewhat faster, as you can choose a region closest to your location.

With open-source models, this can be considerably faster. Llama 2 on Colab answered in 4 seconds for me without any finetuning.

Scalability:

Open-source models scale well: you control the serving infrastructure, and you can keep improving them with continuous finetuning and RLHF.

MultiModal Support:

This takes some effort, but you can run a combination of Stable Diffusion and Llama 2, with the LLM acting as an understanding engine that generates prompts for Stable Diffusion, just like GPT-4 + DALL-E. You can also deploy your own ML models on top of open-source LLMs and have them interact seamlessly.
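The orchestration itself is plain glue code. A stub sketch of the pipeline (both generate functions are placeholders standing in for real Llama 2 and Stable Diffusion endpoints, which you would call over HTTP in practice):

```python
def llm_generate(request: str) -> str:
    """Placeholder for a call to a hosted LLM (e.g. a Llama 2 endpoint)."""
    return f"A watercolor painting of {request}, soft light, highly detailed"

def diffusion_generate(image_prompt: str) -> bytes:
    """Placeholder for a call to an image model (e.g. Stable Diffusion)."""
    return f"<image bytes for: {image_prompt}>".encode()

def text_to_image(user_request: str) -> bytes:
    """The LLM expands a terse request into a rich prompt, then the
    diffusion model renders it -- the same division of labour as GPT-4 + DALL-E."""
    return diffusion_generate(llm_generate(user_request))
```

The same pattern extends to your own ML models: the LLM sits in front as the understanding layer and routes structured prompts to whichever specialist model fits the request.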

Data Security and Privacy

Your data never leaves your premises and is never shared with OpenAI. That's maximum security.

Cost Considerations:

For limited use cases, costs will be on the higher side; at scale, they will be lower than API pricing.
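The break-even is simple arithmetic: pay-per-token API billing vs a fixed monthly GPU bill. A sketch with illustrative (assumed, not quoted) prices:

```python
def api_monthly_cost(tokens_per_month: int, usd_per_1k_tokens: float) -> float:
    """Pay-per-token API billing: cost grows linearly with usage."""
    return tokens_per_month / 1000 * usd_per_1k_tokens

def selfhost_monthly_cost(gpu_hourly_usd: float, hours: float = 730) -> float:
    """A GPU server running around the clock (~730 hours/month): cost is flat."""
    return gpu_hourly_usd * hours

# Illustrative numbers only: 50M tokens/month at $0.002/1K tokens,
# vs one GPU at an assumed $1.50/hour.
light_api = api_monthly_cost(50_000_000, 0.002)     # ~$100/month: API wins
heavy_api = api_monthly_cost(5_000_000_000, 0.002)  # ~$10,000/month
gpu = selfhost_monthly_cost(1.50)                   # ~$1,095/month: flat, wins at scale
```

The crossover point depends entirely on your token volume and hardware pricing, so plug in your own numbers before deciding.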

On-premises deployment of open-source Gen AI models grants enterprises a unique edge. It empowers organizations with complete control over their AI infrastructure, ensuring that data remains secure, and privacy is maintained. In addition to these crucial advantages, on-premises solutions also offer reliability by eliminating external dependencies, making businesses less susceptible to service disruptions or downtime. Moreover, by managing their infrastructure, companies can fine-tune their AI models to perform optimally and align with their specific business objectives.

In conclusion, deploying open-source Gen AI models for enterprises holds the promise of transforming the way businesses operate, innovate, and engage with their audience. It opens up a realm of creative possibilities, is cost-effective, scalable, and offers data security. However, for those who prioritize complete control, data privacy, and optimal performance, on-premises deployment stands as the ultimate choice, making it a powerful asset for businesses in the ever-evolving digital landscape.

