Large Language Models (LLMs) represent a transformative technology in the realm of generative artificial intelligence (AI). These machine learning models are designed to understand and generate text that mirrors human language, learning from vast amounts of textual data through a rigorous process of self-supervised and semi-supervised training.
LLMs have a broad range of applications, from generating text and automating workflows to sparking creative ideas and even writing software code. Some of the most prominent LLMs include OpenAI’s GPT series (such as GPT-3.5 and GPT-4), Google’s Gemini, Bard and PaLM, Meta’s LLaMA, BigScience’s BLOOM, Baidu’s Ernie 3.0 Titan, and Anthropic’s Claude 2.
Given the potential of LLMs to revolutionise business operations, many organisations are eager to integrate these models into their workflows. A common query that arises is whether it is necessary to develop task-specific custom LLMs to enhance performance. Integrating LLMs into business workflows requires careful planning and evaluation. This article is particularly relevant for organisations outside the technology sector contemplating the integration of LLMs into their AI ecosystem. Before addressing the fundamental question of whether to create a custom LLM, it is crucial to understand the prerequisites for developing a custom LLM.
Considerations before creating a custom LLM
Before embarking on the creation of a custom Large Language Model (LLM), several critical factors should be weighed:
- Data volume. LLMs are trained on extensive text datasets, often ranging from hundreds of gigabytes to terabytes in size. For instance, OpenAI’s GPT-3 model was trained on hundreds of billions of tokens gathered from various internet sources. Likewise, a custom LLM for the healthcare sector would require a large volume of data from sources such as medical journals, patient records, clinical trials, and health websites.
- Data quality. The performance of LLMs is directly influenced by the quality and quantity of the training data. Training LLMs with subpar datasets can lead to issues such as bias and overfitting. For example, a custom LLM for the legal sector would require high-quality data that is accurate, relevant, and up-to-date, as well as free from errors, inconsistencies, and duplication.
- Data diversity. The training data should be collected from a variety of sources, including books, web pages, scientific papers, and online forums. This diversity enables the model to learn nuanced language patterns and semantics. For example, a custom LLM for the entertainment sector would require diverse data that covers different genres, styles, formats, and audiences, as well as cultural and historical references.
- Data pre-processing. The creation of a custom LLM requires robust and flexible data pipelines that can handle tasks such as cleaning and normalisation, tokenisation and vectorisation, handling missing data, and data augmentation (a minimal pipeline sketch follows this list). For instance, a custom LLM for the education sector would require data pre-processing that ensures the data is suitable for the intended learning outcomes, such as readability, complexity, and alignment with the curriculum.
- Data security. It's crucial to secure datasets containing sensitive information to protect user privacy and comply with industry regulations. For example, a custom LLM for the finance sector would require data security that safeguards the data from unauthorised access, modification, or disclosure, as well as adheres to the relevant standards and policies.
Remember, the curation and annotation of a diverse training dataset that accurately represents the model's domain is a critical aspect of implementing AI solutions.
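To make the pre-processing step concrete, here is a minimal, illustrative Python sketch of a cleaning, deduplication, and tokenisation pipeline. It is a toy: the sample corpus, the whitespace tokeniser, and the exact normalisation steps are all stand-ins, and a production pipeline would use a trained subword tokeniser (such as BPE) and far more sophisticated near-duplicate detection.

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalise unicode, strip control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)  # drop control characters
    text = re.sub(r"\s+", " ", text).strip()      # collapse runs of whitespace
    return text

def deduplicate(docs: list[str]) -> list[str]:
    """Remove exact duplicates while preserving order."""
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def tokenise(text: str) -> list[str]:
    """Whitespace tokenisation as a stand-in for a subword tokeniser."""
    return text.lower().split()

# Tiny illustrative corpus (note the duplicate and the non-breaking space).
corpus = ["Patient was  prescribed 50mg\u00a0daily.", "Patient was  prescribed 50mg\u00a0daily."]
cleaned = deduplicate([clean_text(d) for d in corpus])
print([tokenise(d) for d in cleaned])
```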
In addition to data considerations, the time required to train a model is another significant factor:
- Model size. Larger models with more parameters take longer to train. For example, training GPT-3, which has 175 billion parameters, on a single NVIDIA Tesla V100 GPU would take 288 years (see the back-of-envelope estimate after this list). Similarly, a custom LLM for the travel sector would require a large model size to capture the complexity and variety of travel-related language, such as destinations, attractions, reviews, and bookings.
- Computational resources. The training time can be significantly reduced by using more powerful hardware or distributing the training process across multiple GPUs. For example, a custom LLM for the gaming sector would require substantial computational resources to train the model efficiently and effectively, as well as to support the high-performance demands of the gaming environment.
- Training complexity. The complexity of the training process, including the model's architecture and the optimisation algorithms used, can also impact the training time. For example, a custom LLM for the art sector would require a complex training process that incorporates elements such as creativity, originality, and aesthetics, as well as technical aspects such as style, colour, and composition.
- Practical considerations. In practice, training a state-of-the-art LLM can take several months, even with substantial computational resources. For instance, a custom LLM for the social media sector would require practical considerations such as the trade-off between speed and quality, the availability and accessibility of the data, and the scalability and maintainability of the model.
Remember, while training time is an important factor, it's also crucial to consider the quality of the trained model. Faster training doesn't necessarily result in a better model, nor does slower training yield a superior model. The ultimate goal should be to balance training time with model performance.
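To illustrate where figures like the 288 years above come from, here is a rough estimate using the widely cited approximation that training compute is about six floating-point operations per parameter per training token. The training-token count and the sustained GPU throughput are assumptions, so treat the output as an order-of-magnitude check only.

```python
# Back-of-envelope training time via "compute ≈ 6 × parameters × tokens".
params = 175e9            # GPT-3 parameter count
tokens = 300e9            # approximate training tokens for GPT-3 (assumed)
total_flops = 6 * params * tokens          # ≈ 3.15e23 FLOPs

sustained = 35e12         # assumed sustained throughput of one V100, FLOP/s
seconds = total_flops / sustained
years = seconds / (365 * 24 * 3600)
print(f"{years:.0f} years on a single GPU")  # ≈ 285 years, close to the cited 288
```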
For instance, a financial institution might consider creating a custom LLM to automate customer service interactions. However, they must weigh the benefits of a custom model (such as potentially better performance and greater control over the training data) against the costs (including data collection and annotation, computational resources, and training time). They might find that fine-tuning a pre-existing LLM with their customer interaction data is a more cost-effective solution that still delivers high-quality results.
Advantages and disadvantages of custom LLM and vendor models
There are two main options for integrating LLMs into business workflows: using a vendor's LLM or creating a custom LLM. Each option has its pros and cons, which are summarised below:
Some advantages of using vendor LLMs include:
- Scalability. An organisation can leverage the cloud-based services of the vendor to train and deploy LLMs, without worrying about computing resources and data storage. For instance, Google Cloud can provide a scalable and reliable infrastructure for LLMs, such as Cloud TPUs and Cloud Storage.
- Cost efficiency. If one doesn't have access to high-end hardware, using the cloud can be a more economical solution. For example, Amazon Web Services offers pay-as-you-go pricing for LLM workloads through services such as Amazon Bedrock and Amazon SageMaker.
- Ease of use. Vendors provide pre-trained models that are ready to use or fine-tune, which can save time and resources. For example, OpenAI can provide access to pre-trained models like GPT-3.5 and GPT-4, which can be used or fine-tuned for various tasks.
- Managed services. Vendors handle the setup, maintenance, security, and optimisation of the infrastructure, reducing the operational overhead. For instance, Microsoft Azure can provide managed services for LLMs, such as Azure Machine Learning and Azure Cognitive Services.
- Continual updates. Vendors typically provide regular updates to their models, ensuring one benefits from the latest advancements. For example, Meta provides continual updates to its models, such as LLaMA 2, which incorporate the latest research and innovations.
- Support. Vendors often provide support and resources to help one get the most out of their models. For instance, IBM can provide support and resources for LLMs, such as IBM Watson and IBM Cloud Pak for Data.
The disadvantages include:
- Lack of control. The organisation has less control over a vendor's model, including how it's trained and what data it's trained on. A vendor's model might not be aligned with their business goals, values, or ethics, or might not reflect their domain-specific knowledge or terminology.
- Potential for vendor lock-in. Switching vendors can be difficult and costly, especially if the organisation relies on the vendor's proprietary models or services. For example, a vendor might change their pricing, policies, or features, or might discontinue their models or services, which can affect business continuity or performance.
- Cost. There can be ongoing costs associated with licensing a vendor's model, which can vary depending on the usage, features, or quality of the model. For example, a vendor might charge based on the number of requests, the amount of data, the level of accuracy, or the complexity of the task.
Some advantages of using custom LLMs include:
- Customisation. The organisation can tailor the model to their specific needs, which can lead to better performance for specific tasks. For instance, one can train the model on their data, which can capture their domain-specific knowledge, terminology, and preferences.
- Transparency and flexibility. Open source LLMs provide transparency and flexibility, allowing full control over the data and the model. For example, one can run openly licensed models using toolkits such as Hugging Face Transformers or TensorFlow Text, which allow one to modify, extend, or improve the model as one wishes.
- Cost savings. While the initial investment can be high, owning the model can be less expensive in the long run, as there are no ongoing licensing fees. The organisation avoids paying for a vendor's model or services and pays only for infrastructure, costs that can be reduced by using efficient hardware or software.
- Added features and community contributions. One can add features to the LLM that benefit a specific use case and take advantage of community contributions. For example, an organisation can add capabilities such as sentiment analysis, summarisation, or translation to the LLM, and use community-contributed models or datasets from platforms like GitHub or Kaggle.
- Data security. One can ensure the security of their data, which is particularly important if the model is trained on sensitive or proprietary information. For example, they can encrypt, anonymise, or obfuscate their data, and use secure protocols and platforms to store and access the data.
Some disadvantages include:
- Resource intensive. Training LLMs requires significant computational resources and expertise, which can be challenging to acquire and maintain. For instance, one might need to invest in high-end hardware, such as GPUs or TPUs, or hire skilled professionals, such as data scientists or machine learning engineers, to train LLMs.
- Maintenance. The organisation is responsible for maintaining and updating the model, which can be time-consuming and complex. For example, they might need to monitor, debug, or retrain the model, or keep up with the latest research and developments in the field of LLMs.
- Time-consuming. The process of training and fine-tuning an LLM can be time-consuming, depending on the size and complexity of the model and the data. For example, it might take several weeks or months to train a state-of-the-art LLM like GPT-4, whose scale OpenAI has not publicly disclosed.
The choice between creating a custom LLM and using a vendor model depends on an organisation's specific needs, resources, and expertise. It's important to carefully consider these factors before making a decision.
Challenges in creating a custom LLM
Organisations may encounter several challenges when creating their custom LLMs:
- Lack of expertise. Developing, training, and maintaining LLMs requires specialised skills in machine learning, natural language processing, and data science, which organisations may lack. As a result, they may need to hire external consultants or train existing staff to acquire the necessary expertise.
- Resource intensive. Training LLMs requires significant computational resources, which companies may not possess. Additionally, maintaining and updating these models requires ongoing investment. They may need to purchase or rent high-end hardware, such as GPUs or TPUs, or use cloud-based services, which can incur high costs.
- Data privacy and security. Handling sensitive data for training LLMs could pose data privacy and security risks, which organisations may not be prepared for. For example, a company may need to implement data protection measures, such as encryption, anonymisation, or obfuscation (a toy anonymisation sketch follows this list), or comply with data regulations, such as GDPR or CCPA.
- Time-consuming. The process of training and fine-tuning an LLM can be time-consuming, which many companies may not be able to afford. Therefore, they may need to allocate a large amount of time and resources, which could divert them from their core business operations.
- Language limitations. It has been difficult to develop AI systems in languages other than English due to the resource gap, which could be a barrier for companies operating in non-English-speaking regions. For example, a multinational organisation may need to source or create data in other languages or use multilingual models, which can be challenging or costly.
- Generalisation issues. Generalised AI models are trained on vast and diverse datasets, allowing them to handle a wide array of tasks reasonably well. However, they may not perform as well on specific, complex enterprise operations.
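As a small illustration of the anonymisation measures mentioned above, the sketch below scrubs a few common PII patterns with regular expressions. The patterns and labels are assumptions for demonstration only; a production system would pair this with a trained named-entity recogniser and a proper compliance review.

```python
import re

# Toy PII scrubber: the regexes are illustrative and deliberately simple.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def anonymise(text: str) -> str:
    """Replace each matched PII pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(anonymise("Contact Jane at jane.doe@example.com or +61 400 123 456."))
# -> "Contact Jane at [EMAIL] or [PHONE]."
```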
Therefore, for many organisations, it may be more practical and cost-effective to use vendor-provided LLMs, which are ready-to-use, regularly updated, and come with support.
LLMs for information retrieval
For many organisations, the first LLM use cases involve surfacing organisation-specific content through a natural-language interface or a chatbot. LLMs can indeed be used as an information source, but there are some important factors to keep in mind:
- Accuracy. LLMs are trained on vast amounts of data, but they cannot verify the accuracy or timeliness of the information they generate. They can sometimes produce incorrect or outdated information about a product, service, or policy, which could mislead or confuse customers or employees.
- Context. LLMs generate text based on patterns they've learned from their training data. They do not understand context in the same way humans do. This means they might not fully grasp the nuances of certain topics or questions. Therefore, an LLM might generate irrelevant or inappropriate information for a specific query, which could frustrate or offend users or stakeholders.
- Bias. LLMs can unintentionally propagate biases present in their training data. This can lead to biased or unfair information that reflects stereotypes, prejudices, or discrimination, which could harm the reputation or values of the organisation.
- Lack of common sense. Despite their impressive capabilities, LLMs often lack common-sense reasoning. They might generate outputs that, while grammatically correct, are nonsensical or contradict facts, common knowledge, or common sense, which could undermine the credibility or trustworthiness of the organisation.
- Data privacy and security. If an LLM is trained on sensitive or proprietary data, using it as an information source could potentially expose this data. This could violate data privacy and security regulations or policies, or cause legal or ethical issues.
- Speed of change. Organisational data, policies, and products can change faster than an LLM can be retrained or fine-tuned. Hence it is always a good idea to keep the source of truth separate from the LLM. Otherwise, an LLM might generate outdated or inconsistent information that does not reflect the current state of the organisation, which could cause confusion or errors.
While LLMs can be a valuable tool for generating text and providing information, they should not be the sole source of information. It's important to cross-verify the information from other reliable sources, such as databases, documents, or experts.
LLMs can be effectively used for enterprise information retrieval in several ways:
- Leveraging LLM APIs. The first way to use LLMs in an enterprise context is to make an API call to a model provided as a service (see the API sketch after this list). This approach has several advantages, including a low barrier to entry, access to more sophisticated models, and fast responses. However, it may also be inappropriate for certain enterprise applications due to data residency and privacy concerns, potentially higher costs, and dependency on the service provider. For example, an organisation might use an LLM API to generate text for marketing campaigns, customer service, or internal communications, but it would also need to consider the data sovereignty, security, and cost implications of using a third-party service. Moreover, the general drawbacks of vendor models discussed earlier still apply.
- Running an open-source model in a managed environment. The second option is downloading and running an open-source model in an environment that the organisation manages (see the local-model sketch after this list). This gives the organisation full control over the data and the model, ensuring data privacy and security. For instance, an organisation might run an open-source model in their own cloud or on-premises infrastructure, which allows them to customise, modify, or improve the model as they wish, and to secure their data from unauthorised access or disclosure. The challenges of building and operating custom models, however, still apply to this approach.
- Retrieval-Augmented Generation (RAG). Leverage external knowledge sources to enhance responses and retrieve specific information from organisational databases (a minimal retrieval sketch follows this list). For example, the model can access pertinent documents within a database and use this information to formulate responses. This approach is the most favoured method for extracting precise information from the organisational knowledge base and articulating it in natural language. Bing.ai uses a comparable strategy: it first interprets the user's query, then searches the web for relevant content, and constructs its answer exclusively from the pages it identifies.
- Connect LLMs to external data. Cross-reference responses with trusted external databases to enhance answer verification (see the verification sketch after this list). For example, an organisation might connect an LLM to external data sources, such as Wikipedia, news articles, or databases, which can enrich the information generated by the LLM and improve its quality and relevance.
- Pairing LLMs with high-performance databases. LLMs can be paired with highly scalable, high-performance databases on the back end: the database takes queries and analytics code generated by the LLM, uses them to scan millions or billions of records, and translates the data into insights (see the SQL-generation sketch after this list). For instance, an organisation might pair an LLM with a high-performance database, such as Snowflake, which can handle large-scale data analysis and provide fast and accurate insights for business intelligence, decision making, or reporting.
- Fine-tuning for specific tasks. LLMs can be fine-tuned on specific tasks or domains such as legal, medical, or financial to improve their performance (a fine-tuning sketch follows this list). This requires a larger volume of data and more computational resources. However, many of these tasks are common in nature and can be procured from a vendor, e.g., GitHub Copilot to assist in writing software code or Pega GenAI to create workflows faster.
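To ground the first option, here is a minimal sketch of calling a vendor-hosted model, using OpenAI's v1 Python SDK as one example; the model name, prompts, and the policy question are placeholders, and the same request-response pattern applies to other providers' APIs.

```python
from openai import OpenAI  # pip install openai (v1+ SDK assumed)

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Illustrative request: model name and messages are placeholders.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise enterprise assistant."},
        {"role": "user", "content": "Summarise our leave policy in two sentences."},
    ],
)
print(response.choices[0].message.content)
```

Note that without grounding in organisational documents, the answer above would be invented by the model, which is exactly the accuracy risk discussed earlier.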
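For the second option, this is a minimal sketch of running an openly available model locally with the Hugging Face Transformers library. The model name is illustrative; any open text-generation model from the Hugging Face Hub could be substituted, and the organisation controls where the weights and data live.

```python
from transformers import pipeline  # pip install transformers torch

# "gpt2" is a small illustrative choice; larger open models plug in the same way.
generator = pipeline("text-generation", model="gpt2")

result = generator("Our travel policy allows", max_new_tokens=40)
print(result[0]["generated_text"])
```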
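Next, a minimal RAG sketch, assuming a tiny in-memory document store and TF-IDF retrieval to stay dependency-light. Production systems typically use dense embeddings and a vector database, and would send the final prompt to an LLM rather than printing it; the three policy documents are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Annual leave accrues at 20 days per year for full-time staff.",
    "Expense claims must be lodged within 30 days of purchase.",
    "Remote work requires manager approval and a home-safety checklist.",
]

vectoriser = TfidfVectorizer().fit(documents)
doc_vectors = vectoriser.transform(documents)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    scores = cosine_similarity(vectoriser.transform([query]), doc_vectors)[0]
    ranked = scores.argsort()[::-1][:k]
    return [documents[i] for i in ranked]

query = "How long do I have to submit expenses?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM of choice
```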
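For cross-referencing against external sources, this sketch fetches a reference summary from Wikipedia's public REST API and does a crude term-overlap check. The claim is a placeholder standing in for LLM output, and a real verifier would use an entailment model or human review rather than word overlap.

```python
import requests

def wikipedia_summary(title: str) -> str:
    """Fetch a plain-text page summary from Wikipedia's public REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    return requests.get(url, timeout=10).json().get("extract", "")

llm_claim = "Canberra is the capital of Australia."  # placeholder LLM output
reference = wikipedia_summary("Canberra")

# Crude overlap check; real systems would use an entailment model instead.
overlap = set(llm_claim.lower().split()) & set(reference.lower().split())
print(f"shared terms: {sorted(overlap)}")
```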
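To illustrate database pairing, the sketch below hard-codes the "generated" SQL inside a hypothetical llm_generate_sql helper, since the LLM call itself is out of scope, and an in-memory SQLite table stands in for a warehouse such as Snowflake. Generated SQL should always be validated and sandboxed before execution.

```python
import sqlite3

# Hypothetical helper: in a real system an LLM call would translate the
# user's question into SQL; here the "generated" query is hard-coded.
def llm_generate_sql(question: str) -> str:
    return "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total DESC"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("APAC", 120.0), ("EMEA", 90.5), ("APAC", 60.0)])

sql = llm_generate_sql("Which region has the highest sales?")
for row in conn.execute(sql):   # validate/sandbox generated SQL in practice
    print(row)                  # ('APAC', 180.0) then ('EMEA', 90.5)
```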
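Finally, a hedged fine-tuning sketch using the Hugging Face Trainer API, assuming the transformers and datasets packages are installed. The base model, the two-example legal corpus, and the hyperparameters are all placeholders; a real fine-tune needs thousands of examples and careful evaluation.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tiny illustrative corpus; a real fine-tune needs far more data.
texts = ["Clause 4.2: The lessee shall maintain the premises.",
         "Clause 7.1: Termination requires 30 days written notice."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```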
Remember, the best way to use LLMs for enterprise information retrieval depends on the specific needs and resources of the organisation. It's important to carefully consider these factors before making a decision.
Concluding remarks
Creating a custom LLM demands a substantial investment in high-quality, domain-specific training data, along with significant computational resources and expertise in machine learning and natural language processing. The decision to pursue a custom LLM should be made meticulously, considering the organisation's unique requirements, available resources, and potential return on investment. Amidst the exciting era of AI in business, LLMs present a significant opportunity for innovation and efficiency enhancement.
Alternatively, leveraging pre-trained LLMs and fine-tuning them for specific tasks can prove more efficient and cost-effective. For instance, healthcare and law firms can automate tasks like patient communication and legal document drafting. Yet, it's crucial to augment this approach with validation methods like RAG or external sources to ensure accurate responses. Another strategy involves using LLMs solely for understanding user queries and formulating responses, while obtaining information from the organisation's trusted source of truth, eliminating the need to fine-tune models.