
Large Language Models in Advanced Research and Development

Writer: Mackenzie Robey

According to industry research from Forbes, knowledge workers (R&D professionals included) spend roughly 25% of their workweek navigating research tools, databases, and information systems just to locate data. That time spent on information retrieval drags on the efficiency and productivity of any organization. Decreasing the time required just to find information could significantly shorten product-to-market lifecycles... enter Large Language Models.


Large Language Models (LLMs), like GPT-4, Claude 2, and Llama 2 (surfaced through apps like ChatGPT, Perplexity, Bard/Gemini, and others), have garnered significant attention in the business world due to their potential to enhance and simplify various operations. LLMs are a type of AI model that produces text and code responses based on an input question (or query) and any follow-up questions or context from the user. These models are trained on massive datasets, analyzing text with diverse language patterns in a mathematical way so they can answer questions in a human-like form. It is also possible to train these models on internal datasets so they can provide users with pertinent internal information.


While it may seem they are creating something out of thin air, they are really just answering an extraordinarily difficult math problem based on statistical associations within the training data. In other words, they generate answers by analyzing the data they were provided. The common phrase "you are what you eat" could be applied to LLMs as "LLMs are what they analyze" - they can only produce quality information if they were trained on quality information.


LLMs are quickly gaining popularity in sales- and content-oriented organizations - here is a list of 40 companies that have implemented LLMs. However, their application in research and development contexts is just beginning.



The Current Landscape for LLMs in R&D


Fredrik Forsén, an industrial engineer, published a research paper in June 2024 titled "Large Language Models and Business Applications in R&D Environment". The paper seeks to answer the following questions:

  1. What are the current bottlenecks in office R&D work?

  2. How can generative AI tools and LLMs be used to resolve these bottlenecks?

  3. What LLM tools are currently available and what are their strengths, weaknesses, and limitations?


He surveyed the R&D department at Mirka Power Tools to answer these questions. Despite the study's limited number of participants, the responses shed light on the potential benefits of using customized LLMs in research environments, especially for improving information retrieval. Below are some graphs from the paper showing the survey responses. Please note the study was conducted in Finland, so there are some minor differences in the visuals' formats.





Tailoring LLMs for Specific Research and Development Applications


As shown in Forsén's research, general LLM tools are helpful in providing answers to generic questions. The current gap lies in their ability to answer questions specific to an organization's internal data sets and systems.


With the development of Retrieval Augmented Generation (RAG), it is now possible to tailor an LLM implementation to reference pertinent company information without having to retrain or fine-tune the whole model. It is a cost-effective approach to improving LLM output, ensuring relevance, accuracy, and usefulness in various contexts. The process involves creating external data, retrieving relevant information, augmenting the LLM prompt, and updating the external data to maintain accuracy, as illustrated below:




If you want to dive deeper, here is an awesome article from AWS that explains the benefits and drawbacks of RAG models.
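To make the retrieve-augment-generate loop concrete, here is a minimal sketch in Python. The document store, the relevance scoring, and the llm_generate() call are simplified, hypothetical stand-ins for a real vector database, embedding model, and LLM API.

```python
# Minimal RAG sketch: retrieve relevant internal documents, augment the
# prompt with them, and generate an answer. Everything here is a toy
# stand-in for a real vector database, embedding model, and LLM API.

internal_docs = [
    "Experiment 42: the new composite cured best at 180 C for 2 hours.",
    "Q3 report: supplier B's resin showed 12% higher tensile strength.",
    "Safety note: solvent X must be stored below 25 C.",
]

def score(doc: str, query: str) -> int:
    """Toy relevance score: count shared words. A real system would
    compare dense embeddings stored in a vector database."""
    return len(set(doc.lower().split()) & set(query.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most relevant to the query."""
    return sorted(internal_docs, key=lambda d: score(d, query), reverse=True)[:k]

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to whichever LLM your organization uses."""
    return f"[model response grounded in a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    # 1. Retrieve relevant internal information.
    context = "\n".join(retrieve(query))
    # 2. Augment the LLM prompt with that context.
    prompt = f"Use only the context below to answer.\nContext:\n{context}\n\nQuestion: {query}"
    # 3. Generate a grounded answer. The external documents can be refreshed
    #    on their own schedule, with no retraining of the model itself.
    return llm_generate(prompt)

print(answer("What cure temperature worked best for the composite?"))
```

Because the knowledge lives in the document store rather than in the model's weights, keeping answers current is mostly a matter of updating that store, which is the cost advantage over retraining or fine-tuning described above.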


Examples of advanced applications of LLMs in R&D:


  1. Computational Chemistry and Drug Discovery: LLMs can analyze molecular structures, predict protein-ligand interactions, and generate novel drug candidates. They can rapidly screen millions of compounds, identifying those with the highest potential for therapeutic efficacy and dramatically reducing the time and cost of early-stage drug discovery. A tool currently deployed in this space is Microsoft's BioGPT - a generative language model for the biomedical field, trained on millions of peer-reviewed research articles. There are fine-tuned versions optimized for specific areas of research, including relation extraction for chemical-disease interactions (BC5CDR), drug-drug interactions (DDI), and drug-target interactions (DTI), as well as document classification for cancer research (HoC), demonstrating the model's versatility in addressing complex biomedical information processing challenges. Here is the link to the GitHub repository for BioGPT, and a brief usage sketch appears after this list.

  2. Materials Science Innovation: By analyzing vast databases of material properties and scientific literature, LLMs can predict novel materials with desired characteristics without thousands of experiments having to be run. This capability is particularly valuable in developing advanced composites, semiconductors, and sustainable materials. Our team has developed and implemented machine learning models for materials science applications that let researchers achieve comparable results with a substantially reduced number of experiments, streamlining the experimental process and accelerating materials discovery and optimization.

  3. Process Optimization and Predictive Analytics: LLMs can analyze complex multivariate data, identifying subtle correlations and optimizing parameters for improved efficiency and quality. Mastercard trained an LLM to analyze transaction data in real time for fraud detection, processing 75 billion transactions per year across 45 million global locations. The system examines patterns in transaction data, using both transactional information and external data sources like anonymized customer information and geographical data, to detect fraud and reduce false declines by 50%. Here is an article if you're interested in learning more!

  4. Simulation and Modeling: LLMs can enhance the accuracy and speed of complex simulations in fields such as aerospace engineering, climate modeling, and particle physics. Researchers can explore more scenarios and achieve higher fidelity results by integrating machine learning with traditional simulation techniques. This research article introduces a new approach using Large Language Models (LLMs) in digital twins to automate the parametrization of process simulations. A digital twin is a virtual replica of a physical system, synchronized in real-time, which helps in monitoring and controlling processes. The study proposes a multi-agent system framework where LLM agents interact dynamically with simulation models to autonomously determine optimal parameters for achieving predefined objectives. By integrating LLMs, the system aims to enhance user-friendliness and reduce cognitive load in decision-making processes within industrial settings, showcasing its potential through practical demonstrations.
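As a concrete illustration of the first item above, here is a brief, hedged sketch of querying BioGPT through the Hugging Face transformers library; the prompt and generation settings are illustrative assumptions rather than a recommended workflow.

```python
# Generate biomedical text with BioGPT via the Hugging Face pipeline API.
# The model is downloaded from the Hub on first use; the prompt and
# generation settings here are purely illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/biogpt")

prompt = "The interaction between metformin and insulin"
result = generator(prompt, max_new_tokens=40, num_return_sequences=1)
print(result[0]["generated_text"])
```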

By leveraging these advanced capabilities, organizations can significantly enhance their R&D productivity, accelerate innovation cycles, and maintain a competitive edge in rapidly evolving technological landscapes. The integration of LLMs into R&D processes represents a paradigm shift in how research is conducted, promising to unlock new frontiers of scientific and technological advancement.


Key Considerations for Implementing LLMs


If you believe that implementing a tailored LLM may be a good fit for your organization, these four questions will help you determine what that implementation should look like.



How will your organization interact with the model?

Consider who will interact with the model, how many people that includes, and how much expertise they have in the underlying information. The key is to design an interface that allows the users to interact with the LLM in ways that complement their expertise. One example is an LLM for researchers that allows them to dramatically reduce the time spent on literature reviews and accelerate hypothesis generation. Another is using an LLM as a collaborative partner in experimental design, suggesting optimal parameters based on vast datasets of previous experiments across the organization. For cross-functional teams, an LLM could act as a knowledge bridge, translating specialized jargon and identifying synergies between different research streams that might otherwise go unnoticed. Finally, an LLM for your organization's leadership team could be used to analyze internal research reports, grant proposals, and patent applications to identify emerging trends within your organization, helping to make informed decisions about resource allocation and strategic research directions.


Where is your company data stored?

For the model to provide accurate responses, it needs to be supplied with accurate, domain-specific data. Large organizations, especially long-established ones, often have data stored in multiple antiquated locations and in different formats. Gathering as much relevant information as possible for the LLM will lead to the best results. Models will also need to be updated regularly, with a frequency that depends on the nature of the information. Ensuring your organization's data can be accessed easily enough to allow for quick retraining will help keep the model's accuracy and usage sustainable.


How will the data need to be preprocessed to provide it to the model?

Different LLMs require your organization's data to be preprocessed in different ways. Preprocessing can be a nightmare depending on how it is conducted, so it is paramount to understand what data types your model can analyze, what form your current data is in, and how it needs to be transformed to work with the desired model.
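As one small, hedged example of what preprocessing can involve, the sketch below normalizes raw text and splits it into overlapping chunks for an embedding-based retrieval setup; the chunk size, overlap, cleaning rule, and sample document are assumptions to adapt to your own data and model.

```python
# Illustrative preprocessing: normalize raw text and split it into
# overlapping character chunks sized for an embedding model.
import re

def clean(text: str) -> str:
    """Collapse whitespace and line breaks left behind by legacy exports."""
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split cleaned text into overlapping character windows."""
    text = clean(text)
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Hypothetical sample standing in for a document pulled from a legacy system.
document = (
    "R&D weekly summary.   Batch 17 of the polymer blend reached target "
    "viscosity after 45 minutes at 60 C.\nBatch 18 failed QC due to moisture."
)
for piece in chunk(document, size=80, overlap=20):
    print(piece)  # in practice: embed and index each chunk in a retrieval store
```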


Do you have the internal expertise to create and manage your LLM?

It is possible to train and host a model yourself or to engage a company that specializes in ML services to help you do so. If you have in-house expertise, there are thousands of pre-trained models to choose from, and the Hugging Face Model Hub, TensorFlow, and PyTorch are the go-to places to acquire one. With so many options, it is recommended to try out multiple models and select based on your computing resources, required accuracy, and input data format.
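As a rough illustration of trying out multiple models, the sketch below pulls a few pre-trained models from the Hugging Face Hub and runs the same prompt through each; the model names and the single test prompt are placeholders for your own candidate list and evaluation set.

```python
# Compare a few candidate pre-trained models on the same prompt.
# Model names and the prompt are placeholders, not recommendations.
from transformers import pipeline

candidates = ["distilgpt2", "gpt2", "microsoft/biogpt"]
test_prompt = "Summarize the main finding of the latest tensile strength tests:"

for name in candidates:
    generator = pipeline("text-generation", model=name)
    output = generator(test_prompt, max_new_tokens=40)[0]["generated_text"]
    print(f"--- {name} ---\n{output}\n")
```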



Drawbacks of Customizing LLM Integration


Resource Intensiveness: The development and training of LLMs require a substantial investment of time and resources. This can pose a challenge for organizations with limited processing capabilities or tight timelines, impacting their ability to deploy tailored LLMs promptly.


Domain Limitation: Tailored LLMs are less versatile outside their trained domain, which means they may not perform as effectively when faced with tasks beyond their specific focus. This limitation is evident in industries that span a broad range of subject matter, such as legal services and healthcare. Here is a fascinating article about legal LLMs' potential to generate false information, or "hallucinate".


Ethical Considerations: LLMs can absorb biases from their training data, leading to negative ethical implications. Organizations must carefully address this issue to avoid unintended repercussions, especially in sectors where unbiased decision-making is critical, including finance and healthcare. This article provides a deep dive into ethical considerations in healthcare, including the potential of LLMs to perpetuate hidden biases based on race, gender, and income that could lead to unequal patient outcomes.


Training Data Risks: Incomplete or inaccurate training data for tailored LLMs can pose significant risks for organizations, potentially leading to false market trend predictions or security issues. The server location of the LLM and controls over who has access to the training data also need to be monitored. Though every industry deals with matters of security, data accuracy and protection are especially paramount in retail, healthcare, and cybersecurity.


If you're interested in learning more, schedule a call with us and we would love to talk with you about the potential of LLMs within your organization.

