
Navigating Model Selection for NLP Tasks

Considerations for decision making and nlprw_toolkit

Abstract

For many natural language processing (NLP) tasks today, several AI tool options exist. Which tool to choose for a given task depends on many factors, such as previous success of different tools on similar data, latency requirements, and the experimentation and research needed to actually test different options. Sometimes a single NLP task within an application may still require multiple tools. For example, named entity recognition, where multiple entities of different types need to be extracted, may be accomplished using several methods or tools rather than one tool for all entity types. This paper walks through decision making on open-source tools for popular NLP tasks like named entity recognition, sentiment analysis, text similarity, and more. The paper also introduces nlprw_toolkit [1], which simplifies the process of selecting the most suitable tool or set of tools for a given task. It achieves this by integrating the various tools needed to complete complex tasks when a single tool does not suffice. Moreover, it assists in selecting the appropriate tool by considering factors such as the language style of the data and the desired outcome. The toolkit also helps you run multiple methods for a task so you can compare their outcomes and make an informed tool choice.

Keywords: NLP, Natural Language Processing, Language data

Introduction

NLP, or Natural Language Processing (Jones, 1994), is a field of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. Natural language includes any language humans use to communicate with one another, including audio/speech (Singh, 2019) and text. More commonly, NLP refers to work with text data. Audio data processing and analysis overlaps more with the domain of digital signal processing and involves different processing techniques (Singh, 2022) compared to text.

NLP encompasses a wide range of tasks aimed at enabling computers to understand, interpret, and generate human language in a manner that is both meaningful and contextually relevant. Some common tasks in NLP include:

  • Text Classification: Assigning categories or labels to text documents based on their content, such as spam detection, sentiment analysis, or topic categorization.

  • Named Entity Recognition (NER): Identifying and categorizing named entities (such as names of people, organizations, locations, etc.) within text documents.

  • Sentiment Analysis: Determining the sentiment expressed in a piece of text, such as whether it is positive, negative, or neutral.

  • Text Summarization: Generating concise summaries of longer pieces of text, preserving the most important information.

Other examples of NLP tasks include question answering on given text, language translation, language generation, and more (Meyer, 2021; Wolff, 2020). These tasks and many others in NLP play a crucial role in various applications, including search engines, virtual assistants, social media analysis (Singh, 2021), language translation, and more.

In today’s landscape, there’s a plethora of tool options available for various tasks. Selecting the most suitable tool hinges on numerous factors, including past successes with similar data types, latency considerations, and extensive experimentation and research to evaluate different options.

In certain scenarios, a single NLP task may necessitate the use of multiple tools. Take named entity recognition, for instance, where extracting diverse entities might require employing various methods or tools, rather than relying solely on one tool for all entity types. Named entity recognition, also known as entity extraction, classifies named entities present in a text into pre-defined categories like “individuals”, “companies”, “places”, “organizations”, “cities”, “dates”, and “product terminologies”. It adds a wealth of semantic knowledge to your content and helps you promptly understand the subject of any given text (Banerjee, 2018; Direct, n.d.). Several tools such as SpaCy (Honnibal et al., 2020), NLTK (Loper & Bird, 2002), Stanford NER (Finkel et al., 2005), and others can be used for this purpose.

Additionally, when faced with the task of selecting from numerous tool options for a particular task, code often needs to be written for each candidate tool under consideration. For example, in a text similarity based recommendation system, which embedding model should one choose, given that hundreds of such models are available today? One will likely pick a couple of models and evaluate them based solely on a general understanding of the type of data each model is expected to perform better with. This knowledge also comes with experience, making the process more cumbersome for individuals less familiar with, or new to, NLP and its vast world of model and tool options.

The toolkit nlprw_toolkit [2] is an attempt to help with this problem. It draws learnings from the book Natural Language Processing in the Real-World, published by CRC Press/Taylor and Francis (Singh, 2023), which contains real-world NLP applications across 15+ industry verticals and solutions for the most prominent NLP use cases, and from the accompanying code NLP-in-the-Real-World [3], which serves as a reference guide for executing a large number of NLP applications, including end-to-end implementations of real-world use cases, all using open-source tools.

Methods and Results

Tool selection

NLP has gained a lot of popularity over the last decade. This has been accompanied by several useful open-source tools and models that can be leveraged for a variety of tasks. Several paid services from cloud providers have been launched in this space as well, creating a plethora of options for businesses to easily plug their data into models without putting in the work to create them from scratch. There still exist many use cases requiring custom model building, but the availability of so many good options has led to a standard workflow for approaching solution building in NLP. First, the data science / machine learning developer gauges the problem statement, understands the available data, and defines evaluation based on business goals. Then, when it comes to building the solution, the developer first explores existing tools and models. If existing tools have gaps, further work can be done to fill those gaps. If the existing solutions don’t work, training a custom model is the next step.

Data scientists spend at least 20% of their time on model selection and training [4]. This time grows further when there is a new or unfamiliar problem to solve. The problem may have existed before, but it is the developer’s first time trying to build a solution for it that fits their data. However, with so many available tools to choose from, how can the choice be made? The developer needs to read up on available options, find patterns similar to what might work on their data, and then proceed with trial-and-error experimentation. The process of tool selection has a gap in the market today and can be made quicker. Usually, as a developer gains experience, this process gets easier and less time-consuming, so the problem remains bigger for early-career professionals. To address it, this section explores some common NLP tasks and how you can shortlist tools based on your data. The nlprw_toolkit integrates this knowledge and offers a way to make this assessment using the toolkit directly. This section sheds more light on decision making while selecting tools and shares code samples of doing so using the toolkit.

Based on desired task

Let’s consider named entity recognition (NER) for example. NER is a very popular task, finding applications across industry verticals as well as research and academia projects. Some popular open-source tools for this task include SpaCy and NLTK. If you want to extract email IDs from text, tools like SpaCy’s NER are not very helpful, and a regular expression will be more useful instead. This is because entities like email IDs and phone numbers follow predictable patterns, which makes pattern matching techniques a more viable option. It is also computationally less expensive to implement pattern matching using regex. An example regex for emails:


import re

# email-like pattern; a simple heuristic, not a full RFC-compliant matcher
for match in re.finditer(r"\S+@\S+\.\S+", text):
	print(match.group())

But if you want to extract dates, then NER tools like SpaCy, NLTK, or other transformer-based NER models will likely yield better results. SpaCy comes with several models of different sizes that you can choose from for NER.


import spacy

nlp = spacy.load("en_core_web_sm") # small model trained on web-based data
doc = nlp(text)
for word in doc.ents:
	print(word.text, word.label_)

Furthermore, for cleaner data, you can opt for smaller SpaCy models for NER and may not see much lift in response accuracy with larger models. For more complex and noisy data, larger models, including transformer models, may do better. Note that larger models will also lead to higher latency, so keep this trade-off in mind while making the choice.
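As a rough illustration of that trade-off, the sketch below times a small pipeline against a transformer pipeline on the same text. It assumes both models (en_core_web_sm and en_core_web_trf, the latter requiring the spacy-transformers extra) have already been downloaded; exact timings will vary by hardware.


import time

import spacy

text = "Apple is opening a new office in San Francisco on March 3rd, 2025."

# compare NER output and per-document latency for a small vs. a transformer pipeline
for model_name in ["en_core_web_sm", "en_core_web_trf"]:
	nlp = spacy.load(model_name)
	start = time.perf_counter()
	doc = nlp(text)
	elapsed = time.perf_counter() - start
	print(model_name, [(ent.text, ent.label_) for ent in doc.ents], f"{elapsed:.3f}s")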

If you want to extract an entity not offered by any of these existing models, then you may need to train your own model, which you can do using SpaCy itself, building on top of its existing models. Code [5] shows an example of building a custom NER model using SpaCy, and [6] shows code to build a custom NER model using transformer-based LLMs.
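For orientation, here is a minimal, hypothetical sketch of what fine-tuning SpaCy’s NER on a new label could look like; the "GADGET" label and the two training sentences are made up for illustration, and real training would need far more annotated examples (see [5] and [6] for complete versions).


import random

import spacy
from spacy.training import Example

# tiny made-up training set for a hypothetical "GADGET" entity
TRAIN_DATA = [
	("I just bought a new smartwatch", {"entities": [(20, 30, "GADGET")]}),
	("The smartwatch tracks my sleep", {"entities": [(4, 14, "GADGET")]}),
]

nlp = spacy.load("en_core_web_sm")  # build on top of an existing pipeline
ner = nlp.get_pipe("ner")
ner.add_label("GADGET")

# update only the NER component, leaving the rest of the pipeline untouched
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.select_pipes(disable=other_pipes):
	optimizer = nlp.resume_training()
	for _ in range(20):
		random.shuffle(TRAIN_DATA)
		for text, annotations in TRAIN_DATA:
			example = Example.from_dict(nlp.make_doc(text), annotations)
			nlp.update([example], sgd=optimizer, drop=0.3)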

Thus, based on the desired task within the same NLP application, the tool choice can vary.

Using the nlprw_toolkit, you can specify details on the entities you are interested in, and the toolkit selects the models it needs behind the scenes and gives you all the extractions you want. Internally the toolkit may use multiple tools or a single tool, depending on the task, but the user experience remains singular. An example with SpaCy and regex is shared below.


from nlprw_toolkit import infoextractor

doc = "please write me at fejfow@iejf.com tomorrow about MOM Mission statement by 12.30 pm."
entities = ['email', 'DATE', 'TIME']

extracted_entities, model_selection = infoextractor.run(doc, entities=entities)

print(extracted_entities)
# >> {'email': [('fejfow@iejf.com', 19, 34)], 'DATE': [('tomorrow', (5, 6))], 'TIME': [('12.30 pm', (11, 13))]}

print(model_selection)
# >> {'regex': ['email'], 'spacy': ['DATE', 'TIME']}

Based on type of data (quality/source)

To exemplify, let’s consider the task of sentiment analysis. Sentiment analysis is a very popular task and is very important for several business applications across e-commerce, social media, finance, and other verticals. A popular use case is identifying sentiment from customer reviews about a product or service. This helps inform businesses on how well something is doing and impacts their action strategies.

There are many open-source pre-trained models that can be used for a majority of data types for sentiment analysis. The different models are trained using various sources of data. Choosing a model that is likely to do well on your type of data gives you the best chance of desirable results. For instance, VADER (Hutto & Gilbert, 2014) may be preferable if your data contains informal language, whereas TextBlob (Loria, 2018) may be better if the language in your text is more formal. VADER is trained on a variety of data. This data includes customer reviews, but also informal language data that is likely to contain typos and emoticons, very similar to the kind of language you see and expect on social media (e.g., :-), LOL, nah, meh). Several studies (ES, 2023) have also reported VADER doing better than TextBlob on informal language, where TextBlob does not appear to understand informal language nuances as well and does better on text with more structure.

If choosing between TextBlob and VADER, based on the aggregated knowledge from several studies, VADER will likely be a better choice if you are dealing with informal language, data with many terms that can’t be found in a dictionary, or data from sources like social media. TextBlob is likely to be a better choice if you have more formal and reasonably structured language, fewer terms that can’t be found in a dictionary, or language found in review comments like hotel reviews. VADER may also do well for review comments from sources like movie reviews and social media reviews.
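As a quick illustration of that difference, the sketch below runs both libraries on one formal and one informal sample (the samples are made up); both scores range roughly from -1 (negative) to +1 (positive). On samples like the informal one, VADER tends to pick up on the emoticon and slang, while TextBlob tends to score closer to neutral.


from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

formal = "The hotel room was clean and the service was excellent."
informal = "room ws soooo gud lol :-)"

analyzer = SentimentIntensityAnalyzer()
for text in [formal, informal]:
	vader_score = analyzer.polarity_scores(text)["compound"]  # VADER compound score
	textblob_score = TextBlob(text).sentiment.polarity  # TextBlob polarity
	print(repr(text), "vader:", vader_score, "textblob:", textblob_score)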

Overall, many factors play into this choice, including the following.

STYLE = "formal", "informal", "mixed"
TYPOS = "many_nondict_terms", "some_nondict_terms", "mostly_clean"
SOURCE = "social_media", "review_comments", "articles"

Using the toolkit, you can specify the kind of language contained in your data. Based on the information provided, the nlprw_toolkit makes a recommendation for the model and returns the computed sentiment using the recommended model.


from nlprw_toolkit import sentiment

# example 1
sentences = ["i love you", "you dislike me or what?", "you hate ice cream", ""]
sentiments, model_choice = sentiment.get_sentiment(
	sentences,
	style={"mixed": True},
	typos={"mostly_clean": True},
	source={"review_comments": True}
)
print(sentiments)
# >> ['positive', 'neutral', 'negative', 'negative']

print(model_choice)
# >> 'textblob'

# example 2
sentences = ["Show was really good it ws soooo fun."]
sentiments, model_choice = sentiment.get_sentiment(
	sentences,
	style={"informal": True},
	typos={"many_nondict_terms": True},
	source={"social_media": True, "review_comments": True}
)
print(sentiments)
# >> ['positive']

print(model_choice)
# >> 'vader'

If you have labeled data available for a task, but not in big enough quantities to train a model, you can leverage it to test the performance of existing tools and find which one may be the better choice for your data. This evaluation helps shortlist tools with stronger corroboration. For example, there are multiple sentiment analysis libraries with pre-trained models that work well for many data sources and types. Passing in the candidate libraries you want to test along with your labeled test data enables you to compute and compare evaluations. If you don’t know which tools may be options for such a task, you can let the toolkit run through its default model options and recommend a model for you.
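A minimal sketch of that idea, using the underlying libraries directly rather than the toolkit, might look like the following; the three labeled comments and the 0.05 polarity threshold are assumptions for illustration.


from sklearn.metrics import accuracy_score
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# small made-up labeled test set; replace with your own data
texts = ["great stay, lovely staff", "worst room ever, avoid", "it was ok i guess"]
labels = ["positive", "negative", "neutral"]

analyzer = SentimentIntensityAnalyzer()

def to_label(score, threshold=0.05):
	# map a polarity score in [-1, 1] to a discrete label
	if score > threshold:
		return "positive"
	if score < -threshold:
		return "negative"
	return "neutral"

candidates = {
	"vader": lambda t: to_label(analyzer.polarity_scores(t)["compound"]),
	"textblob": lambda t: to_label(TextBlob(t).sentiment.polarity),
}

# compare each candidate tool's accuracy on the labeled test set
for name, predict in candidates.items():
	preds = [predict(t) for t in texts]
	print(name, "accuracy:", accuracy_score(labels, preds))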

Full project examples

Recommendation system

The core of text-based recommendation systems is finding similarity between a reference text and a corpus of text documents. Documents from the corpus which exhibit the highest similarity are likely good candidates to show as recommendations. The key is finding numerical representations of the text and computing similarity metrics.

Let’s look at an example of accessing and evaluating multiple tools via a single interface, based on the desired task components and the type of data (quality/source).

Let’s take text similarity for instance. There is a corpus of text samples and one piece of text for which one wants to find the most similar samples in the corpus. There are many embedding models that can be used to compute numeric representations of text, followed by cosine similarity to measure semantic similarity between two pieces of text. You can also use methods other than pre-trained embedding models, such as training your own embedding model.

Consider a scenario where you possess a corpus containing lengthy sentences. Traditional numerical representations like one-hot encoding or TF-IDF result in sparse vectors, where many elements are zeros. An alternative approach involves using dense representations, such as word embeddings. Word embeddings offer a method to represent words numerically within a corpus. This results in a vector for each term in the corpus, where each vector is of uniform size, typically much smaller than TF-IDF or one-hot encoded vectors. Examples of word embeddings include word2vec [7] (tools: gensim, SpaCy), fastText [8], Doc2Vec [9], GloVe embeddings [10], ELMo [11], the universal sentence encoder (Cer et al., 2018), transformers [12], and further ever-growing transformer-based models/LLMs. Here is a curated list of pros and cons of these embedding models, providing more context (Singh, 2023).

  • The main disadvantage of Word2Vec is that you will not have a vector representing a word that does not exist in the corpus. For instance, if you trained the model on biological articles only, then that model will not be able to return vectors of unseen words, such as "curtain" or "cement".

  • The advantage of fastText over Word2Vec is that you can get a word representation for words not in the training data/vocabulary with fastText. Since fastText uses character-level details of a word, it is able to compute vectors for unseen words containing characters it has seen before. One disadvantage of this method is that unrelated words containing similar characters may end up close in the vector space without being semantically close. For example, words like "love", "solve", and "glove" share many of the same letters "l", "o", "v", "e", and may all be close together in vector space. A short sketch contrasting this with Word2Vec's behavior follows after this list.

  • Doc2Vec is based on Word2Vec, except it is suitable for larger documents.

  • GloVe vectors treat each word as a single token, without considering that the same word can have multiple meanings. The word "bark" in "a tree bark" will have the same representation as in "a dog bark". Since GloVe is based on co-occurrence counts, which require every word in the corpus, GloVe vectors can be memory intensive depending on corpus size.

  • ELMo can handle words with different contexts used in different sentences, which GloVe is unable to. Thus the same word with multiple meanings can have different embeddings.

  • Transformer-based models are larger; however, they have more understanding of language in general.
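To make the out-of-vocabulary point above concrete, here is a small sketch using gensim; the toy biology-style corpus and the unseen word "curtain" mirror the Word2Vec example above, and results on such tiny data are only illustrative.


from gensim.models import FastText, Word2Vec

# toy corpus of "biological" sentences; real training needs far more text
sentences = [
	["the", "cell", "membrane", "surrounds", "the", "cytoplasm"],
	["proteins", "fold", "inside", "the", "cell"],
]

w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=10)
ft = FastText(sentences, vector_size=50, min_count=1, epochs=10)

print("curtain" in w2v.wv)  # False: Word2Vec has no vector for an unseen word
print(ft.wv["curtain"][:5])  # fastText composes a vector from character n-grams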

For the text similarity task, the representativeness of the text corpus matters a lot. If the text for which you want to find similar pieces in the corpus is not represented in the corpus, then you need to opt for models with a more general understanding, so pre-trained models would be useful. However, if your data is domain specific and/or representative of the text you want to pass through the model, then a model with generic understanding may not be advantageous and may actually hurt performance. In this case, training your own model is preferable, and starting with the simplest TF-IDF representation can be highly beneficial. TF-IDF is also computationally the least expensive and oftentimes suffices for the task at hand. Opting for other, more complex models should be done after establishing a baseline with the smaller and simpler models first.
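A simple TF-IDF baseline of this kind, fit directly on the domain corpus, might look like the following sketch; the toy corpus and query are assumptions for illustration.


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
	"python generators make iteration lazy",
	"ball pythons are popular pet snakes",
	"pandas dataframes simplify tabular data in python",
]
query = "how do python iterators and generators work"

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)  # learn the vocabulary from the corpus
query_vector = vectorizer.transform([query])  # reuse the same vocabulary for the query
scores = cosine_similarity(query_vector, corpus_vectors)[0]

# rank corpus documents by similarity to the query
for idx in scores.argsort()[::-1]:
	print(round(float(scores[idx]), 3), corpus[idx])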

For instance, let’s say you have a corpus that is specific to text related to ‘Python’, both the programming language as well as the snake.

The text you want to find similar documents for is

Spotted rattle skin in the field

Since your dataset is specific to Python, the corpus may not contain data representative of the sample you are analyzing. Thus, choosing a model with generic understanding is likely to yield better results.

Using a generic pre-trained embedding model from SpaCy, the top result returned for the above text is:

Ball python bite I decided to grab a snake out of its tank without asking, this was totally my fault. But now I can say I’ve been bit by a snake

Using a custom trained embedding model instead (TF-IDF based) yields the following top result (a consequence of the wrong tool choice).

Python Machine Learning Tutorial (Data Science) Python Machine Learning Tutorial - Learn how to predict the kind of music people like. Subscribe for more Python tutorials like …

By passing details about the data to the toolkit (in this case setting ‘sample_likely_represented_in_corpus’ to False), the toolkit makes this determination automatically, recommends the tool of choice, and returns text similarity scores using the recommended tool. Similarly, whether the data is domain specific and other details about the data influence the tool choice as well.


from nlprw_toolkit import rec_sys

corpus = [
	"Python Machine Learning Tutorial (Data Science) Python Machine Learning Tutorial - Learn how to predict the kind of music people like. Subscribe for more Python tutorials like ...",
	"Ball python bite I decided to grab a snake out of its tank without asking, this was totally my fault. But now I can say I've been bit by a snake"
]
sample = "spotted rattle skin in the field"

print("'sample_likely_represented_in_corpus': False")
recs = rec_sys.run_rec_system(
	corpus,
	[sample],
	top_n=2,
	data={'corpus_domain_specific': True, 'sample_likely_represented_in_corpus': False}
)
print(recs)
# >> [[("Ball python bite I decided to grab a snake out of its tank without asking, this was totally my fault. But now I can say I've been bit by a snake", 0.731253418677792), ('Python Machine Learning Tutorial (Data Science) Python Machine Learning Tutorial - Learn how to predict the kind of music people like. Subscribe for more Python tutorials like ...', 0.6951807783413815)]]

print("'sample_likely_represented_in_corpus': True (consequence of wrong tool choice)")
recs = rec_sys.run_rec_system(
	corpus,
	[sample],
	top_n=2,
	data={'corpus_domain_specific': True, 'sample_likely_represented_in_corpus': True}
)
print(recs)
# >> [[('Python Machine Learning Tutorial (Data Science) Python Machine Learning Tutorial - Learn how to predict the kind of music people like. Subscribe for more Python tutorials like ...', 0.17008208798133495), ("Ball python bite I decided to grab a snake out of its tank without asking, this was totally my fault. But now I can say I've been bit by a snake", 0.0)]]

Comment review analysis

Let’s say you have hotel review comments and want to understand them better: analyze sentiment and understand the complaints reported in the data. Let’s see how you can get started on this quickly using the toolkit. Once you run the review analysis with a list of comments, the following information prints to give you stats about the data, along with sentiment computed using the model recommended by the toolkit based on your description of the data (as described in the earlier section on type of data), as seen in Figure 1.

Total no. of reviews are 1837
Shortest review length: 10 chars
Longest review length: 19846 chars
Mean review length: 980.9229311433986 chars
Median review length: 793.0 chars
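Outside the toolkit, such descriptive stats can be reproduced with a few lines of standard Python; the two placeholder strings in `reviews` below stand in for your own list of review comments.


import statistics

reviews = ["Great stay!", "The room was dirty and the AC was broken on arrival."]  # your comments here
lengths = [len(r) for r in reviews]  # character count per review

print("Total no. of reviews are", len(reviews))
print("Shortest review length:", min(lengths), "chars")
print("Longest review length:", max(lengths), "chars")
print("Mean review length:", statistics.mean(lengths), "chars")
print("Median review length:", statistics.median(lengths), "chars")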

Figure 1: Sentiment breakdown in the corpus of user review comments.

Now, viewing each comment and its sentiment manually could be time consuming. The tool leverages popular visualizers like matplotlib (Hunter, 2007) and wordcloud (Oesper et al., 2011) and puts them together on top of your data to easily visualize the words that make up each sentiment. Top words, mainly nouns, found in positive sentiment comments can be seen in Figure 2. The bigger the word, the more commonly it was found.

Figure 2: Positive comments - word cloud.

Top words, mainly nouns, found in negative sentiment comments can be seen in Figure 3.

Figure 3: Negative comments - word cloud.

This gives a person trying to analyze this data an analytical understanding of the underlying content. For instance, people with more negative reviews talked more about night time, bed, staff, and desk. People with positive reviews spoke about time, staff, location, bathroom, etc. It shows that some people had a positive experience with the staff, while others may have had a negative experience. There are other visualization tools that may be preferred over a word cloud and can be leveraged for analytics as well.
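For reference, the kind of word cloud shown in Figures 2 and 3 can be produced directly with the wordcloud and matplotlib libraries; the two comments below are placeholders for one sentiment bucket of your data.


import matplotlib.pyplot as plt
from wordcloud import WordCloud

# comments from one sentiment bucket (placeholders)
positive_comments = [
	"The staff at the front desk were friendly and helpful.",
	"Great location, spotless bathroom, comfortable bed.",
]

# build and display the word cloud from the combined comment text
cloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(positive_comments))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()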

Next, let’s say you want to create a classification model on the negative comments to understand whether a negative comment was about a staff member or about the property itself. How can you do this without any labeled data? This isn’t the most intuitive stage for figuring out the next set of steps to follow, especially for early-career or new professionals. This is where no-labeled-data techniques like zero-shot classification (Yin et al., 2019) come in handy. Zero-shot text classification is a task in natural language processing where a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes (Face, n.d.). Here, you can use a pre-trained model with a general understanding of text and words, and determine which class/category a comment is closer to in semantic space. This can be a great starting point and can also act as a labeler for your data, which you can verify and then use to train a custom classification model. You can also try relatively smaller LLMs for this task. The example below uses a BERT-based model for zero-shot classification.


from transformers import pipeline

# define the list of categories you want your data classified into
categories = ["staff person", "hotel property"]
classifier = pipeline(
	"zero-shot-classification", model="typeform/distilbert-base-uncased-mnli"
)

# `sentence` is one review comment string from the negative comments
print(classifier(sentence, candidate_labels=categories))

{'sequence': "My experience at check-in counter was terrible. The staff was kind of rude and didn't want to help out much. No complaints otherwise.", 'labels': ['staff person', 'hotel property'], 'scores': [0.7497782707214355, 0.25022172927856445]}

{'sequence': "The carpet in the room was quite stained. I am surprised they didn't replace it given its condition.", 'labels': ['hotel property', 'staff person'], 'scores': [0.6811559796333313, 0.3188440203666687]}

{'sequence': 'I love everything. The front desk was helpful.', 'labels': ['staff person', 'hotel property'], 'scores': [0.6326051950454712, 0.3673948347568512]}

In the absence of labeled data, the toolkit also suggests an appropriate classification method, as above, based on your data and returns the results for you.

Limitations and Future work

There are several factors that play an important role in tool selection, and optimization of this process is an ongoing and evolving effort. This paper presents information to help with decision making for tool selection for popular NLP tasks. It also presents a toolkit that represents an early effort to simplify tool selection and facilitate the use of multiple tools through a single interface, incorporating practical logic for tool selection. Numerous updates can further enhance the toolkit’s offerings. Considering the dynamic nature of NLP in AI, continued functionality and software updates will aid in keeping pace with the rapid advancements in the field.

Conclusion

In this paper, considerations and decision making for popular NLP tasks are shared with examples. The goal is to make the tool choice process faster and easier for individuals, which is especially helpful for those new to the field. NLP is a growing field with a large increase in people trying to leverage this technology for many tasks. nlprw_toolkit is introduced as an early attempt to integrate this decision making into an open-source tool. Popular NLP tasks such as text classification, summarization, named entity recognition, and sentiment analysis can be done faster with informed tool choices using the toolkit. Full real-world use cases, including a recommendation system and customer review analysis, were presented as examples that can be built using informed tool choices.

Footnotes
References
  1. Jones, K. S. (1994). Natural Language Processing: A Historical Review. In Current Issues in Computational Linguistics: In Honour of Don Walker (pp. 3–16). Springer Netherlands. 10.1007/978-0-585-35958-8_1
  2. Singh, J. (2019). An introduction to audio processing and machine learning using Python. https://opensource.com/article/19/9/audio-processing-machine-learning-python
  3. Singh, J. (2022). pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling. Proceedings of the 21st Python in Science Conference, 152–158. 10.25080/majora-212e5952-017
  4. Meyer, P. (2021). Natural Language Processing Tasks. https://towardsdatascience.com/natural-language-processing-tasks-3278907702f3
  5. Wolff, R. (2020). 11 NLP Applications & Examples in Business. https://monkeylearn.com/blog/natural-language-processing-applications/