Author: Vitoria Lima
Goal: Play with text classification with LLMs and do a small write-up about its pros and cons.
Datasets:
| Dataset | Source | Categories | GitHub Code |
|---|---|---|---|
| Email Spam Dataset | Kaggle | Safe Email; Phishing Email | Jupyter Notebook |
| Fake News Classification Dataset | Kaggle | 1: Truthful News; 0: Fake News | Jupyter Notebook |
| Language Classification Dataset | Kaggle | English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek | Jupyter Notebook |
| Youtube Spam Dataset | Kaggle | 1: Spam Comment; 0: Not Spam Comment | Jupyter Notebook |
Disclaimer:
This blogpost reflects the state of the field as of the date it was drafted and written, i.e. November 12th, 2024.
This is an ongoing research topic, and this blogpost is an incomplete, but hopefully useful, snapshot of its open research questions.
This blogpost focuses on zero-shot and few-shot prompting, which are essentially fancy words for calling an API and sprinkling a few instruction sentences around your input, like these:
- https://hussainpoonawala.medium.com/text-classification-with-large-language-models-llms-a23c731a687e
- https://arxiv.org/html/2405.10523v1
- https://arxiv.org/html/2409.01466v1
- https://arxiv.org/pdf/2305.08377
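The zero-shot / few-shot distinction above boils down to whether you prepend labeled examples to the prompt before the input you want classified. A minimal sketch, with made-up labels and example comments:

```python
# Sketch of zero-shot vs. few-shot prompt construction.
# The label set and example comments below are illustrative, not from a real dataset.

LABELS = ["Spam Comment", "Not Spam Comment"]

def zero_shot_prompt(comment: str) -> str:
    # Zero-shot: only the task description and the input, no examples.
    return (
        f"Classify the following YouTube comment as one of {LABELS}.\n"
        f"Comment: {comment}\n"
        "Label:"
    )

def few_shot_prompt(comment: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot: prepend a handful of labeled demonstrations before the input.
    demos = "\n".join(f"Comment: {c}\nLabel: {l}" for c, l in examples)
    return (
        f"Classify the following YouTube comment as one of {LABELS}.\n"
        f"{demos}\n"
        f"Comment: {comment}\n"
        "Label:"
    )

examples = [
    ("Check out my channel!!! http://spam.example", "Spam Comment"),
    ("Great video, thanks for the clear explanation.", "Not Spam Comment"),
]
print(few_shot_prompt("Subscribe 4 free giftcards!!", examples))
```

Either prompt is then sent to the model, and the text after "Label:" is taken as the prediction.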
In a future blogpost, I will focus on how to leverage fine-tuning, embeddings, and LLM ensembling to enhance classification with LLMs, i.e. actually interesting work involving model weights and embeddings.
Don’t forget to spend time reading what the community thinks as well; sometimes it’s quite hilarious, as here (https://www.reddit.com/r/MachineLearning/comments/1c4a7sa/d_are_traditional_nlp_tasks_such_as_text/): “Can you kill a mosquito with a bazooka? Yes. Is it the most efficient tool to do so? No.”
Enough fluff, enjoy this little blogpost now!
It’s literally as easy as setting up this code structure:
```python
import instructor
from enum import Enum
from pydantic import BaseModel, Field
from openai import OpenAI

# Initialize the OpenAI client, patched by instructor to accept response models
client = instructor.patch(OpenAI())

# Define the data model for classification
class YTCategory(str, Enum):
    SPAM = "Spam Comment"
    NOT_SPAM = "Not Spam Comment"

class YTClassification(BaseModel):
    category: YTCategory
    confidence: float = Field(ge=0, le=1, description="Confidence score for the classification")
```
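The response model above can be exercised without an API call: Pydantic validates whatever the LLM returns against the schema. A self-contained sketch (the `gpt-4o-mini` model name in the commented-out call is an assumption, not from the original post):

```python
from enum import Enum
from pydantic import BaseModel, Field

class YTCategory(str, Enum):
    SPAM = "Spam Comment"
    NOT_SPAM = "Not Spam Comment"

class YTClassification(BaseModel):
    category: YTCategory
    confidence: float = Field(ge=0, le=1, description="Confidence score for the classification")

# With instructor, the actual API call would look roughly like this
# (requires an API key; model name is an assumption):
#
# result = client.chat.completions.create(
#     model="gpt-4o-mini",
#     response_model=YTClassification,
#     messages=[{"role": "user", "content": f"Classify this comment: {comment}"}],
# )

# Locally, Pydantic parses and validates a raw response dict:
parsed = YTClassification.model_validate(
    {"category": "Spam Comment", "confidence": 0.97}
)
print(parsed.category.name, parsed.confidence)  # SPAM 0.97
```

Out-of-range values (e.g. `confidence=1.5`) raise a validation error, which is exactly the guardrail that makes structured outputs nicer than parsing free text.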
To understand what I did above in more depth, look at this blogpost explaining in detail how to leverage the potential of LLMs with the Instructor and Pydantic libraries.
Blogpost Suggestion: Bridging Language Models with Python using Instructor, Pydantic, and OpenAI's Function Calling.