pipelines.classifier

Wrappers for different approaches to text classification: scikit-learn models, Hugging Face Transformers, and few-shot classification via SetFit.

source

ClassifierBase


def ClassifierBase(
    *args, **kwargs
):

Abstract base class providing shared functionality for the text classifiers in this module.


source

ClassifierBase.arrays2dataset


def arrays2dataset(
    X:List, y:Union, text_key:str='text', label_key:str='label'
):

Convert train or test examples to a Hugging Face dataset
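For example (a minimal sketch, where clf is any instantiated ClassifierBase subclass; the result is assumed to be a datasets.Dataset keyed by text_key and label_key):

X = ["good film", "awful film"]
y = [1, 0]
ds = clf.arrays2dataset(X, y)  # dataset with 'text' and 'label' columns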


source

ClassifierBase.dataset2arrays


def dataset2arrays(
    dataset, text_key:str='text', label_key:str='label'
):

Convert a Hugging Face dataset to X, y arrays
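For example (mirroring the few-shot example further below):

from datasets import load_dataset
dataset = load_dataset("SetFit/sst2")
X_train, y_train = clf.dataset2arrays(dataset["train"], text_key="text", label_key="label")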


source

ClassifierBase.evaluate


def evaluate(
    X_eval:list, y_eval:list, print_report:bool=True, labels:list=[], **kwargs
):

Evaluates labeled data using the trained model. If print_report is True, prints a classification report and returns nothing. Otherwise, returns a dictionary of the results. Extra kwargs are fed to self.predict.
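For example:

clf.evaluate(X_eval, y_eval)                                # prints a classification report
results = clf.evaluate(X_eval, y_eval, print_report=False)  # returns a results dictionary
print(results['accuracy'])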


source

ClassifierBase.explain


def explain(
    X:list, labels:list=[]
):

Explain the model's predictions on the given examples in X. (Requires shap and matplotlib to be installed.)
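A minimal sketch (the optional dependencies install with pip install shap matplotlib):

clf.explain(["The graphics on my monitor are terrible."])  # renders a SHAP text plot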


source

ClassifierBase.sample_examples


def sample_examples(
    X:list, y:list, num_samples:int=8, text_key:str='text', label_key:str='label'
):

Sample a dataset with num_samples examples per class
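For example (as used in the few-shot example further below):

X_sample, y_sample = clf.sample_examples(X_train, y_train, num_samples=8)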


source

SKClassifier


def SKClassifier(
    model_path=None, labels:list=[], **kwargs
):

Text classifier built on scikit-learn models.


source

SKClassifier.train


def train(
    X:List, y:Union, **kwargs
):

Trains the classifier on a list of texts (X) and a list of labels (y). Additional keyword arguments are passed directly to self.model.fit.

Args:

  • X: List of texts
  • y: List representing labels

Returns:

  • None
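A minimal sketch (the two-example dataset and label names here are purely illustrative):

clf = SKClassifier(labels=['negative', 'positive'])
clf.train(["awful film", "great film"], [0, 1])  # integer labels map to the supplied names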

source

SKClassifier.predict


def predict(
    X, **kwargs
):

Predict labels.


source

SKClassifier.predict_proba


def predict_proba(
    X, **kwargs
):

Predict label probabilities.
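For example (a sketch; the output is assumed to follow scikit-learn's predict_proba convention of one row per document and one column per class):

probs = clf.predict_proba(["some text", "another text"])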


source

SKClassifier.save


def save(
    filename:str
):

Save model to the specified filename (e.g., /tmp/mymodel.gz). The model is saved as a pickle file. To reload the model, supply model_path when instantiating SKClassifier.

Example: Training Scikit-Learn Text Classification Models

from sklearn.datasets import fetch_20newsgroups
from onprem.pipelines.classifier import SKClassifier
categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "sci.med",
]

train_b = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42
)
test_b = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42
)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target
classes = train_b.target_names

# y_test = [classes[y] for y in y_test]
# y_train = [classes[y] for y in y_train]

clf = SKClassifier(labels=classes)
clf.train(x_train, y_train)
test_doc1 = "Jesus Christ was a first century Jewish teacher and religious leader."
test_doc2 = "The graphics on my monitor are terrible."
print(clf.predict(test_doc1))
print(clf.predict([test_doc2]))
print(clf.predict([test_doc1, test_doc2]))
clf.evaluate(x_test, y_test)
soc.religion.christian
comp.graphics
['soc.religion.christian', 'comp.graphics']
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.87      0.90       319
         comp.graphics       0.88      0.96      0.92       389
               sci.med       0.94      0.84      0.89       396
soc.religion.christian       0.91      0.96      0.94       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502
clf.evaluate(x_test, y_test, print_report=False)['accuracy']
0.9114513981358189
clf.save('/tmp/mymodel.gz') # save
clf = SKClassifier(model_path='/tmp/mymodel.gz', labels=classes) # reload
clf.evaluate(x_test, y_test)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.87      0.90       319
         comp.graphics       0.88      0.96      0.92       389
               sci.med       0.94      0.84      0.89       396
soc.religion.christian       0.91      0.96      0.94       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502

source

HFClassifier


def HFClassifier(
    model_id_or_path:str='google/bert_uncased_L-2_H-128_A-2', device=None, labels:list=[],
    **kwargs
):

Text classifier built on Hugging Face Transformers models.


source

HFClassifier.train


def train(
    X:List, y:Union, **kwargs
):

Trains the classifier on a list of texts (X) and a list of labels (y). Extra kwargs are treated as arguments to transformers.TrainingArguments.

Args:

  • X: List of texts
  • y: List of integers representing labels

Returns:

  • None
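For example, transformers.TrainingArguments fields such as num_train_epochs and per_device_train_batch_size can be supplied directly (as in the example below):

clf.train(x_train, y_train, num_train_epochs=1, per_device_train_batch_size=8)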

source

HFClassifier.predict


def predict(
    X:list, max_length:int=512, **kwargs
):

Predict labels. Extra kwargs are fed to the Hugging Face transformers text-classification pipeline.
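For example (a sketch; max_length is assumed to cap the tokenized input length):

preds = clf.predict(["a long document", "another document"], max_length=256)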


source

HFClassifier.predict_proba


def predict_proba(
    X:list, max_length:int=512, **kwargs
):

Predict label probabilities. Extra kwargs are fed to the Hugging Face transformers text-classification pipeline.

The default model is a tiny BERT model (i.e., `google/bert_uncased_L-2_H-128_A-2`), but we will use a larger model here to improve accuracy (e.g., distilbert).

Example: Training Hugging Face Transformer Models

categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "sci.med",
]

train_b = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42
)
test_b = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42
)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target
classes = train_b.target_names

clf = HFClassifier(model_id_or_path='distilbert/distilbert-base-uncased', 
                   device='cuda', labels=classes)
clf.train(x_train, y_train, num_train_epochs=1, per_device_train_batch_size=8)
test_doc1 = "Jesus Christ was a first century Jewish teacher and religious leader."
test_doc2 = "The graphics on my monitor are terrible."
print(clf.predict(test_doc1))
print(clf.predict([test_doc2]))
print(clf.predict([test_doc1, test_doc2]))
clf.evaluate(x_test, y_test)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[283/283 02:25, Epoch 1/1]

soc.religion.christian
comp.graphics
['soc.religion.christian', 'comp.graphics']
                        precision    recall  f1-score   support

           alt.atheism       0.89      0.88      0.89       319
         comp.graphics       0.97      0.98      0.97       389
               sci.med       0.97      0.95      0.96       396
soc.religion.christian       0.94      0.96      0.95       398

              accuracy                           0.95      1502
             macro avg       0.94      0.94      0.94      1502
          weighted avg       0.95      0.95      0.95      1502
clf.explain(test_doc1)
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[SHAP text plot: per-token contributions of test_doc1 toward each of the four output classes (alt.atheism, comp.graphics, sci.med, soc.religion.christian)]
clf.save('/tmp/my_hf_model')
clf = HFClassifier('/tmp/my_hf_model', device='cuda', labels=classes)
clf.evaluate(x_test,  y_test, print_report=False)['accuracy']
0.9454061251664447

source

FewShotClassifier


def FewShotClassifier(
    model_id_or_path:str='sentence-transformers/paraphrase-mpnet-base-v2', use_smaller:bool=False,
    **kwargs
):

Few-shot text classifier built on SetFit.


source

FewShotClassifier.train


def train(
    X:List, y:Union, num_epochs:int=10, batch_size:int=32, metric:str='accuracy', callbacks=None,
    **kwargs
):

Trains the classifier on a list of texts (X) and a list of labels (y). Additional keyword arguments are passed directly to setfit.TrainingArguments.

Args:

  • X: List of texts
  • y: List of integers representing labels
  • num_epochs: Number of epochs to train
  • batch_size: Batch size
  • metric: metric to use
  • callbacks: A list of callbacks to customize the training loop.

Returns:

  • None
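For example, SetFit training-argument fields such as max_steps can be supplied directly (as in the example below):

clf.train(X_sample, y_sample, max_steps=50)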

source

FewShotClassifier.predict


def predict(
    X, **kwargs
):

Predict labels.


source

FewShotClassifier.predict_proba


def predict_proba(
    X, **kwargs
):

Predict label probabilities.


source

FewShotClassifier.save


def save(
    save_path:str
):

Save model to the specified folder path, save_path. To reload the model, supply the path as the model_id_or_path argument when instantiating FewShotClassifier.

Example: Training Few-Shot Text Classifiers

clf = FewShotClassifier(labels=['negative', 'positive'])
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
from datasets import load_dataset

Sample a tiny dataset with only 8 examples per class (or 16 total examples):

dataset = load_dataset("SetFit/sst2")
X_train, y_train = clf.dataset2arrays(dataset["train"], text_key="text", label_key="label")
X_test, y_test = clf.dataset2arrays(dataset["test"], text_key="text", label_key="label")
X_sample, y_sample = clf.sample_examples(X_train,  y_train, label_key="label", num_samples=8)
Repo card metadata block was not found. Setting CardData to empty.
clf.train(X_sample,  y_sample, max_steps=50)
Applying column mapping to the training dataset
***** Running training *****
  Num unique pairs = 144
  Batch size = 32
  Num epochs = 10
[50/50 00:28, Epoch 10/10]
Step Training Loss
1 0.242700
50 0.047300

clf.evaluate(X_test, y_test)
              precision    recall  f1-score   support

    negative       0.88      0.94      0.91       912
    positive       0.93      0.87      0.90       909

    accuracy                           0.91      1821
   macro avg       0.91      0.91      0.91      1821
weighted avg       0.91      0.91      0.91      1821
new_data = ["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"]
preds = clf.predict(new_data)
preds
['positive', 'negative']
preds = clf.predict_proba(new_data)
preds
tensor([[0.1657, 0.8343],
        [0.8551, 0.1449]], dtype=torch.float64)
clf.save('/tmp/my_fewshot_model')
clf = FewShotClassifier('/tmp/my_fewshot_model', labels=['negative', 'positive'])
preds = clf.predict(new_data)
preds
['positive', 'negative']
clf.explain(new_data)


[SHAP text plots. Example [0]: tokens such as "loved" (-0.292) and "!" (-0.249) push "i loved the spiderman movie!" away from the negative class. Example [1]: "worst" (+0.53) is the dominant contributor pushing "pineapple on pizza is the worst 🤮" toward the negative class.]