pipelines.classifier

Wrappers for different approaches to text classification, including scikit-learn models, Hugging Face Transformers, and few-shot classification (via SetFit).

source

ClassifierBase

 ClassifierBase ()

Abstract base class providing shared functionality for the text classifiers below (SKClassifier, HFClassifier, and FewShotClassifier).


source

ClassifierBase.arrays2dataset

 ClassifierBase.arrays2dataset (X:List[str], y:Union[List[int],List[str]],
                                text_key:str='text',
                                label_key:str='label')

Convert train or test examples to HF dataset

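For instance, a minimal sketch of converting plain Python lists into a Hugging Face dataset, using a hypothetical two-class toy dataset and assuming the method is called on an instantiated classifier such as SKClassifier:

from onprem.pipelines.classifier import SKClassifier

X = ["the pitcher threw a strike", "the patient was prescribed antibiotics"]  # hypothetical texts
y = ["sports", "medicine"]                                                    # hypothetical labels

clf = SKClassifier(labels=["sports", "medicine"])
ds = clf.arrays2dataset(X, y, text_key="text", label_key="label")
print(ds)  # a datasets.Dataset with "text" and "label" columns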

source

ClassifierBase.dataset2arrays

 ClassifierBase.dataset2arrays (dataset, text_key:str='text',
                                label_key:str='label')

Convert a Hugging Face dataset to X, y arrays

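A minimal sketch of the reverse direction, mirroring the few-shot example later in this document (assumes the datasets package is installed and clf is an instantiated classifier):

from datasets import load_dataset

dataset = load_dataset("SetFit/sst2")
X_test, y_test = clf.dataset2arrays(dataset["test"], text_key="text", label_key="label")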

source

ClassifierBase.evaluate

 ClassifierBase.evaluate (X_eval:list, y_eval:list,
                          print_report:bool=True, labels:list=[],
                          **kwargs)

Evaluates labeled data using the trained model. If print_report is True, prints a classification report and returns nothing. Otherwise, prints and returns a dictionary of the results. Extra kwargs are passed to self.predict.

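For example, the two modes of evaluate look like this (mirroring the scikit-learn example below):

clf.evaluate(X_eval, y_eval)                                # prints a classification report
results = clf.evaluate(X_eval, y_eval, print_report=False)  # returns a dictionary of results
print(results['accuracy'])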

source

ClassifierBase.explain

 ClassifierBase.explain (X:list, labels:list=[])

Explain the predictions on given examples in X. (Requires shap and matplotlib to be installed.)

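A minimal sketch, assuming shap and matplotlib have been installed (a worked example appears in the Hugging Face section below):

# pip install shap matplotlib
clf.explain(["The graphics on my monitor are terrible."])  # renders per-token contributions to each class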

source

ClassifierBase.sample_examples

 ClassifierBase.sample_examples (X:list, y:list, num_samples:int=8,
                                 text_key:str='text',
                                 label_key:str='label')

Sample a dataset with num_samples per class

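A minimal sketch, mirroring the few-shot example later in this document:

# keep only 8 examples per class for few-shot training
X_sample, y_sample = clf.sample_examples(X_train, y_train, num_samples=8)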

source

SKClassifier

 SKClassifier (model_path=None, labels=[], **kwargs)

Text classifier that wraps a scikit-learn model.


source

SKClassifier.train

 SKClassifier.train (X:List[str], y:Union[List[int],List[str]], **kwargs)

Trains the classifier on a list of texts (X) and a list of labels (y). Additional keyword arguments are passed directly to self.model.fit.

Args:

  • X: List of texts
  • y: List representing labels

Returns:

  • None

source

SKClassifier.predict

 SKClassifier.predict (X, **kwargs)

Predict labels


source

SKClassifier.predict_proba

 SKClassifier.predict_proba (X, **kwargs)

Predict label probabilities

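predict_proba is not used in the example below, so here is a hedged sketch of what a call might look like (the exact shape of the output depends on the underlying scikit-learn model):

probs = clf.predict_proba(["The graphics on my monitor are terrible."])
print(probs)  # per-class probabilities, one column per label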

source

SKClassifier.save

 SKClassifier.save (filename:str)

Save model to specified filename (e.g., /tmp/mymodel.gz). The model is saved as a pickle file. To reload the model, supply model_path when instantiating SKClassifier.

Example: Training Scikit-Learn Text Classification Models

from sklearn.datasets import fetch_20newsgroups
from onprem.pipelines.classifier import SKClassifier
categories = [
             "alt.atheism",
             "soc.religion.christian",
             "comp.graphics",
             "sci.med" ]

train_b = fetch_20newsgroups(
            subset="train", categories=categories, shuffle=True, random_state=42
)
test_b = fetch_20newsgroups(
            subset="test", categories=categories, shuffle=True, random_state=42
)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target
classes = train_b.target_names

# y_test = [classes[y] for y in y_test]
# y_train = [classes[y] for y in y_train]

clf = SKClassifier(labels=classes)
clf.train(x_train, y_train)
test_doc1 = "Jesus Christ was a first century Jewish teacher and religious leader."
test_doc2 = "The graphics on my monitor are terrible."
print(clf.predict(test_doc1))
print(clf.predict([test_doc2]))
print(clf.predict([test_doc1, test_doc2]))
clf.evaluate(x_test, y_test)
soc.religion.christian
comp.graphics
['soc.religion.christian', 'comp.graphics']
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.87      0.90       319
         comp.graphics       0.88      0.96      0.92       389
               sci.med       0.94      0.84      0.89       396
soc.religion.christian       0.91      0.96      0.94       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502
clf.evaluate(x_test, y_test, print_report=False)['accuracy']
0.9114513981358189
clf.save('/tmp/mymodel.gz') # save
clf = SKClassifier(model_path='/tmp/mymodel.gz', labels=classes) # reload
clf.evaluate(x_test, y_test)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.87      0.90       319
         comp.graphics       0.88      0.96      0.92       389
               sci.med       0.94      0.84      0.89       396
soc.religion.christian       0.91      0.96      0.94       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502

source

HFClassifier

 HFClassifier (model_id_or_path:str='google/bert_uncased_L-2_H-128_A-2',
               device=None, labels=[], **kwargs)

Text classifier that fine-tunes a Hugging Face Transformers model.


source

HFClassifier.train

 HFClassifier.train (X:List[str], y:Union[List[int],List[str]], **kwargs)

Trains the classifier on a list of texts (X) and a list of labels (y). Extra kwargs are treated as arguments to transformers.TrainingArguments.

Args:

  • X: List of texts
  • y: List of integers representing labels
  • num_epochs: Number of epochs to train
  • batch_size: Batch size
  • metric: metric to use
  • callbacks: A list of callbacks to customize the training loop.

Returns:

  • None

source

HFClassifier.predict

 HFClassifier.predict (X:list, max_length=512, **kwargs)

Predict labels. Extra kwargs fed to Hugging Face transformers text-classification pipeline.


source

HFClassifier.predict_proba

 HFClassifier.predict_proba (X:list, max_length=512, **kwargs)

Predict label probabilities. Extra kwargs fed to Hugging Face transformers text-classification pipeline.

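A minimal, hedged sketch of a predict_proba call (not used in the example below; the return type and column ordering may vary):

probs = clf.predict_proba(["The graphics on my monitor are terrible."])
# probs holds per-class probabilities, one column per label
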
The default model is a tiny BERT model (`google/bert_uncased_L-2_H-128_A-2`), but we will use a larger model (DistilBERT) here to improve accuracy.

Example: Training Hugging Face Transformer Models

categories = [
             "alt.atheism",
             "soc.religion.christian",
             "comp.graphics",
             "sci.med" ]

train_b = fetch_20newsgroups(
            subset="train", categories=categories, shuffle=True, random_state=42
)
test_b = fetch_20newsgroups(
            subset="test", categories=categories, shuffle=True, random_state=42
)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target
classes = train_b.target_names

clf = HFClassifier(model_id_or_path='distilbert/distilbert-base-uncased', 
                   device='cuda', labels=classes)
clf.train(x_train, y_train, num_train_epochs=1, per_device_train_batch_size=8)
test_doc1 = "Jesus Christ was a first century Jewish teacher and religious leader."
test_doc2 = "The graphics on my monitor are terrible."
print(clf.predict(test_doc1))
print(clf.predict([test_doc2]))
print(clf.predict([test_doc1, test_doc2]))
clf.evaluate(x_test, y_test)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[283/283 02:25, Epoch 1/1]
Step Training Loss

soc.religion.christian
comp.graphics
['soc.religion.christian', 'comp.graphics']
                        precision    recall  f1-score   support

           alt.atheism       0.89      0.88      0.89       319
         comp.graphics       0.97      0.98      0.97       389
               sci.med       0.97      0.95      0.96       396
soc.religion.christian       0.94      0.96      0.95       398

              accuracy                           0.95      1502
             macro avg       0.94      0.94      0.94      1502
          weighted avg       0.95      0.95      0.95      1502
clf.explain(test_doc1)
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[SHAP force plot: per-token contributions of test_doc1 toward each output class (alt.atheism, comp.graphics, sci.med, soc.religion.christian)]

clf.save('/tmp/my_hf_model')
clf = HFClassifier('/tmp/my_hf_model', device='cuda', labels=classes)
clf.evaluate(x_test,  y_test, print_report=False)['accuracy']
0.9454061251664447

source

FewShotClassifier

 FewShotClassifier (model_id_or_path:str='sentence-transformers/paraphrase-mpnet-base-v2',
                    use_smaller:bool=False, **kwargs)

Few-shot text classifier based on SetFit.


source

FewShotClassifier.train

 FewShotClassifier.train (X:List[str], y:Union[List[int],List[str]],
                          num_epochs:int=10, batch_size:int=32,
                          metric='accuracy', callbacks=None, **kwargs)

Trains the classifier on a list of texts (X) and a list of labels (y). Additional keyword arguments are passed directly to SetFit.TrainingArguments.

Args:

  • X: List of texts
  • y: List of integers representing labels
  • num_epochs: Number of epochs to train
  • batch_size: Batch size
  • metric: metric to use
  • callbacks: A list of callbacks to customize the training loop.

Returns:

  • None

source

FewShotClassifier.predict

 FewShotClassifier.predict (X, **kwargs)

Predict labels


source

FewShotClassifier.predict_proba

 FewShotClassifier.predict_proba (X, **kwargs)

Predict label probabilities


source

FewShotClassifier.save

 FewShotClassifier.save (save_path:str)

Save model to specified folder path, save_path. To reload the model, supply the path in the model_id_or_path argument when instantiating FewShotClassifier.

Example: Training Few-Shot Text Classifiers

clf = FewShotClassifier(labels=['negative', 'positive'])
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
from datasets import load_dataset

Sample a tiny dataset with only 8 examples per class (or 16 total examples):

dataset = load_dataset("SetFit/sst2")
X_train, y_train = clf.dataset2arrays(dataset["train"], text_key="text", label_key="label")
X_test, y_test = clf.dataset2arrays(dataset["test"], text_key="text", label_key="label")
X_sample, y_sample = clf.sample_examples(X_train,  y_train, label_key="label", num_samples=8)
Repo card metadata block was not found. Setting CardData to empty.
clf.train(X_sample,  y_sample, max_steps=50)
Applying column mapping to the training dataset
***** Running training *****
  Num unique pairs = 144
  Batch size = 32
  Num epochs = 10
[50/50 00:28, Epoch 10/10]
Step Training Loss
1 0.242700
50 0.047300

clf.evaluate(X_test, y_test)
              precision    recall  f1-score   support

    negative       0.88      0.94      0.91       912
    positive       0.93      0.87      0.90       909

    accuracy                           0.91      1821
   macro avg       0.91      0.91      0.91      1821
weighted avg       0.91      0.91      0.91      1821
new_data = ["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"]
preds = clf.predict(new_data)
preds
['positive', 'negative']
preds = clf.predict_proba(new_data)
preds
tensor([[0.1657, 0.8343],
        [0.8551, 0.1449]], dtype=torch.float64)
clf.save('/tmp/my_fewshot_model')
clf = FewShotClassifier('/tmp/my_fewshot_model', labels=['negative', 'positive'])
preds = clf.predict(new_data)
preds
['positive', 'negative']
clf.explain(new_data)


[SHAP force plots: for "i loved the spiderman movie!", the tokens "loved" and "!" push the prediction away from negative (f(negative) drops from a base value of about 0.68 to 0.17); for "pineapple on pizza is the worst 🤮", the token "worst" pushes strongly toward negative (f(negative) rises from about 0.65 to 0.86)]