from sklearn.datasets import fetch_20newsgroups
from onprem.pipelines.classifier import SKClassifier, HFClassifier, FewShotClassifier
pipelines.classifier
ClassifierBase
ClassifierBase ()
Helper class that provides a standard way to create an ABC using inheritance.
ClassifierBase.arrays2dataset
ClassifierBase.arrays2dataset (X:List[str], y:Union[List[int],List[str]], text_key:str='text', label_key:str='label')
Convert train or test examples to HF dataset
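As a quick illustration, the sketch below converts parallel lists into a Hugging Face dataset. The toy texts and labels are hypothetical, and clf stands in for any instantiated subclass (e.g., SKClassifier):
# Hypothetical toy data; clf is assumed to be an instantiated ClassifierBase subclass
texts = ["the gpu drivers keep crashing", "the sermon was inspiring"]
labels = [0, 1]
ds = clf.arrays2dataset(texts, labels, text_key="text", label_key="label")
# ds is a Hugging Face Dataset with 'text' and 'label' columns;
# dataset2arrays (next) performs the inverse conversion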
ClassifierBase.dataset2arrays
ClassifierBase.dataset2arrays (dataset, text_key:str='text', label_key:str='label')
Convert a Hugging Face dataset to X, y arrays
ClassifierBase.evaluate
ClassifierBase.evaluate (X_eval:list, y_eval:list, print_report:bool=True, labels:list=[], **kwargs)
Evaluates labeled data using the trained model. If print_report is True, prints a classification report and returns nothing. Otherwise, returns and prints a dictionary of the results. Extra kwargs are fed to self.predict.
ClassifierBase.explain
ClassifierBase.explain (X:list, labels:list=[])
Explain the predictions on given examples in X. (Requires shap and matplotlib to be installed.)
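For example, the sketch below explains a single prediction. It assumes a classifier already trained as in the examples further down, and that shap and matplotlib are installed:
# Requires shap and matplotlib (e.g., pip install shap matplotlib)
# Assumes clf has already been trained, as in the 20 Newsgroups examples below
clf.explain(["The graphics on my monitor are terrible."])
# Displays a SHAP visualization highlighting the words that drove the prediction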
ClassifierBase.sample_examples
ClassifierBase.sample_examples (X:list, y:list, num_samples:int=8, text_key:str='text', label_key:str='label')
Sample a dataset with num_samples per class.
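A small sketch of the balanced sampling this performs; the toy lists are hypothetical and clf is any instantiated subclass:
# Hypothetical toy data
X = ["great film", "loved it", "what a bore", "terrible acting", "fine", "awful"]
y = ["pos", "pos", "neg", "neg", "pos", "neg"]
X_sample, y_sample = clf.sample_examples(X, y, num_samples=2)
# X_sample and y_sample now hold a balanced sample of 2 examples per class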
SKClassifier
SKClassifier (model_path=None, labels=[], **kwargs)
Text classifier that trains a scikit-learn model on labeled text examples (see the Scikit-Learn example below).
SKClassifier.train
SKClassifier.train (X:List[str], y:Union[List[int],List[str]], **kwargs)
*Trains the classifier on a list of texts (X) and a list of labels (y). Additional keyword arguments are passed directly to self.model.fit.
Args:
- X: List of texts
- y: List representing labels
Returns:
- None*
SKClassifier.predict
SKClassifier.predict (X, **kwargs)
predict labels
SKClassifier.predict_proba
SKClassifier.predict_proba (X, **kwargs)
predict label probabilities
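For instance, a sketch that assumes the trained clf and the test documents from the Scikit-Learn example below:
probs = clf.predict_proba([test_doc1, test_doc2])
# one row of class probabilities per document; the predicted label
# corresponds to the highest-probability column
print(probs)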
SKClassifier.save
SKClassifier.save (filename:str)
Save model to specified filename (e.g., /tmp/mymodel.gz). The model is saved as a pickle file. To reload the model, supply model_path when instantiating SKClassifier.
Example: Training Scikit-Learn Text Classification Models
categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "sci.med",
]
train_b = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42
)
test_b = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42
)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target
classes = train_b.target_names

# y_test = [classes[y] for y in y_test]
# y_train = [classes[y] for y in y_train]

clf = SKClassifier(labels=classes)
clf.train(x_train, y_train)

test_doc1 = "Jesus Christ was a first century Jewish teacher and religious leader."
test_doc2 = "The graphics on my monitor are terrible."
print(clf.predict(test_doc1))
print(clf.predict([test_doc2]))
print(clf.predict([test_doc1, test_doc2]))
clf.evaluate(x_test, y_test)
soc.religion.christian
comp.graphics
['soc.religion.christian', 'comp.graphics']
precision recall f1-score support
alt.atheism 0.93 0.87 0.90 319
comp.graphics 0.88 0.96 0.92 389
sci.med 0.94 0.84 0.89 396
soc.religion.christian 0.91 0.96 0.94 398
accuracy 0.91 1502
macro avg 0.91 0.91 0.91 1502
weighted avg 0.91 0.91 0.91 1502
clf.evaluate(x_test, y_test, print_report=False)['accuracy']
0.9114513981358189
clf.save('/tmp/mymodel.gz')  # save
clf = SKClassifier(model_path='/tmp/mymodel.gz', labels=classes)  # reload
clf.evaluate(x_test, y_test)
precision recall f1-score support
alt.atheism 0.93 0.87 0.90 319
comp.graphics 0.88 0.96 0.92 389
sci.med 0.94 0.84 0.89 396
soc.religion.christian 0.91 0.96 0.94 398
accuracy 0.91 1502
macro avg 0.91 0.91 0.91 1502
weighted avg 0.91 0.91 0.91 1502
HFClassifier
HFClassifier (model_id_or_path:str='google/bert_uncased_L-2_H-128_A-2', device=None, labels=[], **kwargs)
Text classifier that fine-tunes a Hugging Face transformers model on labeled text examples.
HFClassifier.train
HFClassifier.train (X:List[str], y:Union[List[int],List[str]], **kwargs)
*Trains the classifier on a list of texts (X) and a list of labels (y). Extra kwargs are treated as arguments to transformers.TrainingArguments.
Args:
- X: List of texts
- y: List of integers representing labels
- num_epochs: Number of epochs to train
- batch_size: Batch size
- metric: metric to use
- callbacks: A list of callbacks to customize the training loop.
Returns:
- None*
HFClassifier.predict
HFClassifier.predict (X:list, max_length=512, **kwargs)
Predict labels. Extra kwargs fed to Hugging Face transformers text-classification pipeline.
HFClassifier.predict_proba
HFClassifier.predict_proba (X:list, max_length=512, **kwargs)
Predict label probabilities. Extra kwargs fed to Hugging Face transformers text-classification pipeline.
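For instance, standard Hugging Face pipeline call arguments such as batch_size can be passed through. The sketch below assumes the trained clf and test documents from the example that follows:
# batch_size is a standard Hugging Face pipeline argument; this sketch assumes
# HFClassifier forwards **kwargs to the pipeline as described above
preds = clf.predict([test_doc1, test_doc2], batch_size=8)
probs = clf.predict_proba([test_doc1, test_doc2], batch_size=8)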
The default model is a tiny BERT model (i.e., google/bert_uncased_L-2_H-128_A-2), but we will use a larger model here to improve accuracy (e.g., distilbert).
Example: Training Hugging Face Transformer Models
categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "sci.med",
]
train_b = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42
)
test_b = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42
)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target
classes = train_b.target_names

clf = HFClassifier(model_id_or_path='distilbert/distilbert-base-uncased',
                   device='cuda', labels=classes)
clf.train(x_train, y_train, num_train_epochs=1, per_device_train_batch_size=8)

test_doc1 = "Jesus Christ was a first century Jewish teacher and religious leader."
test_doc2 = "The graphics on my monitor are terrible."
print(clf.predict(test_doc1))
print(clf.predict([test_doc2]))
print(clf.predict([test_doc1, test_doc2]))
clf.evaluate(x_test, y_test)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
soc.religion.christian
comp.graphics
['soc.religion.christian', 'comp.graphics']
precision recall f1-score support
alt.atheism 0.89 0.88 0.89 319
comp.graphics 0.97 0.98 0.97 389
sci.med 0.97 0.95 0.96 396
soc.religion.christian 0.94 0.96 0.95 398
accuracy 0.95 1502
macro avg 0.94 0.94 0.94 1502
weighted avg 0.95 0.95 0.95 1502
clf.explain(test_doc1)
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
clf.save('/tmp/my_hf_model')
clf = HFClassifier('/tmp/my_hf_model', device='cuda', labels=classes)
clf.evaluate(x_test, y_test, print_report=False)['accuracy']
0.9454061251664447
FewShotClassifier
FewShotClassifier (model_id_or_path:str='sentence-transformers/paraphrase-mpnet-base-v2', use_smaller:bool=False, **kwargs)
Few-shot text classifier built on SetFit and sentence-transformers models.
FewShotClassifier.train
FewShotClassifier.train (X:List[str], y:Union[List[int],List[str]], num_epochs:int=10, batch_size:int=32, metric='accuracy', callbacks=None, **kwargs)
*Trains the classifier on a list of texts (X) and a list of labels (y). Additional keyword arguments are passed directly to SetFit.TrainingArguments.
Args:
- X: List of texts
- y: List of integers representing labels
- num_epochs: Number of epochs to train
- batch_size: Batch size
- metric: metric to use
- callbacks: A list of callbacks to customize the training loop.
Returns:
- None*
FewShotClassifier.predict
FewShotClassifier.predict (X, **kwargs)
predict labels
FewShotClassifier.predict_proba
FewShotClassifier.predict_proba (X, **kwargs)
predict label probabilities
FewShotClassifier.save
FewShotClassifier.save (save_path:str)
Save model to specified folder path, save_path. To reload the model, supply the path in the model_id_or_path argument when instantiating FewShotClassifier.
Example: Training Few-Shot Text Classifiers
clf = FewShotClassifier(labels=['negative', 'positive'])
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Sample a tiny dataset with only 8 examples per class (or 16 total examples):
from datasets import load_dataset
dataset = load_dataset("SetFit/sst2")
X_train, y_train = clf.dataset2arrays(dataset["train"], text_key="text", label_key="label")
X_test, y_test = clf.dataset2arrays(dataset["test"], text_key="text", label_key="label")
X_sample, y_sample = clf.sample_examples(X_train, y_train, label_key="label", num_samples=8)
Repo card metadata block was not found. Setting CardData to empty.
clf.train(X_sample, y_sample, max_steps=50)
Applying column mapping to the training dataset
***** Running training *****
Num unique pairs = 144
Batch size = 32
Num epochs = 10
| Step | Training Loss |
|------|---------------|
| 1    | 0.242700      |
| 50   | 0.047300      |
clf.evaluate(X_test, y_test)
precision recall f1-score support
negative 0.88 0.94 0.91 912
positive 0.93 0.87 0.90 909
accuracy 0.91 1821
macro avg 0.91 0.91 0.91 1821
weighted avg 0.91 0.91 0.91 1821
= ["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"] new_data
= clf.predict(new_data)
preds preds
['positive', 'negative']
preds = clf.predict_proba(new_data)
preds
tensor([[0.1657, 0.8343],
[0.8551, 0.1449]], dtype=torch.float64)
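The column order matches the labels supplied at construction (negative, then positive), as the outputs above show, so the probabilities can be mapped back to label names. A small illustrative step, not part of the original example:
labels = ['negative', 'positive']
print([labels[int(row.argmax())] for row in preds])  # ['positive', 'negative']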
clf.save('/tmp/my_fewshot_model')
clf = FewShotClassifier('/tmp/my_fewshot_model', labels=['negative', 'positive'])
preds = clf.predict(new_data)
preds
['positive', 'negative']
clf.explain(new_data)