languagechange package¶

Submodules¶

languagechange.benchmark module¶

languagechange.benchmark.purity(labels_true, cluster_labels)[source]¶

class languagechange.benchmark.Benchmark[source]¶

Bases: object

get_dataset(key)[source]¶

get_train()[source]¶

get_dev()[source]¶

get_test()[source]¶

get_all_data()[source]¶

split_train_dev_test(train_prop=0.8, dev_prop=0.1, test_prop=0.1, shuffle=True, epsilon=1e-06)[source]¶

get_data_by_word(dataset, word)[source]¶

word_index_to_char_indices(text, word_index, split_text=False)[source]¶

class languagechange.benchmark.SemanticChangeEvaluationDataset(dataset=None, language=None, version=None, name=None)[source]¶

Bases: Benchmark

load_from_target_usages(target_usages, scores)[source]¶

Parameters:: target_usages (Dict[str, List[TargetUsage]])

evaluate_cd(predictions)[source]¶

Evaluates binary change detection by comparing the predictions to the change scores in self.binary_task.

Parameters:

predictions (Union[List[int], Dict[str, int]]) – either a list of predictions (0 or 1) in the same order as the keys of self.stats_groupings or a dictionary {target_word: prediction}.

Returns:

An accuracy score: the percentage of correct predictions.

Return type:

numpy.float64
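The two accepted input shapes can be illustrated with a minimal sketch of accuracy over binary predictions. This is not the library's implementation: `binary_accuracy` and the `gold` dictionary (standing in for the gold binary change labels) are hypothetical names chosen for the example.

```python
from typing import Dict, List, Union


def binary_accuracy(predictions: Union[List[int], Dict[str, int]],
                    gold: Dict[str, int]) -> float:
    """Fraction of target words whose predicted label matches the gold label.

    `gold` is a hypothetical stand-in for the gold binary change labels,
    keyed by target word; its key order defines the order of a list input.
    """
    if isinstance(predictions, dict):
        # Dictionary input: look up each gold word by name.
        pairs = [(predictions[word], label) for word, label in gold.items()]
    else:
        # List input: assumed to follow the order of the gold keys.
        if len(predictions) != len(gold):
            raise ValueError("predictions and gold labels differ in length")
        pairs = list(zip(predictions, gold.values()))
    correct = sum(1 for predicted, label in pairs if predicted == label)
    return correct / len(pairs)


gold_labels = {"plane": 1, "tree": 0, "graft": 1, "face": 0}
acc_from_list = binary_accuracy([1, 0, 0, 0], gold_labels)
acc_from_dict = binary_accuracy(
    {"plane": 1, "tree": 0, "graft": 0, "face": 0}, gold_labels
)
```

Both calls describe the same predictions, so they yield the same score; the dictionary form is less error-prone because it does not depend on key order.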

evaluate_gcd(predictions)[source]¶

Evaluates graded change detection by comparing the predictions to the change scores in self.graded_task.

Parameters:

predictions (Union[List[int], Dict[str, int]]) – either a list of predictions (0 or 1) in the same order as the keys of self.stats_groupings or a dictionary {target_word: prediction}.

Returns:

(scipy.stats._stats_py.SignificanceResult[numpy.float64, numpy.float64]) The Spearman correlation (rho, p) between the predictions and the gold labels.
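For readers unfamiliar with the metric, the following is a minimal, dependency-free sketch of Spearman's rho, computed as the Pearson correlation of average ranks. It is an illustration only: the documented method returns a SciPy result object that also carries a p-value, which this sketch omits, and the names `average_ranks` and `spearman_rho` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


gold_scores = [0.10, 0.35, 0.80, 0.55]
predicted_scores = [0.20, 0.30, 0.90, 0.60]  # same ordering as the gold scores
reversed_scores = [0.90, 0.60, 0.20, 0.30]   # exactly the opposite ordering

rho_same = spearman_rho(predicted_scores, gold_scores)
rho_reversed = spearman_rho(reversed_scores, gold_scores)
```

Because only the ranks matter, predictions on a different scale from the gold scores can still reach a perfect correlation.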

class languagechange.benchmark.SemEval2020Task1(language, subset=None, config='opt')[source]¶

Bases: SemanticChangeEvaluationDataset

Parameters:: subset (int)

load()[source]¶

get_word_usages(word, group='all')[source]¶

class languagechange.benchmark.DWUG(path=None, dataset=None, language=None, version=None, subset=None, config='opt')[source]¶

Bases: SemanticChangeEvaluationDataset

load()[source]¶

get_usage_graph(word)[source]¶

show_usage_graph(word, config=None)[source]¶

get_word_usages(word, group='all')[source]¶

annotate_word(word, model, metric, prompt_template="Please tell me how similar the meaning of the word '{target}' is in the following example sentences: \n1. {usage_1}\n2. {usage_2}")[source]¶

Compares all usages of the target word in question and uses a model to compute judgments of their pairwise similarities, and saves the judgments to data/word/judgments.csv.

Parameters:

word (str) – the target word to annotate.
model (Union[ContextualizedModel, DefinitionGenerator, PromptModel]) – the model to use to annotate the usages.
metric (str | Callable) – if a ContextualizedModel or DefinitionGenerator is used, the metric to use to compute similarity between two vectors. Supported string values are ‘cosine’, ‘durel’ and ‘binary’. Alternatively, a function taking two vectors as input and returning a similarity score can be passed. If a PromptModel is used, this argument is ignored.
prompt_template (str) – if a PromptModel is used, the template to use for the user message in the prompt. The template must contain the placeholders ‘{target}’, ‘{usage_1}’ and ‘{usage_2}’.
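As an illustration of the placeholder requirement, the sketch below fills the default template from the signature with Python's `str.format`. How the library itself fills the template is not documented here; `build_user_message` and `REQUIRED_PLACEHOLDERS` are hypothetical names used only for this example.

```python
# Default template copied from the annotate_word signature.
DEFAULT_TEMPLATE = (
    "Please tell me how similar the meaning of the word '{target}' is "
    "in the following example sentences: \n1. {usage_1}\n2. {usage_2}"
)
REQUIRED_PLACEHOLDERS = ("{target}", "{usage_1}", "{usage_2}")


def build_user_message(template: str, target: str, usage_1: str, usage_2: str) -> str:
    """Hypothetical helper: check the placeholders, then fill the template."""
    missing = [p for p in REQUIRED_PLACEHOLDERS if p not in template]
    if missing:
        raise ValueError(f"template is missing placeholders: {missing}")
    return template.format(target=target, usage_1=usage_1, usage_2=usage_2)


message = build_user_message(
    DEFAULT_TEMPLATE,
    target="plane",
    usage_1="The plane landed an hour late.",
    usage_2="He smoothed the board with a plane.",
)
```

Checking for the placeholders up front gives a clear error for a malformed custom template instead of a prompt that silently omits one of the usages.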

annotate_all_words(model, metric=None)[source]¶

Annotates all target words in the dataset using the given model and metric (if applicable), and saves the judgments to data/word/judgments.csv.

Parameters:: metric (str | Callable)

get_word_annotations(word, only_between_groups=False, remove_outliers=False, exclude_non_judgments=False, transform_labels=None, return_list=False)[source]¶

Finds the judgments for a given word in the DWUG.

Parameters:

only_between_groups (bool) – if true, select only examples where the two usages belong to different groupings.
remove_outliers (bool) – if true, remove all examples which have not been assigned to a cluster (cluster label = -1).
exclude_non_judgments (bool) – if true, remove all pairs of usages for which there is no judgment (label = 0).
transform_labels (Callable | str) – a function which takes a list of labels and returns a label. By default, all labels are kept. As a string, only ‘mean’ is supported.
return_list (bool) – if true, return the judgments as a list.

Returns:

a dictionary {frozenset([id1, id2]): {‘word’: word, ‘id1’: id1, ‘text1’: text1, ‘start1’: start1, ‘end1’: end1, ‘id2’: id2, ‘text2’: text2, ‘start2’: start2, ‘end2’: end2, ‘label’: label}}, or a list of such dictionaries if return_list is true.

Return type:

Dict[frozenset, Dict] or List[Dict]

get_stats()[source]¶

get_stats_groupings()[source]¶

cast_to_WiC(only_between_groups=False, remove_outliers=True, exclude_non_judgments=True, transform_labels='mean')[source]¶

Casts the DWUG to a Word in Context (WiC) dataset.

Parameters:

only_between_groups (bool) – if true, select only examples where the two usages belong to different groupings.
remove_outliers (bool) – if true, remove all examples which have not been assigned to a cluster (cluster label = -1).
exclude_non_judgments (bool) – if true, remove all pairs of usages for which there is no judgment (label = 0).
transform_labels (Callable|str) – a function, or a string denoting a function (see self.get_word_annotations), which takes a list of labels and returns a label. By default, the mean of the labels.

get_usages_and_senses(remove_outliers=True)[source]¶

cast_to_WSD(remove_outliers=True)[source]¶

cast_to_WSI(remove_outliers=True)[source]¶

cluster_evaluation(predictions, metrics={'ari', 'purity'}, remove_outliers=True)[source]¶

evaluate_cd(predictions)[source]¶

Evaluates binary change detection by comparing the predictions to the change scores in self.binary_task.

Parameters:

predictions (Union[List[int], Dict[str, int]]) – either a list of predictions (0 or 1) in the same order as the keys of self.stats_groupings or a dictionary {target_word: prediction}.

Returns:

An accuracy score: the percentage of correct predictions.

Return type:

numpy.float64

evaluate_gcd(predictions)[source]¶

Evaluates graded change detection by comparing the predictions to the change scores in self.graded_task.

Parameters:

predictions (Union[List[int], Dict[str, int]]) – either a list of predictions (0 or 1) in the same order as the keys of self.stats_groupings or a dictionary {target_word: prediction}.

Returns:

(scipy.stats._stats_py.SignificanceResult[numpy.float64, numpy.float64]) The Spearman correlation (rho, p) between the predictions and the gold labels.

class languagechange.benchmark.WiC(path=None, wic_data=None, dataset=None, version=None, language=None, linguality=None, name=None)[source]¶

Bases: Benchmark

Dataset handling for the Word-in-Context (WiC) task.

Parameters:

path (str) – a path to the dataset, if it is not stored by the resource hub.
wic_data (dict | list)
dataset (str | list | dict) – the dataset to be loaded. One of [‘WiC’, ‘XL-WiC’, ‘TempoWiC’, ‘MCL-WiC’, ‘AM2iCo’] if using a dataset in the language change resource hub, or a list or a dict if loading from a data structure already describing a WiC dataset.
version (str) – the version of the dataset if using a dataset from the resource hub.
language (str) – the language code (e.g. AR), if loading a multilingual or crosslingual dataset.
linguality (str) – whether to use the crosslingual or multilingual dataset, in the case of MCL-WiC.
name (str) – the name of the dataset (in case no values for dataset, language and version are specified).

load_from_data(data)[source]¶

find_data_paths()[source]¶

load_from_txt(filename, word_indexes=False, index_to_offsets=None, field_map={'end1': 3, 'end2': 5, 'label': 8, 'start1': 2, 'start2': 4, 'text1': 6, 'text2': 7, 'word': 0}, skiplines=0)[source]¶

Parameters:: word_indexes (bool)
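The default `field_map` assigns a column index to each named field. A minimal sketch of how such a map can be applied to one line of a delimited text file follows. The tab separator, the sample row, and the helper `parse_line` are assumptions made for illustration, not details taken from the library.

```python
from typing import Dict, List

# Default column layout copied from the load_from_txt signature.
FIELD_MAP = {"word": 0, "start1": 2, "end1": 3, "start2": 4, "end2": 5,
             "text1": 6, "text2": 7, "label": 8}


def parse_line(line: str, field_map: Dict[str, int], sep: str = "\t") -> Dict[str, str]:
    """Pick named fields out of one delimited line (the separator is an assumption)."""
    columns: List[str] = line.rstrip("\n").split(sep)
    return {name: columns[index] for name, index in field_map.items()}


# Hypothetical row; column 1 is not referenced by the field map and is ignored.
row = "bank\tN\t4\t8\t14\t18\tThe bank raised rates.\tWe sat on the bank.\t0\n"
example = parse_line(row, FIELD_MAP)

# Character offsets select the target word inside each usage.
target_1 = example["text1"][int(example["start1"]):int(example["end1"])]
target_2 = example["text2"][int(example["start2"]):int(example["end2"])]
```

Passing a different `field_map` is enough to read files whose columns are ordered differently, without changing the parsing code.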

load_from_files(data_paths)[source]¶

format_label(label)[source]¶

load_from_resource_hub()[source]¶

load_from_target_usages(target_usages, labels)[source]¶

Parameters:: target_usages (List[Tuple[TargetUsage] | List[TargetUsage] | TargetUsageList])

evaluate(predictions, dataset, metric, word=None)[source]¶

Evaluates predictions by comparing them to the true labels of the dataset.

Parameters:

predictions (List[Dict] | Dict) – the predictions. If a dict, ids are expected in both this dict and the dataset to compare against.
dataset (str) – one of [‘train’, ‘dev’, ‘test’, ‘dev_larger’, …].
metric (Callable) – a metric, such as scipy.stats.spearmanr, that can be used to compare the predictions.

evaluate_spearman(predictions, dataset='test', word=None)[source]¶

Parameters:: predictions (List[Dict] | Dict)

evaluate_accuracy(predictions, dataset='test', word=None)[source]¶

Parameters:: predictions (List[Dict] | Dict)

evaluate_f1(predictions, dataset='test', word=None, average='macro')[source]¶

Parameters:: predictions (List[Dict] | Dict)

class languagechange.benchmark.WSD(path=None, wsd_data=None, dataset=None, language=None, version=None, name=None)[source]¶

Bases: Benchmark

Dataset handling for the Word Sense Disambiguation (WSD) task.

Parameters:

path (str) – a path to the dataset, if it is not stored by the resource hub in the cache folder.
wsd_data (list | dict)
dataset (str | list | dict) – the dataset to be loaded. ‘XL-WSD’ if using a dataset from the language change resource hub, or a list or a dict if loading from a data structure already describing a WSD dataset.
language (str) – the language code (e.g. BG).
version (str) – the version of the dataset if using a dataset in the resource hub.
name (str) – the name of the dataset (in case no values for dataset, language and version are specified).

load_from_data(data)[source]¶

find_data_paths(dataset, language)[source]¶

read_xml(path)[source]¶

load_from_files(data_paths, dataset)[source]¶

Loads a dataset from paths to train, dev and test sets (possibly None).

Parameters:

data_paths (Dict[Dict[str, str],str]) – a dictionary containing the paths to the different parts of the dataset, formatted as in self.find_data_paths().
dataset (str) – the name of the dataset.

load(dataset, language)[source]¶

load_from_target_usages(target_usages, labels)[source]¶

Parameters:: target_usages (List[TargetUsage] | TargetUsageList)

cast_to_WSI()[source]¶

evaluate(predictions, dataset, metric, word=None)[source]¶

Evaluates predictions by comparing them to the true labels of the dataset.

Parameters:

predictions (List[Dict] | Dict) – the predictions. If a dict, ids are expected in both this dict and the dataset to compare against.
dataset (str) – one of [‘train’, ‘dev’, ‘test’, ‘dev_larger’, …].
metric (Union[Callable, str]) – a metric, such as scipy.stats.spearmanr, that can be used to compare the predictions. Either a function or a string to which there is a function associated.

evaluate_accuracy(predictions, dataset='test', word=None)[source]¶

Parameters:: predictions (List[Dict] | Dict)

evaluate_f1(predictions, dataset='test', word=None, average='macro')[source]¶

Parameters:: predictions (List[Dict] | Dict)

class languagechange.benchmark.WSI(wsi_data=None, dataset=None, version=None, language=None, name=None)[source]¶

Bases: Benchmark

Dataset handling for the Word Sense Induction (WSI) task.

Parameters:

wsi_data (list | dict)
dataset (list | dict) – a data structure describing a WSI dataset.
version (str)
language (str)
name (str) – the name of the dataset (optional but useful for evaluation pipelines).

load_from_data(data)[source]¶

load_from_target_usages(target_usages, labels)[source]¶

Parameters:: target_usages (List[TargetUsage] | TargetUsageList)

evaluate(predictions, metrics={'ari', 'purity'}, dataset='all', average=False)[source]¶

Evaluates a clustering with respect to the true labels as given in self.data.

Parameters:

predictions ({str: str|int} | [str|int]) – a clustering as either a dictionary {id: cluster} or a list [cluster] of usage assignments. If it is a list, it is expected to be in the same order as the dataset evaluated on.
metrics (function|str) – the metric(s) to use for evaluation, such as RI, ARI or purity.
dataset (str) – the sub-dataset to use, e.g. ‘test’ or ‘all’.

Returns:

scores ({str: float}) – the score for each word.
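Of the metrics named above, purity has the simplest definition: each predicted cluster is credited with its most frequent gold label, and the credited items are divided by the total number of items. The sketch below implements that standard definition in plain Python. It is not taken from the library's source, and `purity_score` is a hypothetical name.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def purity_score(labels_true: Sequence[Hashable],
                 cluster_labels: Sequence[Hashable]) -> float:
    """Share of items that carry the majority gold label of their cluster."""
    if len(labels_true) != len(cluster_labels):
        raise ValueError("label sequences differ in length")
    # For every predicted cluster, count how often each gold label occurs in it.
    members = defaultdict(Counter)
    for gold, cluster in zip(labels_true, cluster_labels):
        members[cluster][gold] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(labels_true)


gold_senses = ["a", "a", "a", "b", "b", "c"]
predicted_clusters = [0, 0, 1, 1, 1, 1]
score = purity_score(gold_senses, predicted_clusters)

# A clustering that separates the senses exactly has purity 1.0.
perfect = purity_score(gold_senses, [7, 7, 7, 8, 8, 9])
```

Purity rewards many small clusters (one cluster per usage always scores 1.0), which is one reason to report it alongside a chance-corrected metric such as ARI.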

evaluate_ari(predictions, dataset='all', average=False)[source]¶

evaluate_purity(predictions, dataset='all', average=False)[source]¶

languagechange.cache module¶

class languagechange.cache.CacheManager(cache_dir=None)[source]¶

Bases: object

Manages cache files with atomic write operations to prevent data corruption in concurrent environments. The cache files are saved to a directory that can be specified during initialization.

atomic_write(path)[source]¶

Provides a context manager for writing to cache files in an atomic way. This ensures that partial writes do not corrupt the target file, especially when multiple processes or threads access the same file.

Parameters:

path (str) – The relative path to the cache file within the cache directory.

Yields:

file object – A writable file object for writing data. The file is temporary and will be renamed to the target file path after the write operation completes successfully.
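The write-to-temporary-then-rename pattern described above can be sketched with the standard library alone. This is an illustration of the general technique under stated assumptions, not the CacheManager source: `atomic_write_sketch` is a hypothetical name, and it takes a full path rather than a path relative to a cache directory.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write_sketch(target_path: str):
    """Yield a temporary file; rename it onto target_path only on success."""
    directory = os.path.dirname(os.path.abspath(target_path))
    os.makedirs(directory, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            yield handle
        # Readers see either the old file or the complete new one, never a partial write.
        os.replace(tmp_path, target_path)
    except BaseException:
        # The write failed: discard the partial file and leave the target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as cache_dir:
    target = os.path.join(cache_dir, "judgments.csv")
    with atomic_write_sketch(target) as f:
        f.write("id1,id2,label\n")
    with open(target, encoding="utf-8") as f:
        written = f.read()
    leftovers = [name for name in os.listdir(cache_dir) if name.endswith(".tmp")]
```

Keeping the temporary file next to the target matters: a rename across filesystems is a copy followed by a delete, which loses the atomicity guarantee.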

languagechange.corpora module¶

class languagechange.corpora.Line(raw_text=None, tokens=None, lemmas=None, pos_tags=None, fname=None, raw_lemma_text=None, raw_pos_text=None, **kwargs)[source]¶

Bases: object

tokens()[source]¶

lemmas()[source]¶

pos_tags()[source]¶

tokens_by_feature(feat=<class 'str'>)[source]¶

raw_text()[source]¶

raw_lemma_text()[source]¶

raw_pos_text()[source]¶

raw_text_by_feature(feat='token')[source]¶

search(search_term, time=None)[source]¶

Searches the line given a search_term.

Parameters:: search_term (SearchTerm)
Return type:: TargetUsageList

Returns: A TargetUsageList of all matches.

class languagechange.corpora.Corpus(name, language=None, time=no time specification, time_function=None, skip_lines=0, **args)[source]¶

Bases: object

set_sentences_iterator(sentences)[source]¶

search(search_terms)[source]¶

Searches through the corpora by calling Line.search() on all lines.

Parameters:: search_terms (List[str | Pattern | SearchTerm]) – the terms to search for. If a search term is a str or Pattern, it is converted to a SearchTerm that matches tokens only, i.e. SearchTerm(word_feature = ‘token’).
Return type:: UsageDictionary

Returns: A UsageDictionary containing all search results for each search term.
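To make the distinction between a plain string and a compiled pattern concrete, here is a small sketch of token-level matching that returns character offsets. It assumes whitespace tokenization and exact token matches, which may differ from what the library does; `find_token_matches` is a hypothetical helper and does not use SearchTerm or UsageDictionary.

```python
import re
from typing import List, Tuple, Union


def find_token_matches(text: str, term: Union[str, re.Pattern]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace tokens matching term."""
    matches = []
    for token in re.finditer(r"\S+", text):
        word = token.group()
        if isinstance(term, re.Pattern):
            hit = term.fullmatch(word) is not None  # regex must cover the whole token
        else:
            hit = word == term                      # plain strings match exactly
        if hit:
            matches.append((token.start(), token.end()))
    return matches


line = "The plane and the planes"
exact_hits = find_token_matches(line, "plane")
regex_hits = find_token_matches(line, re.compile(r"planes?"))
```

The plain string matches only the singular form, while the pattern also picks up the plural.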

tokenize(tokenizer='trankit', split_sentences=False, batch_size=128)[source]¶

lemmatize(lemmatizer='trankit', pretokenized=False, tokenize=False, split_sentences=False, batch_size=128)[source]¶

pos_tagging(pos_tagger='trankit', pretokenized=False, tokenize=False, split_sentences=False, batch_size=128)[source]¶

tokens_lemmas_pos_tags(nlp_model='trankit', tokens=True, split_sentences=False, batch_size=128)[source]¶

segment_sentences(segmentizer='trankit', batch_size=128)[source]¶

folder_iterator(path)[source]¶

cast_to_vertical(vertical_corpus)[source]¶

save()[source]¶

save_tokenized_corpora(tokens=True, lemmas=False, pos=False, save_format='linebyline', file_specification=None, file_ending='.txt', tokenizer='trankit', lemmatizer='trankit', pos_tagger='trankit', split_sentences=True, batch_size=128)[source]¶

Parameters:: corpora (Self | List[Self])

class languagechange.corpora.LinebyLineCorpus(path, **kwargs)[source]¶

Bases: Corpus

line_iterator()[source]¶

class languagechange.corpora.VerticalCorpus(path, sentence_separator='\n', field_separator='\t', field_map={'lemma': 1, 'pos_tag': 2, 'token': 0}, **args)[source]¶

Bases: Corpus

line_iterator()[source]¶

class languagechange.corpora.XMLCorpus(path, sentence_tag='sentence', token_tag='token', is_lemmatized=False, lemma_tag=None, is_pos_tagged=False, pos_tag_tag=None, text_tag='text', **args)[source]¶

Bases: Corpus

get_attribute(tag, attribute)[source]¶

line_iterator()[source]¶

cast_to_linebyline(linebyline_corpus)[source]¶

Parameters:: linebyline_corpus (LinebyLineCorpus)

cast_to_vertical(vertical_corpus)[source]¶

Parameters:: vertical_corpus (VerticalCorpus)

class languagechange.corpora.SprakBankenCorpus(path, sentence_tag='sentence', token_tag='token', is_lemmatized=True, lemma_tag='lemma', is_pos_tagged=True, pos_tag_tag='pos', **args)[source]¶

Bases: XMLCorpus

get_attribute(tag, attribute)[source]¶

class languagechange.corpora.HistoricalCorpus(*args, **kwargs)[source]¶

Bases: SortedKeyList

line_iterator()[source]¶: Iterates through all of the corpora, and yields all of the lines that are possible to get.

search(search_terms, index_by_corpus=False)[source]¶

Searches through all of the corpora by calling search() for each of them.

Parameters:

search_terms (List[str | Pattern | SearchTerm]) – the terms to search for. If a search term is a str or Pattern, it is converted to a SearchTerm that matches tokens only, i.e. SearchTerm(word_feature = ‘token’).
index_by_corpus (bool, default False) – decides whether the usages for a given word should be a dictionary, with corpus names as keys and lists of usages as values, or a single list of all usages across corpora.

Returns: a dictionary containing all search results from the included corpora.
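The two result shapes selected by index_by_corpus can be pictured with plain dictionaries. The sketch below is only a model of the documented behaviour: the corpus names, the usage strings, and the helper `merge_results` are hypothetical, and real results hold usage objects rather than strings.

```python
from typing import Dict, List

# Hypothetical per-corpus search results: corpus name -> word -> list of usages.
per_corpus: Dict[str, Dict[str, List[str]]] = {
    "corpus_1900": {"plane": ["usage A", "usage B"]},
    "corpus_2000": {"plane": ["usage C"]},
}


def merge_results(results: Dict[str, Dict[str, List[str]]],
                  index_by_corpus: bool = False) -> dict:
    """Combine per-corpus results into one dictionary keyed by word."""
    merged: dict = {}
    for corpus_name, usages_by_word in results.items():
        for word, usages in usages_by_word.items():
            if index_by_corpus:
                # word -> {corpus name: usages from that corpus}
                merged.setdefault(word, {})[corpus_name] = list(usages)
            else:
                # word -> one flat list of usages across all corpora
                merged.setdefault(word, []).extend(usages)
    return merged


flat = merge_results(per_corpus)
nested = merge_results(per_corpus, index_by_corpus=True)
```

The nested form keeps track of which period each usage comes from, which is what a diachronic comparison needs; the flat form is convenient when the corpora are treated as one pool.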

languagechange.evaluation module¶

class languagechange.evaluation.Evaluation[source]¶

Bases: object

evaluate_binary()[source]¶

evaluate_graded()[source]¶

print_tsv(input_result, output_source)[source]¶

print_lateX(input_result, output_source)[source]¶

print_console(input_result, output_source)[source]¶

class languagechange.evaluation.DWUGEvaluation[source]¶: Bases: Evaluation

languagechange.resource_manager module¶

class languagechange.resource_manager.LanguageChange[source]¶

Bases: object

load_resources_hub()[source]¶

download_ui()[source]¶

download(resource_type, resource_name, dataset, version)[source]¶

get_resource(resource_type, resource_name, dataset, version)[source]¶

save_resource(resource_type, resource_name, dataset, version)[source]¶

languagechange.search module¶

languagechange.search.expand_dictionary(words)[source]¶

Parameters:: words (List[str])

class languagechange.search.SearchTerm(term, regex=False, word_feature='token')[source]¶

Bases: object

Parameters:

term (str)
regex (bool)
word_feature (str | Set)

VALID_WORD_FEATURES = ['lemma', 'token', 'pos']¶

languagechange.usages module¶

class languagechange.usages.POS(*values)[source]¶

Bases: Enum

NOUN = 1¶

VERB = 2¶

ADJECTIVE = 3¶

ADVERB = 4¶

class languagechange.usages.Target(target)[source]¶

Bases: object

Parameters:: target (str)

set_lemma(lemma)[source]¶

Parameters:: lemma (str)

set_pos(pos)[source]¶

Parameters:: pos (POS)

class languagechange.usages.TargetUsage(text, offsets, time=None, **kwargs)[source]¶

Bases: object

Parameters:

text (str)
offsets (str)
time (Time)

text()[source]¶

start()[source]¶

end()[source]¶

time()[source]¶

to_dict()[source]¶

class languagechange.usages.DWUGUsage(target, date, grouping, identifier, description, **args)[source]¶: Bases: TargetUsage

class languagechange.usages.TargetUsageList(iterable=(), /)[source]¶

Bases: list

save(path, target)[source]¶

load(target)[source]¶

time_axis()[source]¶

to_dict()[source]¶

class languagechange.usages.UsageDictionary[source]¶

Bases: dict

save(path, words={})[source]¶

load(path, words={})[source]¶

languagechange.utils module¶

class languagechange.utils.Time[source]¶: Bases: object

class languagechange.utils.LiteralTime(time)[source]¶

Bases: Time

Parameters:: time (str)

class languagechange.utils.NumericalTime(time)[source]¶

Bases: Time

Parameters:: time (Number)

class languagechange.utils.TimeInterval(start, end)[source]¶

Bases: Time

Parameters:

start (Time)
end (Time)
