languagechange package¶

Subpackages¶

languagechange.models package

Submodules¶

languagechange.benchmark module¶

languagechange.cache module¶

Cache manager with atomic write helpers for file-based caches.

class languagechange.cache.CacheManager(cache_dir=None)[source]¶

Bases: object

Manages cache files with atomic write operations to prevent data corruption in concurrent environments. The cache files are saved to a directory that can be specified during initialization.

atomic_write(path)[source]¶

Provides a context manager for writing to cache files in an atomic way. This ensures that partial writes do not corrupt the target file, especially when multiple processes or threads access the same file.

Parameters:

path (str) – The relative path to the cache file within the cache directory.

Yields:

file object –

A writable file object for writing data. The file is temporary and is renamed to the target file path only after the write operation completes successfully.
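The write-then-rename behaviour described above can be sketched with the standard library alone. This is a minimal illustration of the general pattern, not the CacheManager implementation: the function name atomic_write_sketch, its cache_dir argument, and the example file names are assumptions made for this sketch.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write_sketch(cache_dir, path):
    """Hypothetical stand-in for an atomic write helper (not the library's code)."""
    target = os.path.join(cache_dir, path)
    target_dir = os.path.dirname(target)
    os.makedirs(target_dir, exist_ok=True)
    # Create the temporary file next to the target so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            yield handle
        # Reached only if the caller's block finished without an exception.
        os.replace(tmp_path, target)
    except BaseException:
        # Discard the partial file and leave any existing target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


cache_dir = tempfile.mkdtemp()

# Successful write: the target appears only after the block exits.
with atomic_write_sketch(cache_dir, "embeddings/index.txt") as handle:
    handle.write("cached value\n")

# Failed write: the exception propagates and no target file is created.
failed = False
try:
    with atomic_write_sketch(cache_dir, "embeddings/broken.txt") as handle:
        handle.write("partial")
        raise RuntimeError("simulated crash")
except RuntimeError:
    failed = True
```

Readers of the target path see either the previous complete file or the new complete file, because the rename is the only step that touches the target.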

languagechange.corpora module¶

Corpus utilities for line-level corpora and search helpers.

class languagechange.corpora.Line(raw_text=None, tokens=None, lemmas=None, pos_tags=None, fname=None, raw_lemma_text=None, raw_pos_text=None, **kwargs)[source]¶

Bases: object

Wraps a corpus line with token, lemma, and POS metadata.

tokens()[source]¶

lemmas()[source]¶

pos_tags()[source]¶

tokens_by_feature(feat=<class 'str'>)[source]¶

raw_text()[source]¶

raw_lemma_text()[source]¶

raw_pos_text()[source]¶

raw_text_by_feature(feat='token')[source]¶

search(search_term, time=None)[source]¶

Searches the line for matches of search_term.

Parameters: search_term (SearchTerm) – The term to search for.
Return type: TargetUsageList

Returns: A TargetUsageList of all matches.

class languagechange.corpora.Corpus(name, language=None, time=no time specification, time_function=None, skip_lines=0, **args)[source]¶

Bases: object

Base interface for corpora that support search and tokenization.

set_sentences_iterator(sentences)[source]¶

search(search_terms)[source]¶

Searches through the corpus by calling Line.search() on every line.

Parameters: search_terms (List[str | Pattern | SearchTerm]) – Terms to search for. A str or Pattern is converted to a SearchTerm that matches tokens only, i.e. SearchTerm(word_feature='token').
Return type: UsageDictionary

Returns: A UsageDictionary containing all search results for each search term.
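The conversion of plain strings and compiled patterns into token-level search terms can be illustrated with a self-contained sketch. Everything here is hypothetical: TermSketch stands in for SearchTerm, lines are modelled as plain dictionaries of per-token features, and full-string regex matching is an assumption rather than documented library behaviour.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TermSketch:
    """Hypothetical stand-in for SearchTerm."""
    term: str
    regex: bool = False
    word_feature: str = "token"


def normalize(term):
    """Turn a str or compiled pattern into a token-level TermSketch."""
    if isinstance(term, TermSketch):
        return term
    if isinstance(term, re.Pattern):
        return TermSketch(term.pattern, regex=True)
    return TermSketch(term)


def search_lines(lines, search_terms):
    """Map each term to (line index, token index) pairs where it matches."""
    results = {}
    for term in map(normalize, search_terms):
        hits = []
        for line_no, line in enumerate(lines):
            for position, value in enumerate(line[term.word_feature]):
                if term.regex:
                    # Assumption: a regex must match the whole value.
                    matched = re.fullmatch(term.term, value) is not None
                else:
                    matched = value == term.term
                if matched:
                    hits.append((line_no, position))
        results[term.term] = hits
    return results


lines = [
    {"token": ["The", "mice", "ran"], "lemma": ["the", "mouse", "run"]},
    {"token": ["A", "mouse", "runs"], "lemma": ["a", "mouse", "run"]},
]
results = search_lines(
    lines,
    ["mouse", re.compile(r"r[au]ns?"), TermSketch("run", word_feature="lemma")],
)
```

In this sketch the bare string "mouse" only finds the surface form in the second line, while the lemma-level term "run" finds both inflected forms.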

tokenize(tokenizer='trankit', split_sentences=False, batch_size=128)[source]¶

Yield tokenized sentences using Trankit, optionally splitting sentences.

Parameters:

tokenizer (str, optional) – Tokenizer backend. Defaults to “trankit”.
split_sentences (bool, optional) – Split paragraphs into sentences. Defaults to False.
batch_size (int, optional) – Number of lines to accumulate before processing. Defaults to 128.
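The role of batch_size can be pictured as grouping lines before handing them to the tokenizer backend. The sketch below is an assumption about the general shape of such a loop, not the library's code; batched_lines, tokenize_in_batches, and the whitespace tokenizer used in place of Trankit are all hypothetical.

```python
from itertools import islice


def batched_lines(lines, batch_size=128):
    """Yield lists of at most batch_size lines from any iterable."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def tokenize_in_batches(lines, tokenizer, batch_size=128):
    """Run tokenizer on each batch and yield one token list per line."""
    for batch in batched_lines(lines, batch_size):
        for tokens in tokenizer(batch):
            yield tokens


def whitespace_tokenizer(batch):
    """Placeholder backend: splits on whitespace instead of calling Trankit."""
    return [line.split() for line in batch]


corpus_lines = ["the cat sat", "dogs bark", "a b c d", "last line"]
tokenized = list(tokenize_in_batches(corpus_lines, whitespace_tokenizer, batch_size=3))
```

A larger batch size means fewer calls to the backend at the cost of holding more lines in memory at once.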

lemmatize(lemmatizer='trankit', pretokenized=False, tokenize=False, split_sentences=False, batch_size=128)[source]¶

pos_tagging(pos_tagger='trankit', pretokenized=False, tokenize=False, split_sentences=False, batch_size=128)[source]¶

tokens_lemmas_pos_tags(nlp_model='trankit', tokens=True, split_sentences=False, batch_size=128)[source]¶

segment_sentences(segmentizer='trankit', batch_size=128)[source]¶

folder_iterator(path)[source]¶

cast_to_vertical(vertical_corpus)[source]¶

save()[source]¶

save_tokenized_corpora(tokens=True, lemmas=False, pos=False, save_format='linebyline', file_specification=None, file_ending='.txt', tokenizer='trankit', lemmatizer='trankit', pos_tagger='trankit', split_sentences=True, batch_size=128)[source]¶

Parameters: corpora (Self | List[Self])

class languagechange.corpora.LinebyLineCorpus(path, **kwargs)[source]¶

Bases: Corpus

line_iterator()[source]¶

class languagechange.corpora.VerticalCorpus(path, sentence_separator='\n', field_separator='\t', field_map={'lemma': 1, 'pos_tag': 2, 'token': 0}, **args)[source]¶

Bases: Corpus

line_iterator()[source]¶

class languagechange.corpora.XMLCorpus(path, sentence_tag='sentence', token_tag='token', is_lemmatized=False, lemma_tag=None, is_pos_tagged=False, pos_tag_tag=None, text_tag='text', **args)[source]¶

Bases: Corpus

get_attribute(tag, attribute)[source]¶

line_iterator()[source]¶

cast_to_linebyline(linebyline_corpus)[source]¶

Parameters: linebyline_corpus (LinebyLineCorpus)

cast_to_vertical(vertical_corpus)[source]¶

Parameters: vertical_corpus (VerticalCorpus)

class languagechange.corpora.SprakBankenCorpus(path, sentence_tag='sentence', token_tag='token', is_lemmatized=True, lemma_tag='lemma', is_pos_tagged=True, pos_tag_tag='pos', **args)[source]¶

Bases: XMLCorpus

get_attribute(tag, attribute)[source]¶

class languagechange.corpora.HistoricalCorpus(*args, **kwargs)[source]¶

Bases: SortedKeyList

line_iterator()[source]¶

Iterates through all of the corpora and yields every line they provide.

search(search_terms, index_by_corpus=False)[source]¶

Searches through all of the corpora by calling search() for each of them.

Parameters:

search_terms (List[str | Pattern | SearchTerm]) – Terms to search for. A str or Pattern is converted to a SearchTerm that matches tokens only, i.e. SearchTerm(word_feature='token').
index_by_corpus (bool, optional) – If True, the usages for each word are returned as a dictionary with corpus names as keys and lists of usages as values. If False, they are returned as a single list of all usages across corpora. Defaults to False.

Returns: A dictionary containing all search results from the included corpora.
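The two result shapes selected by index_by_corpus can be shown with a small merge function. This is a sketch under stated assumptions: merge_results is a hypothetical helper, usages are represented by plain strings, and the corpus names are placeholders.

```python
def merge_results(per_corpus_results, index_by_corpus=False):
    """Combine {corpus name: {word: usages}} into one dictionary keyed by word."""
    merged = {}
    for corpus_name, usages_by_word in per_corpus_results.items():
        for word, usages in usages_by_word.items():
            if index_by_corpus:
                # word -> {corpus name -> list of usages}
                merged.setdefault(word, {})[corpus_name] = list(usages)
            else:
                # word -> flat list of usages across all corpora
                merged.setdefault(word, []).extend(usages)
    return merged


per_corpus = {
    "corpus_1800": {"gay": ["usage A", "usage B"]},
    "corpus_1900": {"gay": ["usage C"], "cell": ["usage D"]},
}
flat = merge_results(per_corpus)
by_corpus = merge_results(per_corpus, index_by_corpus=True)
```

The flat form suits analyses that pool all usages of a word, while the indexed form keeps the corpus of origin available for comparisons between time periods.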

languagechange.evaluation module¶

languagechange.resource_manager module¶

Resource manager that downloads and caches datasets and models.

class languagechange.resource_manager.LanguageChange[source]¶

Bases: object

load_resources_hub()[source]¶

Refresh the resource hub index from GitHub.

download_ui()[source]¶

Interactive prompt for selecting and downloading resources.

download(resource_type, resource_name, dataset, version)[source]¶

Download and cache a resource from the resource hub.

get_resource(resource_type, resource_name, dataset, version)[source]¶

Return the path to a cached resource, downloading it if necessary.
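The "return the cached path, download on a miss" behaviour of get_resource follows a common pattern that can be sketched without network access. The directory layout, the get_resource_sketch name, the fetch callback, and the example resource names below are all assumptions for illustration and do not describe how the resource manager stores its files.

```python
import os
import tempfile


def get_resource_sketch(cache_root, resource_type, resource_name, dataset, version, fetch):
    """Return a local path, calling fetch(path) only on a cache miss."""
    # Assumed layout for this sketch: one nested directory per argument.
    path = os.path.join(cache_root, resource_type, resource_name, dataset, version)
    if not os.path.exists(path):
        os.makedirs(path)
        fetch(path)
    return path


calls = []


def fake_fetch(path):
    """Placeholder for a real download: records the call and writes a file."""
    calls.append(path)
    with open(os.path.join(path, "data.txt"), "w", encoding="utf-8") as handle:
        handle.write("placeholder")


cache_root = tempfile.mkdtemp()
first = get_resource_sketch(cache_root, "benchmarks", "example", "demo", "1.0", fake_fetch)
second = get_resource_sketch(cache_root, "benchmarks", "example", "demo", "1.0", fake_fetch)
```

The second call returns the same path without invoking the fetch callback again, which is the property a caller relies on when requesting the same resource repeatedly.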

save_resource(resource_type, resource_name, dataset, version)[source]¶

Create a local resource directory for persistent storage.

languagechange.search module¶

Helper utilities for searching corpora for target terms.

languagechange.search.expand_dictionary(words)[source]¶

Placeholder for future dictionary expansion utilities.

Parameters: words (List[str]) – Words to expand into additional search terms.

class languagechange.search.SearchTerm(term, regex=False, word_feature='token')[source]¶

Bases: object

Describes a search target and the features to scan within a corpus line.

Parameters:

term (str)
regex (bool)
word_feature (str | Set)

VALID_WORD_FEATURES = ['lemma', 'token', 'pos']¶

languagechange.usages module¶

Target usage helpers and containers for LanguageChange.

class languagechange.usages.POS(*values)[source]¶

Bases: Enum

Enumeration of supported parts of speech for targets.

NOUN = 1¶

VERB = 2¶

ADJECTIVE = 3¶

ADVERB = 4¶

class languagechange.usages.Target(target)[source]¶

Bases: object

Stores a target word together with optional metadata.

Parameters: target (str)

set_lemma(lemma)[source]¶

Parameters: lemma (str)

set_pos(pos)[source]¶

Parameters: pos (POS)

class languagechange.usages.TargetUsage(text, offsets, time=None, **kwargs)[source]¶

Bases: object

Represents an individual usage with offsets and optional time metadata.

Parameters:

text (str)
offsets (str)
time (Time)

text()[source]¶

start()[source]¶

end()[source]¶

time()[source]¶

to_dict()[source]¶

class languagechange.usages.DWUGUsage(target, date, grouping, identifier, description, **args)[source]¶

Bases: TargetUsage

DWUG-specific usage metadata, including annotator judgments.

class languagechange.usages.TargetUsageList(iterable=(), /)[source]¶

Bases: list

List of TargetUsage instances with serialization helpers.

save(path, target)[source]¶

load(target)[source]¶

time_axis()[source]¶

to_dict()[source]¶

class languagechange.usages.UsageDictionary[source]¶

Bases: dict

Dictionary mapping words to TargetUsageList instances.

save(path, words={})[source]¶

load(path, words={})[source]¶

languagechange.utils module¶

Simple time representations used across the LanguageChange toolkit.

class languagechange.utils.Time[source]¶

Bases: object

class languagechange.utils.LiteralTime(time)[source]¶

Bases: Time

Represents a literal timestamp or label for usage references.

Parameters: time (str)

class languagechange.utils.NumericalTime(time)[source]¶

Bases: Time

Numeric timestamp (e.g., time slice) that supports comparisons.

Parameters: time (Number)

class languagechange.utils.TimeInterval(start, end)[source]¶

Bases: Time

Represents an interval between two Time points.

Parameters:

start (Time)
end (Time)

Module contents¶

Core exports for the LanguageChange toolkit.