languagechange.models.representation package¶

Submodules¶

languagechange.models.representation.alignment module¶

class languagechange.models.representation.alignment.OrthogonalProcrustes(savepath1, savepath2)[source]¶

Bases: object

A class to align word embeddings using the Orthogonal Procrustes method.

This method aligns two embedding spaces by finding an optimal orthogonal transformation.

Parameters:

savepath1 (str)
savepath2 (str)
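
For reference, the orthogonal Procrustes problem has a well-known closed-form solution. The notation is an assumption made for illustration: X and Y are matrices whose rows hold the vectors of the same words in the two spaces. How this class selects those words is not stated here.

```latex
W^{*} = \arg\min_{W \,:\, W^{\top} W = I} \lVert X W - Y \rVert_F ,
\qquad
W^{*} = U V^{\top}
\quad \text{where} \quad
U \Sigma V^{\top} = \operatorname{SVD}\!\left( X^{\top} Y \right)
```

Because W is orthogonal, the mapping preserves dot products and vector norms within the mapped space.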

align(model1, model2, encoding='utf-8', precision='fp32', cuda=False, batch_size=10000, seed=0, supervised=None, semi_supervised=None, identical=False, unsupervised=False, acl2018=False, aaai2018=None, acl2017=False, acl2017_seed=None, emnlp2016=None, init_dictionary=0, init_identical=True, init_numerals=False, init_unsupervised=False, unsupervised_vocab=0, normalize=['unit'], whiten=False, src_reweight=0, trg_reweight=0, src_dewhiten=None, trg_dewhiten=None, dim_reduction=0, orthogonal=True, unconstrained=False, self_learning=False, vocabulary_cutoff=0, direction='union', csls_neighborhood=0, threshold=1e-06, validation=None, stochastic_initial=0.1, stochastic_multiplier=2.0, stochastic_interval=50, log=None, verbose=False)[source]¶

Perform orthogonal alignment between two embedding models using a subprocess.

Parameters:

model1 (StaticModel) – The first static word embedding model to align.
model2 (StaticModel) – The second static word embedding model to align.
batch_size (int)
unsupervised_vocab (int)
src_reweight (float)
trg_reweight (float)
dim_reduction (int)
vocabulary_cutoff (int)
csls_neighborhood (int)
threshold (float)
stochastic_initial (float)
stochastic_multiplier (float)
stochastic_interval (int)

languagechange.models.representation.contextualized module¶

languagechange.models.representation.contextualized.generate_cache_key(target_usages)[source]¶: Generate a unique cache key based on the input data.
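
One common way to derive such a key is to hash a deterministic serialization of the input. The sketch below is not the library's implementation: the helper name and the (text, start, end) tuple layout are hypothetical stand-ins for TargetUsage objects.

```python
import hashlib
import json


def sketch_cache_key(usages):
    """Hypothetical helper: hash a list of (text, start, end) tuples."""
    # Serialize deterministically so that equal inputs give equal keys.
    payload = json.dumps([list(u) for u in usages], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = sketch_cache_key([("The bank of the river", 4, 8)])
key_b = sketch_cache_key([("The bank of the river", 4, 8)])
key_c = sketch_cache_key([("The bank approved the loan", 4, 8)])
```

Identical inputs map to the same key, so cached embeddings can be looked up again, while any change in the usages gives a different key.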

class languagechange.models.representation.contextualized.ContextualizedModel(device='cuda', n_extra_tokens=0, cache_dir='~/.cache/languagechange/contextualized', *args, **kwargs)[source]¶

Bases: object

Abstract base class for contextualized embedding models.

Parameters:

device (str)
n_extra_tokens (int)

device¶

The device to run the model on (‘cuda’ or ‘cpu’).

Type:: str

n_extra_tokens¶

Additional tokens to consider during encoding.

Type:: int

abstractmethod encode(target_usages, batch_size=8)[source]¶

Encode target usages to generate embeddings.

Parameters:

target_usages (Union[TargetUsage, List[TargetUsage]]) – Usage data to encode.
batch_size (int) – Batch size for encoding. Defaults to 8.

Returns:

Encoded embeddings.

Return type:

np.array

Raises:

ValueError – If batch_size is not an integer.
ValueError – If target_usages is not a valid type.
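
The validation and batching behaviour described above might look roughly like the following. This is a minimal sketch under assumptions: the helper name is hypothetical, and plain strings stand in for TargetUsage objects.

```python
def iter_batches(usages, batch_size=8):
    """Hypothetical helper: validate the inputs, then yield consecutive batches."""
    if not isinstance(batch_size, int):
        raise ValueError("batch_size must be an integer")
    if isinstance(usages, str):
        # A single usage is wrapped in a list (str stands in for TargetUsage).
        usages = [usages]
    elif not isinstance(usages, list):
        raise ValueError("usages must be a usage or a list of usages")
    for start in range(0, len(usages), batch_size):
        yield usages[start:start + batch_size]


batches = list(iter_batches(["u1", "u2", "u3", "u4", "u5"], batch_size=2))
```

The last batch may be shorter than batch_size when the number of usages is not a multiple of it.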

class languagechange.models.representation.contextualized.ContextualizedEmbeddings[source]¶

Bases: object

Class to manage contextualized embeddings.

static from_usages(target_usages, raw_embedding)[source]¶

Parameters:

target_usages (List[TargetUsage])
raw_embedding (numpy.array)

class languagechange.models.representation.contextualized.XL_LEXEME(pretrained_model='pierluigic/xl-lexeme', device='cuda', n_extra_tokens=0)[source]¶

Bases: ContextualizedModel

Contextualized model for XL-LEXEME embeddings.

Parameters:

pretrained_model (str)
device (str)
n_extra_tokens (int)

encode(target_usages, batch_size=8)[source]¶

Encode target usages with XL_LEXEME model.

Parameters:

target_usages (Union[TargetUsage, List[TargetUsage]]) – Usage data to encode.
batch_size (int) – Batch size for encoding. Defaults to 8.

Returns:

Encoded embeddings.

Return type:

np.array

class languagechange.models.representation.contextualized.BERT(pretrained_model, device='cuda', n_extra_tokens=2)[source]¶

Bases: ContextualizedModel

Contextualized model for BERT embeddings.

Parameters:

pretrained_model (str)
device (str)
n_extra_tokens (int)

split_context(target_usage)[source]¶

Split the target usage into left, target, and right context tokens.

Parameters:: target_usage (TargetUsage) – The usage data.
Returns:: Tokenized left, target, and right context.
Return type:: Tuple[List[str], List[str], List[str]]

center_usage(left_tokens, target_tokens, right_tokens)[source]¶

Adjust tokens to fit within the model’s maximum sequence length.

Parameters:

left_tokens (List[str]) – Tokens from left context.
target_tokens (List[str]) – Tokens from target usage.
right_tokens (List[str]) – Tokens from right context.

Returns:

Trimmed left, target, and right tokens.

Return type:

Tuple[List[str], List[str], List[str]]
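
A trimming strategy of this kind can be sketched as follows. The helper name, the max_len argument, and the even split of the budget between both sides are assumptions; the actual method may distribute the available length differently.

```python
def center_tokens(left, target, right, max_len):
    """Hypothetical helper: trim context so the total fits in max_len tokens."""
    budget = max_len - len(target)
    if budget < 0:
        raise ValueError("target alone exceeds max_len")
    # Give each side half of the budget, then hand unused room to the other side.
    keep_left = min(len(left), budget // 2)
    keep_right = min(len(right), budget - keep_left)
    keep_left = min(len(left), budget - keep_right)
    # Keep the tokens closest to the target on both sides.
    trimmed_left = left[len(left) - keep_left:]
    trimmed_right = right[:keep_right]
    return trimmed_left, target, trimmed_right


l, t, r = center_tokens(list("abcdef"), ["T"], ["x", "y"], max_len=6)
```

Here the right context is short, so the left side keeps three tokens instead of two and the result uses the full six-token budget.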

add_special_tokens(left_tokens, target_tokens, right_tokens)[source]¶

Add special tokens to the tokenized sequences.

Parameters:

left_tokens (List[str]) – Left context tokens.
target_tokens (List[str]) – Target tokens.
right_tokens (List[str]) – Right context tokens.

Returns:

Tokenized sequences with special tokens.

Return type:

Tuple[List[str], List[str], List[str]]
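
As an illustration only, assuming BERT-style [CLS] and [SEP] markers placed at the two ends of the sequence. Which tokens the class actually inserts, and where, is not stated here; the helper name is hypothetical.

```python
def with_special_tokens(left, target, right, cls="[CLS]", sep="[SEP]"):
    """Hypothetical helper: prepend cls to the left context, append sep to the right."""
    return [cls] + left, target, right + [sep]


left, target, right = with_special_tokens(["the"], ["bank"], ["was", "closed"])
sequence = left + target + right
```

Two markers are added in total, which would be consistent with the default n_extra_tokens=2 in the class signature.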

process_input_tokens(tokens)[source]¶

Convert tokens to input IDs and attention masks for the model.

Parameters:: tokens (List[str]) – Tokens to be processed.
Returns:: Input IDs, attention masks, and token type IDs.
Return type:: dict[str, Union[list[int], Any]]

batch_encode(target_usages)[source]¶

Encode a batch of target usages and generate embeddings.

Parameters:: target_usages (List[TargetUsage]) – List of target usages.
Returns:: Batch of encoded embeddings.
Return type:: np.array

encode(target_usages, batch_size=8)[source]¶

Encode target usages in batches.

Parameters:

target_usages (Union[TargetUsage, List[TargetUsage]]) – List of target usages.
batch_size (int) – Batch size for encoding. Defaults to 8.

Returns:

Array of encoded embeddings.

Return type:

np.array

class languagechange.models.representation.contextualized.RoBERTa(pretrained_model, device='cuda', n_extra_tokens=2)[source]¶

Bases: BERT

Contextualized model for RoBERTa embeddings, inheriting from BERT.

Parameters:

pretrained_model (str)
device (str)
n_extra_tokens (int)

languagechange.models.representation.definition module¶

class languagechange.models.representation.definition.Message[source]¶

Bases: TypedDict

role: Literal['system', 'user']¶

content: str¶
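
A short sketch of how messages with these two fields can be assembled into a prompt. The Message class is re-declared here so the example is self-contained, and build_messages is a hypothetical helper, not part of the module.

```python
from typing import List, Literal, TypedDict


class Message(TypedDict):
    # Mirrors the two fields documented above.
    role: Literal["system", "user"]
    content: str


def build_messages(system_message: str, user_message: str) -> List[Message]:
    """Hypothetical helper: pair a system prompt with a user prompt."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]


messages = build_messages("You are a lexicographer.", "Define the word 'bank'.")
```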

class languagechange.models.representation.definition.DefinitionGenerator(embedding_model='all-mpnet-base-v2')[source]¶

Bases: object

Parameters:: embedding_model (str)

encode_definitions(definitions, encode='both')[source]¶

class languagechange.models.representation.definition.LlamaDefinitionGenerator(model_name, ft_model_name, hf_token, max_length=512, batch_size=32, max_time=4.5, temperature=1e-05, embedding_model='all-mpnet-base-v2', torch_dtype=torch.float16, low_cpu_mem_usage=False)[source]¶

Bases: DefinitionGenerator

A tool to create short, clear definitions for words based on example sentences using fine-tuned Llama models.

Parameters:

model_name (str)
ft_model_name (str)
hf_token (str)
max_length (int)
batch_size (int)
max_time (float)
temperature (float)
embedding_model (str)

model_name¶

Name of the base model (e.g., “meta-llama/Llama-2-7b-chat-hf”).

Type:: str

ft_model_name¶

Name of the fine-tuned model (e.g., “FrancescoPeriti/Llama2Dictionary”).

Type:: str

hf_token¶

Hugging Face token for authentication.

Type:: str

max_length¶

Maximum token length for the prompt.

Type:: int

batch_size¶

How many examples to process at once.

Type:: int

max_time¶

Maximum time (in seconds) allowed per batch.

Type:: float

temperature¶

Generation temperature.

Type:: float

model¶: The loaded fine-tuned model ready for generation.

tokenizer¶: The tokenizer that prepares text for the model.

eos_tokens¶: Tokens that signal the end of a definition.

apply_chat_template(dataset, system_message, template)[source]¶

tokenize(dataset)[source]¶

extract_definition(answer)[source]¶

Extracts the actual definition from the model’s response based on model type.

Parameters:: answer (str) – The text generated by the model.
Returns:: A cleaned-up definition string with proper formatting.
Return type:: str

Notes

For Llama-2: Extracts text after ‘[/INST]’.
For Llama-3: Takes the last line.
Warns if output appears abnormal.
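
The two extraction rules in the notes can be sketched as below. The function name and the llama_version switch are assumptions, since the text does not say how the model type is detected, and the sketch skips trailing empty lines in the Llama-3 case.

```python
def extract_definition_sketch(answer, llama_version):
    """Hypothetical helper; llama_version is an assumed switch (2 or 3)."""
    if llama_version == 2:
        # Keep only what follows the last instruction-closing tag.
        text = answer.split("[/INST]")[-1]
    else:
        # Take the last non-empty line of the response.
        lines = [ln for ln in answer.splitlines() if ln.strip()]
        text = lines[-1] if lines else ""
    return text.strip()


d2 = extract_definition_sketch("[INST] Define bank [/INST] The land beside a river.", 2)
d3 = extract_definition_sketch("assistant\n\nA financial institution.\n", 3)
```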

generate_definitions(target_usages, system_message='You are a lexicographer familiar with providing concise definitions of word meanings.', template='Please provide a concise definition for the meaning of the word "{}" in the following sentence: {}', encode_definitions=None)[source]¶

Generates definitions for all examples in batches using the model.

Parameters:

target_usages (List[TargetUsage]) – A list of TargetUsage objects.
system_message (str) – The system prompt message.
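template (str) – The template for the user prompt, with two placeholders that are filled with the target word and the example sentence, in that order (the default template uses positional {} placeholders).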
encode_definitions (str)

Returns:

Generated definitions corresponding to each TargetUsage, as text, sentence embeddings, or both.

Return type:

Union[List[str], List[np.ndarray], Tuple[List[str], List[np.ndarray]]]
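
The default template in the signature above contains two positional {} placeholders. A minimal illustration of filling it with standard string formatting, assuming the target word comes first and the example sentence second, as the wording of the template suggests:

```python
template = (
    "Please provide a concise definition for the meaning of the word "
    '"{}" in the following sentence: {}'
)
# The example word and sentence are made up for illustration.
prompt = template.format("bank", "She sat on the bank of the river.")
```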

print_results(target_usages, definitions)[source]¶

Displays the target word, example sentence, and generated definition for each entry.

Parameters:

target_usages (List[TargetUsage]) – List of TargetUsage objects.
definitions (List[str]) – List of generated definitions.

Return type:

None

run(target_usages, system_message='You are a lexicographer familiar with providing concise definitions of word meanings.', template='Please provide a concise definition for the meaning of the word "{}" in the following sentence: {}')[source]¶

Executes the complete workflow from generating definitions to printing the results.

Parameters:

target_usages (List[TargetUsage]) – List of TargetUsage objects.
system_message (str) – The system prompt message.
template (str) – The template for the user prompt.

Return type:

None

class languagechange.models.representation.definition.DefinitionOutput(*args, **kwargs)[source]¶

Bases: BaseModel

Represents the structured output for a word definition.

Parameters:

args (Any)
kwargs (Any)

Return type:

Any

target¶

The target word.

Type:: str

example¶

The example sentence.

Type:: str

definition¶

The concise definition of the target word as used in the sentence.

Type:: str

class languagechange.models.representation.definition.ChatModelDefinitionGenerator(model_name, model_provider, langsmith_key=None, provider_key_name=None, provider_key=None, language=None, embedding_model='all-mpnet-base-v2')[source]¶

Bases: DefinitionGenerator

A model to generate concise definitions for target words using a chat model with structured output.

The model leverages an underlying chat model (initialized with LangChain) that returns a structured DefinitionOutput.

Parameters:

model_name (str)
model_provider (str)
langsmith_key (str)
provider_key_name (str)
provider_key (str)
language (str)
embedding_model (str)

generate_definitions(target_usages, user_prompt_template="Please provide a concise definition for the meaning of the word '{target}' as used in the following sentence:\nSentence: {example}", encode_definitions=None)[source]¶

Generates definitions for each TargetUsage using a chat model.

Parameters:

target_usages (List[TargetUsage]) – List of target usages.
user_prompt_template (str) – Template for the user prompt.
encode_definitions (str)

Returns:

Generated definitions corresponding to each TargetUsage, as text, sentence embeddings, or both.

Return type:

Union[List[str], List[np.ndarray], Tuple[List[str], List[np.ndarray]]]

class languagechange.models.representation.definition.T5DefinitionGenerator(model_path, bsize=4, max_length=256, filter_target=True, sampling=False, temperature=1.0, repetition_penalty=1.0, num_beams=1, num_beam_groups=1)[source]¶

Bases: DefinitionGenerator

Generates word definitions using a T5 model.

load_data(path_to_data, split='test')[source]¶

Load and preprocess data from a file or directory.

Parameters:

path_to_data (str) – Path to input file or directory.
split (str) – Data split (e.g., ‘test’, ‘trial’). Defaults to ‘test’.

Returns:

Preprocessed data with ‘Targets’ and ‘Real_Contexts’ columns.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If the specified file or directory doesn’t exist.

encode_definitions(df, encode='both', return_df=False)[source]¶

generate_definitions(target_usage_list=None, df=None, prompt_index=8, encode_definitions=None, return_df=False)[source]¶

Generate definitions for words in the DataFrame.

Parameters:

df (pd.DataFrame) – Data with ‘Targets’ and ‘Context’ or ‘Real_Contexts’ columns.
prompt_index (int) – Index of the prompt template to use. Defaults to 8.
encode_definitions (str)

Returns:

Updated DataFrame with ‘Generated_Definition’ column.

Return type:

pd.DataFrame

Raises:

ValueError – If prompt_index is invalid or required columns are missing.

save_definitions(df, output_file)[source]¶: Save the DataFrame with definitions to a TSV file.

languagechange.models.representation.prompting module¶

class languagechange.models.representation.prompting.SCFloat(*args, **kwargs)[source]¶

Bases: BaseModel

Parameters:

args (Any)
kwargs (Any)

Return type:

Any

class languagechange.models.representation.prompting.SCDURel(*args, **kwargs)[source]¶

Bases: BaseModel

Parameters:

args (Any)
kwargs (Any)

Return type:

Any

class languagechange.models.representation.prompting.PromptModel(model_name, model_provider, langsmith_key=None, provider_key_name=None, provider_key=None, structure='float', language=None, **kwargs)[source]¶

Bases: object

Parameters:

model_name (str)
model_provider (str)
langsmith_key (str)
provider_key_name (str)
provider_key (str)
structure (str | pydantic.BaseModel)
language (str)

get_response(target_usages, system_message='You are a lexicographer', user_prompt_template="Please provide a number measuring how different the meaning of the word '{target}' is between the following example sentences: \n1. {usage_1}\n2. {usage_2}", lemmatize=True)[source]¶

Takes as input two target usages and returns the degree of semantic change between them, using a chat model with structured output.

Parameters:

target_usages (List[TargetUsage]) – A list of target usages with the same target word.
system_message (str) – The system message to use in the prompt.
user_prompt_template (str) – Template to use for the user message in the prompt.
lemmatize (bool) – Whether the target word should be lemmatized in the prompt. Uses trankit to lemmatize.

Returns:

The degree of semantic change between the two instances of the target word, or the whole message content if the output is not structured.

Return type:

int or float or str
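
The default user_prompt_template uses the named placeholders {target}, {usage_1} and {usage_2}. The sketch below fills it with standard string formatting; the example word and sentences are made up, and how the method itself builds the prompt is not shown here.

```python
template = (
    "Please provide a number measuring how different the meaning of the word "
    "'{target}' is between the following example sentences: \n"
    "1. {usage_1}\n2. {usage_2}"
)
prompt = template.format(
    target="bank",
    usage_1="She sat on the bank of the river.",
    usage_2="The bank approved the loan.",
)
```

A custom template would presumably need to keep the same placeholder names.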

languagechange.models.representation.static module¶

class languagechange.models.representation.static.RepresentationModel[source]¶

Bases: ABC

Abstract base class for all representation models. Provides a template for encoding methods.

abstractmethod encode(*args, **kwargs)[source]¶: Abstract method for encoding data into a vector representation. Should be implemented by subclasses.

class languagechange.models.representation.static.StaticModel(matrix_path=None, format='w2v')[source]¶

Bases: RepresentationModel, dict

Base class for static word embedding models. Manages loading and accessing vector spaces.

matrix_path¶

Path to the matrix file.

Type:: str

format¶

Format of the matrix file (e.g., ‘w2v’, ‘npz’).

Type:: str

abstractmethod encode()[source]¶: Abstract method to perform encoding operations. Must be implemented in subclasses.

abstractmethod load()[source]¶: Load the vector space from the specified file.

matrix()[source]¶

Retrieve the entire matrix of word vectors.

Returns:: The matrix of word vectors.
Return type:: scipy.sparse.spmatrix
Raises:: Exception – If the space is not loaded.

row2word()[source]¶

Retrieve the mapping of row indices to words.

Returns:: List of words corresponding to matrix rows.
Return type:: list
Raises:: Exception – If the space is not loaded.

class languagechange.models.representation.static.CountModel(corpus, window_size, savepath)[source]¶

Bases: StaticModel

Count-based word embedding model that builds a co-occurrence matrix from a corpus.

Parameters:

corpus (LinebyLineCorpus)
window_size (int)
savepath (str)

corpus¶

The corpus to process.

Type:: LinebyLineCorpus

window_size¶

The size of the context window.

Type:: int

savepath¶

Path to save the generated matrix.

Type:: str

encode(is_len=False)[source]¶: Build a co-occurrence matrix from the corpus and save it to the specified path.
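
Co-occurrence counting with a symmetric context window can be sketched as follows. This is an illustration in plain Python, not the library's implementation; the helper name and the whitespace tokenization are assumptions.

```python
from collections import Counter


def cooccurrence_counts(sentences, window_size):
    """Hypothetical helper: count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts


counts = cooccurrence_counts(["the cat sat", "the cat ran"], window_size=1)
```

With window_size=1, "the" and "sat" never co-occur because they are two positions apart.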

class languagechange.models.representation.static.PPMI(count_model, shifting_parameter, smoothing_parameter, savepath)[source]¶

Bases: CountModel

Positive Pointwise Mutual Information (PPMI) model that transforms a co-occurrence matrix.

Parameters:

count_model (CountModel)
shifting_parameter (int)
smoothing_parameter (int)
savepath (str)

count_model¶

The count-based model to transform.

Type:: CountModel

shifting_parameter¶

Parameter to shift values after applying log weighting.

Type:: int

smoothing_parameter¶

Parameter to smooth the matrix values.

Type:: int

savepath¶

Path to save the PPMI matrix.

Type:: str

encode(is_len=False)[source]¶

Compute the smoothed and shifted PPMI matrix from a co-occurrence matrix. Smoothing is performed as described in

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Trans. ACL, 3.
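
A minimal sketch of shifted PPMI with context-distribution smoothing, following the formulation in the cited paper. The function and parameter names (shift, alpha) and their defaults are assumptions for illustration and do not necessarily match this class.

```python
import math
from collections import Counter


def sppmi(pair_counts, shift=1, alpha=0.75):
    """Hypothetical helper: max(PMI_alpha(w, c) - log(shift), 0) per pair."""
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n
    total = sum(pair_counts.values())
    # Smooth the context distribution by raising counts to the power alpha.
    smoothed = {c: n ** alpha for c, n in context_totals.items()}
    smoothed_total = sum(smoothed.values())
    scores = {}
    for (word, context), n in pair_counts.items():
        p_joint = n / total
        p_word = word_totals[word] / total
        p_context = smoothed[context] / smoothed_total
        pmi = math.log(p_joint / (p_word * p_context))
        # Shift by log(shift) and clip negative values to zero.
        scores[(word, context)] = max(pmi - math.log(shift), 0.0)
    return scores


pairs = {("a", "x"): 3, ("a", "y"): 1, ("b", "x"): 1, ("b", "y"): 3}
scores = sppmi(pairs)
```

Pairs that co-occur less often than chance, such as ("a", "y") here, receive a score of zero.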

class languagechange.models.representation.static.SVD(count_model, dimensionality, gamma, savepath)[source]¶

Bases: StaticModel

Singular Value Decomposition (SVD) model that reduces the dimensionality of a matrix.

Parameters:

count_model (CountModel)
dimensionality (int)
gamma (float)
savepath (str)

count_model¶

The input count-based model.

Type:: CountModel

dimensionality¶

Target dimensionality for the reduced matrix.

Type:: int

gamma¶

Weighting parameter for singular values.

Type:: float

savepath¶

Path to save the reduced matrix.

Type:: str

encode(is_len=False)[source]¶

Perform dimensionality reduction on a (normally PPMI) matrix by applying truncated SVD as described in

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Trans. ACL, 3.
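
In the cited paper, the truncated decomposition and the singular-value weighting take the form below. The symbols are assumptions for illustration: M is the input matrix, d the target dimensionality, and W the resulting word matrix, with the exponent corresponding to the gamma parameter of this class.

```latex
M \approx U_d \, \Sigma_d \, V_d^{\top},
\qquad
W = U_d \, \Sigma_d^{\gamma}
```

Setting gamma to 1 keeps the full singular-value weighting, while gamma equal to 0 discards it and uses U_d alone.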

class languagechange.models.representation.static.RandomIndexing[source]¶

Bases: StaticModel

Random Indexing model that creates low-dimensional vector spaces from a co-occurrence matrix.

window_size¶

Size of the context window for random indexing.

Type:: int

encode(is_len=False)[source]¶: Create a low-dimensional vector space from a co-occurrence matrix by sparse random indexing.
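
The general idea of sparse random indexing can be sketched as follows: each context word gets a sparse random index vector, and a word's vector is the count-weighted sum of the index vectors of its contexts. All names, the dimensionality, and the number of non-zero entries are assumptions for illustration, not the library's implementation.

```python
import random


def index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(pair_counts, dim=8, nonzeros=2, seed=0):
    """Hypothetical helper: project (word, context) counts to dim dimensions."""
    rng = random.Random(seed)
    contexts = sorted({c for _, c in pair_counts})
    index = {c: index_vector(dim, nonzeros, rng) for c in contexts}
    vectors = {}
    for (word, context), n in pair_counts.items():
        row = vectors.setdefault(word, [0] * dim)
        for k, value in enumerate(index[context]):
            row[k] += n * value
    return vectors


pairs = {("cat", "the"): 2, ("cat", "sat"): 1, ("dog", "the"): 1}
vectors = random_indexing(pairs)
```

The output dimensionality is fixed by dim and does not grow with the number of distinct context words.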
