languagechange.models.representation package

Submodules

languagechange.models.representation.alignment module

class languagechange.models.representation.alignment.OrthogonalProcrustes(savepath1, savepath2)[source]

Bases: object

A class to align word embeddings using the Orthogonal Procrustes method.

This method aligns two embedding spaces by finding an optimal orthogonal transformation.

Parameters:
  • savepath1 (str)

  • savepath2 (str)

align(model1, model2, encoding='utf-8', precision='fp32', cuda=False, batch_size=10000, seed=0, supervised=None, semi_supervised=None, identical=False, unsupervised=False, acl2018=False, aaai2018=None, acl2017=False, acl2017_seed=None, emnlp2016=None, init_dictionary=0, init_identical=True, init_numerals=False, init_unsupervised=False, unsupervised_vocab=0, normalize=['unit'], whiten=False, src_reweight=0, trg_reweight=0, src_dewhiten=None, trg_dewhiten=None, dim_reduction=0, orthogonal=True, unconstrained=False, self_learning=False, vocabulary_cutoff=0, direction='union', csls_neighborhood=0, threshold=1e-06, validation=None, stochastic_initial=0.1, stochastic_multiplier=2.0, stochastic_interval=50, log=None, verbose=False)[source]

Perform orthogonal alignment between two embedding models using a subprocess.

Parameters:
  • model1 (StaticModel) – The first static word embedding model to align.

  • model2 (StaticModel) – The second static word embedding model to align.

  • batch_size (int)

  • unsupervised_vocab (int)

  • src_reweight (float)

  • trg_reweight (float)

  • dim_reduction (int)

  • vocabulary_cutoff (int)

  • csls_neighborhood (int)

  • threshold (float)

  • stochastic_initial (float)

  • stochastic_multiplier (float)

  • stochastic_interval (int)
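The class above runs the alignment in a subprocess, so the following is only a minimal NumPy sketch of the Orthogonal Procrustes idea itself, not the library's implementation. It assumes two matrices whose rows hold vectors for the same words in the same order; the function name orthogonal_procrustes_sketch and the toy data are hypothetical.

```python
import numpy as np


def orthogonal_procrustes_sketch(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F."""
    # The minimizer is U @ Vt, where U, S, Vt is the SVD of X.T @ Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Toy data: Y is an exactly rotated copy of X, so W should recover Q.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Y = X @ Q
W = orthogonal_procrustes_sketch(X, Y)
```

Because W is constrained to be orthogonal, distances and cosine similarities inside the first space are preserved after mapping it onto the second.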

languagechange.models.representation.contextualized module

languagechange.models.representation.contextualized.generate_cache_key(target_usages)[source]

Generate a unique cache key based on the input data.
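How the key is actually computed is not stated here. One common approach, shown below purely as an assumption, is to hash a canonical serialization of the inputs. The function name cache_key_sketch and the (text, start, end) representation of a usage are hypothetical.

```python
import hashlib
import json


def cache_key_sketch(usages):
    """Hash a list of (text, start, end) tuples into a stable hex digest."""
    # json.dumps yields the same string for equal inputs, so equal inputs
    # always map to the same key.
    payload = json.dumps(usages, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = cache_key_sketch([("The bank raised its rates", 4, 8)])
key_b = cache_key_sketch([("The bank raised its rates", 4, 8)])
key_c = cache_key_sketch([("She sat on the river bank", 21, 25)])
```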

class languagechange.models.representation.contextualized.ContextualizedModel(device='cuda', n_extra_tokens=0, cache_dir='~/.cache/languagechange/contextualized', *args, **kwargs)[source]

Bases: object

Abstract base class for contextualized embedding models.

Parameters:
  • device (str)

  • n_extra_tokens (int)

device

The device to run the model on (‘cuda’ or ‘cpu’).

Type:

str

n_extra_tokens

Additional tokens to consider during encoding.

Type:

int

abstractmethod encode(target_usages, batch_size=8)[source]

Encode target usages to generate embeddings.

Parameters:
  • target_usages (Union[TargetUsage, List[TargetUsage]]) – Usage data to encode.

  • batch_size (int) – Batch size for encoding. Defaults to 8.

Returns:

Encoded embeddings.

Return type:

np.array

Raises:
  • ValueError – If batch_size is not an integer.

  • ValueError – If target_usages is not a valid type.
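The input handling described above (a single usage or a list, an integer batch size, ValueError otherwise) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: UsageSketch is a hypothetical stand-in for TargetUsage, and make_batches is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class UsageSketch:
    """Hypothetical stand-in for TargetUsage: a text plus target offsets."""
    text: str
    start: int
    end: int


def make_batches(
    target_usages: Union[UsageSketch, List[UsageSketch]], batch_size: int = 8
) -> List[List[UsageSketch]]:
    """Validate the arguments and split the usages into batches."""
    if not isinstance(batch_size, int):
        raise ValueError("batch_size must be an integer")
    if isinstance(target_usages, UsageSketch):
        target_usages = [target_usages]  # accept a single usage as well
    elif not (
        isinstance(target_usages, list)
        and all(isinstance(u, UsageSketch) for u in target_usages)
    ):
        raise ValueError("target_usages must be a usage or a list of usages")
    return [
        target_usages[i:i + batch_size]
        for i in range(0, len(target_usages), batch_size)
    ]


usages = [UsageSketch("a b c", 2, 3) for _ in range(5)]
batches = make_batches(usages, batch_size=2)
```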

class languagechange.models.representation.contextualized.ContextualizedEmbeddings[source]

Bases: object

Class to manage contextualized embeddings.

static from_usages(target_usages, raw_embedding)[source]

class languagechange.models.representation.contextualized.XL_LEXEME(pretrained_model='pierluigic/xl-lexeme', device='cuda', n_extra_tokens=0)[source]

Bases: ContextualizedModel

Contextualized model for XL-LEXEME embeddings.

Parameters:
  • pretrained_model (str)

  • device (str)

  • n_extra_tokens (int)

encode(target_usages, batch_size=8)[source]

Encode target usages with XL_LEXEME model.

Parameters:
  • target_usages (Union[TargetUsage, List[TargetUsage]]) – Usage data to encode.

  • batch_size (int) – Batch size for encoding. Defaults to 8.

Returns:

Encoded embeddings.

Return type:

np.array

class languagechange.models.representation.contextualized.BERT(pretrained_model, device='cuda', n_extra_tokens=2)[source]

Bases: ContextualizedModel

Contextualized model for BERT embeddings.

Parameters:
  • pretrained_model (str)

  • device (str)

  • n_extra_tokens (int)

split_context(target_usage)[source]

Split the target usage into left, target, and right context tokens.

Parameters:

target_usage (TargetUsage) – The usage data.

Returns:

Tokenized left, target, and right context.

Return type:

Tuple[List[str], List[str], List[str]]
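A rough sketch of the three-way split, assuming the usage stores the sentence text together with the character offsets of the target. UsageSketch and split_context_sketch are hypothetical names, and whitespace splitting stands in for the model's subword tokenizer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UsageSketch:
    """Hypothetical stand-in for TargetUsage."""
    text: str
    start: int  # character offset where the target begins
    end: int    # character offset just past the target


def split_context_sketch(
    usage: UsageSketch,
) -> Tuple[List[str], List[str], List[str]]:
    """Split the text into left context, target, and right context tokens."""
    left = usage.text[:usage.start].split()
    target = usage.text[usage.start:usage.end].split()
    right = usage.text[usage.end:].split()
    return left, target, right


usage = UsageSketch("The bank raised its rates", 4, 8)
left, target, right = split_context_sketch(usage)
```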

center_usage(left_tokens, target_tokens, right_tokens)[source]

Adjust tokens to fit within the model’s maximum sequence length.

Parameters:
  • left_tokens (List[str]) – Tokens from left context.

  • target_tokens (List[str]) – Tokens from target usage.

  • right_tokens (List[str]) – Tokens from right context.

Returns:

Trimmed left, target, and right tokens.

Return type:

Tuple[List[str], List[str], List[str]]
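One way to trim a usage so the target stays roughly centered is sketched below. The exact trimming policy of center_usage is not documented here, so the balancing rule, the max_len argument, and the name center_tokens_sketch are all assumptions.

```python
from typing import List, Tuple


def center_tokens_sketch(
    left: List[str], target: List[str], right: List[str], max_len: int
) -> Tuple[List[str], List[str], List[str]]:
    """Trim the contexts so the total token count is at most max_len."""
    budget = max_len - len(target)
    if budget < 0:
        raise ValueError("target alone exceeds max_len")
    # Start from an even split, then hand unused budget to the other side.
    n_left = min(len(left), budget // 2)
    n_right = min(len(right), budget - n_left)
    n_left = min(len(left), budget - n_right)
    # Keep the tokens closest to the target on both sides.
    return left[len(left) - n_left:], target, right[:n_right]


left = [f"l{i}" for i in range(10)]
right = [f"r{i}" for i in range(10)]
trimmed = center_tokens_sketch(left, ["bank"], right, max_len=7)
```

In practice max_len would be the model's maximum sequence length minus the number of special tokens to be added afterwards.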

add_special_tokens(left_tokens, target_tokens, right_tokens)[source]

Add special tokens to the tokenized sequences.

Parameters:
  • left_tokens (List[str]) – Left context tokens.

  • target_tokens (List[str]) – Target tokens.

  • right_tokens (List[str]) – Right context tokens.

Returns:

Tokenized sequences with special tokens.

Return type:

Tuple[List[str], List[str], List[str]]
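For BERT-style models the two extra tokens are typically a leading [CLS] and a trailing [SEP], which is consistent with the default n_extra_tokens=2. The sketch below assumes exactly that layout; the helper name is hypothetical and other checkpoints may use different token strings.

```python
from typing import List, Tuple


def add_special_tokens_sketch(
    left: List[str],
    target: List[str],
    right: List[str],
    cls_token: str = "[CLS]",
    sep_token: str = "[SEP]",
) -> Tuple[List[str], List[str], List[str]]:
    """Prepend cls_token to the left context and append sep_token to the right."""
    return [cls_token] + left, list(target), right + [sep_token]


left, target, right = add_special_tokens_sketch(["The"], ["bank"], ["closed"])
# The target's position in the full sequence is the length of the left part.
target_index = len(left)
sequence = left + target + right
```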

process_input_tokens(tokens)[source]

Convert tokens to input IDs and attention masks for the model.

Parameters:

tokens (List[str]) – Tokens to be processed.

Returns:

Input IDs, attention masks, and token type IDs.

Return type:

dict[str, Union[list[int], Any]]

batch_encode(target_usages)[source]

Encode a batch of target usages and generate embeddings.

Parameters:

target_usages (List[TargetUsage]) – List of target usages.

Returns:

Batch of encoded embeddings.

Return type:

np.array

encode(target_usages, batch_size=8)[source]

Encode target usages in batches.

Parameters:
  • target_usages (Union[TargetUsage, List[TargetUsage]]) – List of target usages.

  • batch_size (int) – Batch size for encoding. Defaults to 8.

Returns:

Array of encoded embeddings.

Return type:

np.array

class languagechange.models.representation.contextualized.RoBERTa(pretrained_model, device='cuda', n_extra_tokens=2)[source]

Bases: BERT

Contextualized model for RoBERTa embeddings, inheriting from BERT.

Parameters:
  • pretrained_model (str)

  • device (str)

  • n_extra_tokens (int)

languagechange.models.representation.definition module

class languagechange.models.representation.definition.Message[source]

Bases: TypedDict

role: Literal['system', 'user']

content: str

class languagechange.models.representation.definition.DefinitionGenerator(embedding_model='all-mpnet-base-v2')[source]

Bases: object

Parameters:

embedding_model (str)

encode_definitions(definitions, encode='both')[source]

class languagechange.models.representation.definition.LlamaDefinitionGenerator(model_name, ft_model_name, hf_token, max_length=512, batch_size=32, max_time=4.5, temperature=1e-05, embedding_model='all-mpnet-base-v2', torch_dtype=torch.float16, low_cpu_mem_usage=False)[source]

Bases: DefinitionGenerator

A tool to create short, clear definitions for words based on example sentences using fine-tuned Llama models.

Parameters:
  • model_name (str)

  • ft_model_name (str)

  • hf_token (str)

  • max_length (int)

  • batch_size (int)

  • max_time (float)

  • temperature (float)

  • embedding_model (str)

model_name

Name of the base model (e.g., “meta-llama/Llama-2-7b-chat-hf”).

Type:

str

ft_model_name

Name of the fine-tuned model (e.g., “FrancescoPeriti/Llama2Dictionary”).

Type:

str

hf_token

Hugging Face token for authentication.

Type:

str

max_length

Maximum token length for the prompt.

Type:

int

batch_size

How many examples to process at once.

Type:

int

max_time

Maximum time (in seconds) allowed per batch.

Type:

float

temperature

Generation temperature.

Type:

float

model

The loaded fine-tuned model ready for generation.

tokenizer

The tokenizer that prepares text for the model.

eos_tokens

Tokens that signal the end of a definition.

apply_chat_template(dataset, system_message, template)[source]

tokenize(dataset)[source]

extract_definition(answer)[source]

Extracts the actual definition from the model’s response based on model type.

Parameters:

answer (str) – The text generated by the model.

Returns:

A cleaned-up definition string with proper formatting.

Return type:

str

Notes

  • For Llama-2: Extracts text after ‘[/INST]’.

  • For Llama-3: Takes the last line.

  • Warns if output appears abnormal.
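The two extraction rules in the notes can be sketched as follows. This is an illustration, not the library's code: the function name and the llama_version argument are hypothetical, and skipping blank lines before taking the last line is an added assumption.

```python
def extract_definition_sketch(answer: str, llama_version: int = 2) -> str:
    """Pull the definition out of a raw model response."""
    if llama_version == 2:
        # Llama-2 chat output: the definition follows the closing [/INST] tag.
        text = answer.split("[/INST]")[-1]
    else:
        # Llama-3 style: take the last non-empty line of the response.
        lines = [line for line in answer.splitlines() if line.strip()]
        text = lines[-1] if lines else ""
    return text.strip()


llama2_answer = "[INST] Define 'bank' in: The bank closed. [/INST] A financial institution."
llama3_answer = "user\nDefine 'bank'.\nassistant\nA financial institution.\n"
definition_2 = extract_definition_sketch(llama2_answer, llama_version=2)
definition_3 = extract_definition_sketch(llama3_answer, llama_version=3)
```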

generate_definitions(target_usages, system_message='You are a lexicographer familiar with providing concise definitions of word meanings.', template='Please provide a concise definition for the meaning of the word "{}" in the following sentence: {}', encode_definitions=None)[source]

Generates definitions for all examples in batches using the model.

Parameters:
  • target_usages (List[TargetUsage]) – A list of TargetUsage objects.

  • system_message (str) – The system prompt message.

  • template (str) – The template for the user prompt. The default template has two positional {} placeholders, the first for the target word and the second for the example sentence.

  • encode_definitions (str)

Returns:

Generated definitions corresponding to each TargetUsage, as text, sentence embeddings, or both.

Return type:

Union[List[str], List[np.ndarray], Tuple[List[str], List[np.ndarray]]]
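To illustrate how the default system message and template from the signature combine into chat messages, here is a small sketch. MessageSketch mirrors the documented Message fields but is defined locally, and build_messages_sketch is a hypothetical helper; how the library assembles its prompts internally may differ.

```python
from typing import List, Literal, TypedDict


class MessageSketch(TypedDict):
    """Local mirror of the documented Message fields (role and content)."""
    role: Literal["system", "user"]
    content: str


SYSTEM_MESSAGE = (
    "You are a lexicographer familiar with providing concise definitions "
    "of word meanings."
)
TEMPLATE = (
    'Please provide a concise definition for the meaning of the word "{}" '
    "in the following sentence: {}"
)


def build_messages_sketch(target: str, example: str) -> List[MessageSketch]:
    """Fill the template positionally: target word first, then the sentence."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": TEMPLATE.format(target, example)},
    ]


messages = build_messages_sketch("bank", "The bank raised its rates.")
```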

print_results(target_usages, definitions)[source]

Displays the target word, example sentence, and generated definition for each entry.

Parameters:
  • target_usages (List[TargetUsage]) – List of TargetUsage objects.

  • definitions (List[str]) – List of generated definitions.

Return type:

None

run(target_usages, system_message='You are a lexicographer familiar with providing concise definitions of word meanings.', template='Please provide a concise definition for the meaning of the word "{}" in the following sentence: {}')[source]

Executes the complete workflow from generating definitions to printing the results.

Parameters:
  • target_usages (List[TargetUsage]) – List of TargetUsage objects.

  • system_message (str) – The system prompt message.

  • template (str) – The template for the user prompt.

Return type:

None

class languagechange.models.representation.definition.DefinitionOutput(*args, **kwargs)[source]

Bases: BaseModel

Represents the structured output for a word definition.

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

Any

target

The target word.

Type:

str

example

The example sentence.

Type:

str

definition

The concise definition of the target word as used in the sentence.

Type:

str

class languagechange.models.representation.definition.ChatModelDefinitionGenerator(model_name, model_provider, langsmith_key=None, provider_key_name=None, provider_key=None, language=None, embedding_model='all-mpnet-base-v2')[source]

Bases: DefinitionGenerator

A model to generate concise definitions for target words using a chat model with structured output.

The model leverages an underlying chat model (initialized with LangChain) that returns a structured DefinitionOutput.

Parameters:
  • model_name (str)

  • model_provider (str)

  • langsmith_key (str)

  • provider_key_name (str)

  • provider_key (str)

  • language (str)

  • embedding_model (str)

generate_definitions(target_usages, user_prompt_template="Please provide a concise definition for the meaning of the word '{target}' as used in the following sentence:\nSentence: {example}", encode_definitions=None)[source]

Generates definitions for each TargetUsage using a chat model.

Parameters:
  • target_usages (List[TargetUsage]) – List of target usages.

  • user_prompt_template (str) – Template for the user prompt.

  • encode_definitions (str)

Returns:

Generated definitions corresponding to each TargetUsage, as text, sentence embeddings, or both.

Return type:

Union[List[str], List[np.ndarray], Tuple[List[str], List[np.ndarray]]]

class languagechange.models.representation.definition.T5DefinitionGenerator(model_path, bsize=4, max_length=256, filter_target=True, sampling=False, temperature=1.0, repetition_penalty=1.0, num_beams=1, num_beam_groups=1)[source]

Bases: DefinitionGenerator

Generates word definitions using a T5 model.

load_data(path_to_data, split='test')[source]

Load and preprocess data from a file or directory.

Parameters:
  • path_to_data (str) – Path to input file or directory.

  • split (str) – Data split (e.g., ‘test’, ‘trial’). Defaults to ‘test’.

Returns:

Preprocessed data with ‘Targets’ and ‘Real_Contexts’ columns.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If the specified file or directory doesn’t exist.

encode_definitions(df, encode='both', return_df=False)[source]

generate_definitions(target_usage_list=None, df=None, prompt_index=8, encode_definitions=None, return_df=False)[source]

Generate definitions for words in the DataFrame.

Parameters:
  • df (pd.DataFrame) – Data with ‘Targets’ and ‘Context’ or ‘Real_Contexts’ columns.

  • prompt_index (int) – Index of the prompt template to use. Defaults to 8.

  • encode_definitions (str)

Returns:

Updated DataFrame with ‘Generated_Definition’ column.

Return type:

pd.DataFrame

Raises:

ValueError – If prompt_index is invalid or required columns are missing.

save_definitions(df, output_file)[source]

Save the DataFrame with definitions to a TSV file.

languagechange.models.representation.prompting module

class languagechange.models.representation.prompting.SCFloat(*args, **kwargs)[source]

Bases: BaseModel

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

Any

class languagechange.models.representation.prompting.SCDURel(*args, **kwargs)[source]

Bases: BaseModel

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

Any

class languagechange.models.representation.prompting.PromptModel(model_name, model_provider, langsmith_key=None, provider_key_name=None, provider_key=None, structure='float', language=None, **kwargs)[source]

Bases: object

Parameters:
  • model_name (str)

  • model_provider (str)

  • langsmith_key (str)

  • provider_key_name (str)

  • provider_key (str)

  • structure (str | pydantic.BaseModel)

  • language (str)

get_response(target_usages, system_message='You are a lexicographer', user_prompt_template="Please provide a number measuring how different the meaning of the word '{target}' is between the following example sentences: \n1. {usage_1}\n2. {usage_2}", lemmatize=True)[source]

Take two target usages and return the degree of semantic change between them, using a chat model with structured output.

Parameters:
  • target_usages (List[TargetUsage]) – A list of target usages with the same target word.

  • system_message (str) – The system message to use in the prompt.

  • user_prompt_template (str) – Template to use for the user message in the prompt.

  • lemmatize (bool) – Whether the target word should be lemmatized in the prompt. Uses trankit to lemmatize.

Returns:

The degree of semantic change between the two instances of the target word, or the whole message content if the output is not structured.

Return type:

int or float or str
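The default user prompt template uses the named placeholders {target}, {usage_1} and {usage_2}. The sketch below only shows how such a template is filled with str.format; the example sentences are made up, and the call to the chat model itself is omitted.

```python
# Default template copied from the get_response signature.
USER_PROMPT_TEMPLATE = (
    "Please provide a number measuring how different the meaning of the word "
    "'{target}' is between the following example sentences: "
    "\n1. {usage_1}\n2. {usage_2}"
)

# Hypothetical inputs: two sentences containing the same target word.
prompt = USER_PROMPT_TEMPLATE.format(
    target="bank",
    usage_1="The bank raised its rates.",
    usage_2="She sat on the river bank.",
)
prompt_lines = prompt.split("\n")
```

A custom template passed as user_prompt_template would presumably need the same three placeholder names to be filled this way.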

languagechange.models.representation.static module

class languagechange.models.representation.static.RepresentationModel[source]

Bases: ABC

Abstract base class for all representation models. Provides a template for encoding methods.

abstractmethod encode(*args, **kwargs)[source]

Abstract method for encoding data into a vector representation. Should be implemented by subclasses.

class languagechange.models.representation.static.StaticModel(matrix_path=None, format='w2v')[source]

Bases: RepresentationModel, dict

Base class for static word embedding models. Manages loading and accessing vector spaces.

matrix_path

Path to the matrix file.

Type:

str

format

Format of the matrix file (e.g., ‘w2v’, ‘npz’).

Type:

str

abstractmethod encode()[source]

Abstract method to perform encoding operations. Must be implemented in subclasses.

abstractmethod load()[source]

Load the vector space from the specified file.

matrix()[source]

Retrieve the entire matrix of word vectors.

Returns:

The matrix of word vectors.

Return type:

scipy.sparse.spmatrix

Raises:

Exception – If the space is not loaded.

row2word()[source]

Retrieve the mapping of row indices to words.

Returns:

List of words corresponding to matrix rows.

Return type:

list

Raises:

Exception – If the space is not loaded.

class languagechange.models.representation.static.CountModel(corpus, window_size, savepath)[source]

Bases: StaticModel

Count-based word embedding model that builds a co-occurrence matrix from a corpus.

corpus

The corpus to process.

Type:

LinebyLineCorpus

window_size

The size of the context window.

Type:

int

savepath

Path to save the generated matrix.

Type:

str

encode(is_len=False)[source]

Build a co-occurrence matrix from the corpus and save it to the specified path.
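As a rough illustration of what a window-based co-occurrence count looks like, the sketch below counts symmetric co-occurrences in a tokenized toy corpus. It is not the library's implementation: the function name is hypothetical, and a plain Counter stands in for the sparse matrix that is saved to disk.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cooccurrence_counts_sketch(
    sentences: Iterable[List[str]], window_size: int
) -> Counter:
    """Count (word, context) pairs within window_size tokens on either side."""
    counts: Counter = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own context
                    counts[(word, tokens[j])] += 1
    return counts


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = cooccurrence_counts_sketch(corpus, window_size=1)
```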

class languagechange.models.representation.static.PPMI(count_model, shifting_parameter, smoothing_parameter, savepath)[source]

Bases: CountModel

Positive Pointwise Mutual Information (PPMI) model that transforms a co-occurrence matrix.

count_model

The count-based model to transform.

Type:

CountModel

shifting_parameter

Parameter to shift values after applying log weighting.

Type:

int

smoothing_parameter

Parameter to smooth the matrix values.

Type:

int

savepath

Path to save the PPMI matrix.

Type:

str

encode(is_len=False)[source]

Compute the smoothed and shifted PPMI matrix from a co-occurrence matrix. Smoothing is performed as described in

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Trans. ACL, 3.
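A minimal dense NumPy sketch of shifted PPMI with context distribution smoothing, following the formulation in the cited paper. It is not the library's code: the mapping of k to shifting_parameter and alpha to smoothing_parameter is an assumption, and a real co-occurrence matrix would be sparse.

```python
import numpy as np


def shifted_ppmi_sketch(counts, alpha=0.75, k=1.0):
    """Return max(PMI - log(k), 0) with context counts raised to alpha."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    # Context distribution smoothing: raise context counts to the power alpha.
    ctx = counts.sum(axis=0, keepdims=True) ** alpha
    p_c = ctx / ctx.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    # Unseen pairs get -inf so that they become 0 after clipping.
    pmi = np.where(counts > 0, pmi, -np.inf)
    return np.maximum(pmi - np.log(k), 0.0)


toy_counts = [[2, 0], [0, 2]]
ppmi = shifted_ppmi_sketch(toy_counts, alpha=1.0, k=1.0)
shifted = shifted_ppmi_sketch(toy_counts, alpha=1.0, k=2.0)
```

With k greater than 1 every PMI value is lowered by log(k) before clipping, which is why the shifted toy matrix above ends up all zeros.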

class languagechange.models.representation.static.SVD(count_model, dimensionality, gamma, savepath)[source]

Bases: StaticModel

Singular Value Decomposition (SVD) model that reduces the dimensionality of a matrix.

count_model

The input count-based model.

Type:

CountModel

dimensionality

Target dimensionality for the reduced matrix.

Type:

int

gamma

Weighting parameter for singular values.

Type:

float

savepath

Path to save the reduced matrix.

Type:

str

encode(is_len=False)[source]

Perform dimensionality reduction on a (normally PPMI) matrix by applying truncated SVD as described in

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Trans. ACL, 3.
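The truncated SVD step with singular value weighting can be sketched in dense NumPy as below. This is an illustration under assumptions, not the library's implementation: the function name is hypothetical, and a sparse solver would normally be used on a large PPMI matrix.

```python
import numpy as np


def truncated_svd_sketch(matrix, dimensionality, gamma=1.0):
    """Return word vectors U_d * S_d**gamma from a truncated SVD."""
    U, S, _ = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    # Singular values are sorted in descending order, so keep the first d.
    U_d = U[:, :dimensionality]
    S_d = S[:dimensionality]
    # gamma = 1 keeps the usual weighting; gamma = 0 drops the singular values.
    return U_d * (S_d ** gamma)


toy_matrix = [[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
vectors = truncated_svd_sketch(toy_matrix, dimensionality=1, gamma=1.0)
unweighted = truncated_svd_sketch(toy_matrix, dimensionality=1, gamma=0.0)
```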

class languagechange.models.representation.static.RandomIndexing[source]

Bases: StaticModel

Random Indexing model that creates low-dimensional vector spaces from a co-occurrence matrix.

window_size

Size of the context window for random indexing.

Type:

int

encode(is_len=False)[source]

Create low-dimensional vector space by sparse random indexing from co-occurrence matrix.
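Sparse random indexing can be illustrated as follows: each context word receives a sparse random vector with a few +1/-1 entries, and a word's vector is the count-weighted sum of the index vectors of its contexts. The function name, the dim, nonzeros and seed arguments, and the dense arrays are all assumptions made for this sketch.

```python
import numpy as np


def random_indexing_sketch(counts, dim=8, nonzeros=2, seed=0):
    """Project a co-occurrence matrix onto sparse random index vectors."""
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    n_contexts = counts.shape[1]
    index_vectors = np.zeros((n_contexts, dim))
    for row in index_vectors:  # each row is a view, so it is filled in place
        positions = rng.choice(dim, size=nonzeros, replace=False)
        row[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
    # Each word vector is the weighted sum of its contexts' index vectors.
    return counts @ index_vectors


toy_counts = [[1, 0, 2], [1, 0, 2], [0, 3, 0]]
word_vectors = random_indexing_sketch(toy_counts, dim=8, nonzeros=2, seed=0)
```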

Module contents