languagechange.models.representation package¶
Submodules¶
languagechange.models.representation.alignment module¶
- class languagechange.models.representation.alignment.OrthogonalProcrustes(savepath1, savepath2)[source]¶
Bases:
object
A class to align word embeddings using the Orthogonal Procrustes method.
This method aligns two embedding spaces by finding an optimal orthogonal transformation.
- align(model1, model2, encoding='utf-8', precision='fp32', cuda=False, batch_size=10000, seed=0, supervised=None, semi_supervised=None, identical=False, unsupervised=False, acl2018=False, aaai2018=None, acl2017=False, acl2017_seed=None, emnlp2016=None, init_dictionary=0, init_identical=True, init_numerals=False, init_unsupervised=False, unsupervised_vocab=0, normalize=['unit'], whiten=False, src_reweight=0, trg_reweight=0, src_dewhiten=None, trg_dewhiten=None, dim_reduction=0, orthogonal=True, unconstrained=False, self_learning=False, vocabulary_cutoff=0, direction='union', csls_neighborhood=0, threshold=1e-06, validation=None, stochastic_initial=0.1, stochastic_multiplier=2.0, stochastic_interval=50, log=None, verbose=False)[source]¶
Perform orthogonal alignment between two embedding models using a subprocess.
- Parameters:
model1 (StaticModel) – The first static word embedding model to align.
model2 (StaticModel) – The second static word embedding model to align.
batch_size (int)
unsupervised_vocab (int)
src_reweight (float)
trg_reweight (float)
dim_reduction (int)
vocabulary_cutoff (int)
csls_neighborhood (int)
threshold (float)
stochastic_initial (float)
stochastic_multiplier (float)
stochastic_interval (int)
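The core idea, finding an optimal orthogonal transformation between two embedding spaces, can be illustrated independently of the library. The following is a minimal NumPy sketch of the closed-form Orthogonal Procrustes solution, not the implementation behind `align` (which runs in a subprocess and exposes many more options). The function name `orthogonal_procrustes` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def orthogonal_procrustes(source, target):
    """Return the orthogonal matrix W that minimizes ||source @ W - target||_F."""
    # SVD of the cross-covariance matrix between the two spaces.
    u, _, vt = np.linalg.svd(source.T @ target)
    # Dropping the singular values leaves the closest orthogonal map.
    return u @ vt

rng = np.random.default_rng(0)
source = rng.normal(size=(50, 4))          # toy "embedding" matrix, one row per word
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
target = source @ rotation                 # second space: an exact rotation of the first

w = orthogonal_procrustes(source, target)
```

Because the toy target is an exact rotation of the source, `w` recovers that rotation; with real embeddings the rows of both matrices would need to correspond to the same words.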
languagechange.models.representation.contextualized module¶
- languagechange.models.representation.contextualized.generate_cache_key(target_usages)[source]¶
Generate a unique cache key based on the input data.
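As a rough illustration of what a content-based cache key can look like, the sketch below hashes a serialized list of usage records. It is an assumption-laden example, not the library's `generate_cache_key`: the record layout (`text`, `start`, `end`), the name `cache_key_sketch`, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import json

def cache_key_sketch(usages):
    """Return a deterministic hex digest for a list of usage records."""
    # Serialize with sorted keys so equal records always produce equal bytes.
    payload = json.dumps(usages, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical usage record: sentence text plus target character offsets.
usages = [{"text": "The bank raised its rates.", "start": 4, "end": 8}]
key = cache_key_sketch(usages)
```

The same input always yields the same key, so previously computed embeddings can be looked up instead of being recomputed.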
- class languagechange.models.representation.contextualized.ContextualizedModel(device='cuda', n_extra_tokens=0, cache_dir='~/.cache/languagechange/contextualized', *args, **kwargs)[source]¶
Bases:
object
Abstract base class for contextualized embedding models.
- abstractmethod encode(target_usages, batch_size=8)[source]¶
Encode target usages to generate embeddings.
- Parameters:
target_usages (Union[TargetUsage, List[TargetUsage]]) – Usage data to encode.
batch_size (int) – Batch size for encoding. Defaults to 8.
- Returns:
Encoded embeddings.
- Return type:
np.array
- Raises:
ValueError – If batch_size is not an integer.
ValueError – If target_usages is not a valid type.
- class languagechange.models.representation.contextualized.ContextualizedEmbeddings[source]¶
Bases:
object
Class to manage contextualized embeddings.
- static from_usages(target_usages, raw_embedding)[source]¶
- Parameters:
target_usages (List[TargetUsage])
raw_embedding (numpy.array)
- class languagechange.models.representation.contextualized.XL_LEXEME(pretrained_model='pierluigic/xl-lexeme', device='cuda', n_extra_tokens=0)[source]¶
Bases:
ContextualizedModel
Contextualized model for XL-LEXEME embeddings.
- encode(target_usages, batch_size=8)[source]¶
Encode target usages with XL_LEXEME model.
- Parameters:
target_usages (Union[TargetUsage, List[TargetUsage]]) – Usage data to encode.
batch_size (int) – Batch size for encoding. Defaults to 8.
- Returns:
Encoded embeddings.
- Return type:
np.array
- class languagechange.models.representation.contextualized.BERT(pretrained_model, device='cuda', n_extra_tokens=2)[source]¶
Bases:
ContextualizedModel
Contextualized model for BERT embeddings.
- split_context(target_usage)[source]¶
Split the target usage into left, target, and right context tokens.
- Parameters:
target_usage (TargetUsage) – The usage data.
- Returns:
Tokenized left, target, and right context.
- Return type:
- center_usage(left_tokens, target_tokens, right_tokens)[source]¶
Adjust tokens to fit within the model’s maximum sequence length.
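One plausible way to fit a usage into a fixed token budget is to trim the left and right context so that the target stays roughly in the middle. The sketch below shows that idea only; the actual truncation rule of `center_usage` is not documented here, and `center_usage_sketch` and its `max_len` argument are hypothetical.

```python
def center_usage_sketch(left, target, right, max_len):
    """Trim left/right context so the whole sequence fits in max_len tokens."""
    budget = max_len - len(target)
    if budget < 0:
        raise ValueError("target alone exceeds max_len")
    half = budget // 2
    # If the right context is short, hand its unused room to the left side.
    keep_left = min(len(left), max(half, budget - len(right)))
    keep_right = min(len(right), budget - keep_left)
    # Keep the tokens closest to the target on both sides.
    return left[len(left) - keep_left:], target, right[:keep_right]

trimmed = center_usage_sketch(
    ["a", "b", "c", "d"], ["T"], ["e", "f", "g", "h"], max_len=5
)
```

Any room reserved for special tokens (compare `n_extra_tokens`) would have to be subtracted from `max_len` beforehand.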
- add_special_tokens(left_tokens, target_tokens, right_tokens)[source]¶
Add special tokens to the tokenized sequences.
- process_input_tokens(tokens)[source]¶
Convert tokens to input IDs and attention masks for the model.
- batch_encode(target_usages)[source]¶
Encode a batch of target usages and generate embeddings.
- Parameters:
target_usages (List[TargetUsage]) – List of target usages.
- Returns:
Batch of encoded embeddings.
- Return type:
np.array
- encode(target_usages, batch_size=8)[source]¶
Encode target usages in batches.
- Parameters:
target_usages (Union[TargetUsage, List[TargetUsage]]) – List of target usages.
batch_size (int) – Batch size for encoding. Defaults to 8.
- Returns:
Array of encoded embeddings.
- Return type:
np.array
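Batched encoding amounts to slicing the usage list into chunks of at most `batch_size` items and encoding each chunk in turn. The helper below sketches just the slicing step with plain Python lists; `iter_batches` is a hypothetical name, and the positivity check is an assumption added on top of the documented integer check.

```python
def iter_batches(items, batch_size=8):
    """Yield consecutive slices of items with at most batch_size elements."""
    if not isinstance(batch_size, int) or batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 19 placeholder items split into batches of 8, 8 and 3.
batches = list(iter_batches(list(range(19)), batch_size=8))
```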
languagechange.models.representation.definition module¶
- class languagechange.models.representation.definition.DefinitionGenerator(embedding_model='all-mpnet-base-v2')[source]¶
Bases:
object
- Parameters:
embedding_model (str)
- class languagechange.models.representation.definition.LlamaDefinitionGenerator(model_name, ft_model_name, hf_token, max_length=512, batch_size=32, max_time=4.5, temperature=1e-05, embedding_model='all-mpnet-base-v2', torch_dtype=torch.float16, low_cpu_mem_usage=False)[source]¶
Bases:
DefinitionGenerator
A tool to create short, clear definitions for words based on example sentences using fine-tuned Llama models.
- Parameters:
- model¶
The loaded fine-tuned model ready for generation.
- tokenizer¶
The tokenizer that prepares text for the model.
- eos_tokens¶
Tokens that signal the end of a definition.
- extract_definition(answer)[source]¶
Extracts the actual definition from the model’s response based on model type.
- Parameters:
answer (str) – The text generated by the model.
- Returns:
A cleaned-up definition string with proper formatting.
- Return type:
str
Notes
For Llama-2: Extracts text after ‘[/INST]’.
For Llama-3: Takes the last line.
Warns if output appears abnormal.
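The two extraction rules in the notes can be sketched as plain string handling. This is an illustration of the described behavior, not the library's code: the function name, the `model_family` argument, and the choice to skip blank trailing lines are assumptions, and the warning for abnormal output is omitted.

```python
def extract_definition_sketch(answer, model_family):
    """Pull the definition text out of a raw model response."""
    if model_family == "llama-2":
        # Keep only the text that follows the last '[/INST]' tag.
        text = answer.rsplit("[/INST]", 1)[-1]
    elif model_family == "llama-3":
        # Keep the last non-empty line of the response (assumption: blanks are skipped).
        lines = [line for line in answer.splitlines() if line.strip()]
        text = lines[-1] if lines else ""
    else:
        raise ValueError(f"unknown model family: {model_family!r}")
    return text.strip()

llama2 = extract_definition_sketch(
    "[INST] Define bank [/INST] A financial institution.", "llama-2"
)
llama3 = extract_definition_sketch("assistant\n\nA financial institution.\n", "llama-3")
```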
- generate_definitions(target_usages, system_message='You are a lexicographer familiar with providing concise definitions of word meanings.', template='Please provide a concise definition for the meaning of the word "{}" in the following sentence: {}', encode_definitions=None)[source]¶
Generates definitions for all examples in batches using the model.
- Parameters:
target_usages (List[TargetUsage]) – A list of TargetUsage objects.
system_message (str) – The system prompt message.
template (str) – The template for the user prompt with placeholders {target} and {example}.
encode_definitions (str)
- Returns:
Generated definitions corresponding to each TargetUsage, as text, sentence embeddings, or both.
- Return type:
Union[List[str], List[np.ndarray], Tuple[List[str], List[np.ndarray]]]
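The default `template` in the signature uses positional `{}` placeholders, while the parameter description mentions `{target}` and `{example}`. The sketch below shows how Python's `str.format` fills each form; which form the library expects for custom templates is not confirmed here, and the example word and sentence are made up.

```python
# Default template as listed in the signature: two positional placeholders.
template = (
    'Please provide a concise definition for the meaning of the word "{}" '
    "in the following sentence: {}"
)
prompt = template.format("bank", "She sat on the bank of the river.")

# Named-placeholder variant (hypothetical custom template).
named_template = 'Define "{target}" as used in: {example}'
named_prompt = named_template.format(
    target="bank", example="She sat on the bank of the river."
)
```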
- print_results(target_usages, definitions)[source]¶
Displays the target word, example sentence, and generated definition for each entry.
- Parameters:
target_usages (List[TargetUsage]) – List of TargetUsage objects.
definitions (List[str]) – List of generated definitions.
- Return type:
None
- run(target_usages, system_message='You are a lexicographer familiar with providing concise definitions of word meanings.', template='Please provide a concise definition for the meaning of the word "{}" in the following sentence: {}')[source]¶
Executes the complete workflow from generating definitions to printing the results.
- Parameters:
target_usages (List[TargetUsage]) – List of TargetUsage objects.
system_message (str) – The system prompt message.
template (str) – The template for the user prompt.
- Return type:
None
- class languagechange.models.representation.definition.DefinitionOutput(*args, **kwargs)[source]¶
Bases:
BaseModel
Represents the structured output for a word definition.
- Parameters:
args (Any)
kwargs (Any)
- Return type:
Any
- class languagechange.models.representation.definition.ChatModelDefinitionGenerator(model_name, model_provider, langsmith_key=None, provider_key_name=None, provider_key=None, language=None, embedding_model='all-mpnet-base-v2')[source]¶
Bases:
DefinitionGenerator
A model to generate concise definitions for target words using a chat model with structured output.
The model leverages an underlying chat model (initialized with LangChain) that returns a structured DefinitionOutput.
- Parameters:
- generate_definitions(target_usages, user_prompt_template="Please provide a concise definition for the meaning of the word '{target}' as used in the following sentence:\nSentence: {example}", encode_definitions=None)[source]¶
Generates definitions for each TargetUsage using a chat model.
- Parameters:
target_usages (List[TargetUsage]) – List of target usages.
user_prompt_template (str) – Template for the user prompt.
encode_definitions (str)
- Returns:
Generated definitions corresponding to each TargetUsage, as text, sentence embeddings, or both.
- Return type:
Union[List[str], List[np.ndarray], Tuple[List[str], List[np.ndarray]]]
- class languagechange.models.representation.definition.T5DefinitionGenerator(model_path, bsize=4, max_length=256, filter_target=True, sampling=False, temperature=1.0, repetition_penalty=1.0, num_beams=1, num_beam_groups=1)[source]¶
Bases:
DefinitionGenerator
Generates word definitions using a T5 model.
- load_data(path_to_data, split='test')[source]¶
Load and preprocess data from a file or directory.
- Parameters:
- Returns:
Preprocessed data with ‘Targets’ and ‘Real_Contexts’ columns.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the specified file or directory doesn’t exist.
- generate_definitions(target_usage_list=None, df=None, prompt_index=8, encode_definitions=None, return_df=False)[source]¶
Generate definitions for words in the DataFrame.
- Parameters:
- Returns:
Updated DataFrame with ‘Generated_Definition’ column.
- Return type:
pd.DataFrame
- Raises:
ValueError – If prompt_index is invalid or required columns are missing.
languagechange.models.representation.prompting module¶
- class languagechange.models.representation.prompting.SCFloat(*args, **kwargs)[source]¶
Bases:
BaseModel
- Parameters:
args (Any)
kwargs (Any)
- Return type:
Any
- class languagechange.models.representation.prompting.SCDURel(*args, **kwargs)[source]¶
Bases:
BaseModel
- Parameters:
args (Any)
kwargs (Any)
- Return type:
Any
- class languagechange.models.representation.prompting.PromptModel(model_name, model_provider, langsmith_key=None, provider_key_name=None, provider_key=None, structure='float', language=None, **kwargs)[source]¶
Bases:
object
- Parameters:
- get_response(target_usages, system_message='You are a lexicographer', user_prompt_template="Please provide a number measuring how different the meaning of the word '{target}' is between the following example sentences: \n1. {usage_1}\n2. {usage_2}", lemmatize=True)[source]¶
Takes as input two target usages and returns the degree of semantic change between them, using a chat model with structured output.
- Parameters:
target_usages (List[TargetUsage]) – A list of target usages with the same target word.
system_message (str) – The system message to use in the prompt.
user_prompt_template (str) – Template to use for the user message in the prompt.
lemmatize (bool) – Whether the target word should be lemmatized in the prompt or not. Uses trankit to lemmatize.
- Returns:
The degree of semantic change between the two instances of the target word, or the whole message content if the output is not structured.
- Return type:
languagechange.models.representation.static module¶
- class languagechange.models.representation.static.RepresentationModel[source]¶
Bases:
ABC
Abstract base class for all representation models. Provides a template for encoding methods.
- class languagechange.models.representation.static.StaticModel(matrix_path=None, format='w2v')[source]¶
Bases:
RepresentationModel, dict
Base class for static word embedding models. Manages loading and accessing vector spaces.
- abstractmethod encode()[source]¶
Abstract method to perform encoding operations. Must be implemented in subclasses.
- class languagechange.models.representation.static.CountModel(corpus, window_size, savepath)[source]¶
Bases:
StaticModel
Count-based word embedding model that builds a co-occurrence matrix from a corpus.
- Parameters:
corpus (LinebyLineCorpus)
window_size (int)
savepath (str)
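To make the role of `window_size` concrete, the sketch below counts how often each word pair co-occurs within a symmetric window. It is a simplified stand-in, not the `CountModel` implementation: it works on in-memory token lists rather than a `LinebyLineCorpus`, and `cooccurrence_counts` is a hypothetical name.

```python
from collections import Counter

def cooccurrence_counts(sentences, window_size):
    """Count (word, context) pairs within window_size tokens on either side."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own context
                    counts[(word, tokens[j])] += 1
    return counts

sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = cooccurrence_counts(sentences, window_size=1)
```

With a window of 1, "the" and "sat" never co-occur, while "the" and "cat" co-occur once per sentence.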
- corpus¶
The corpus to process.
- Type:
- class languagechange.models.representation.static.PPMI(count_model, shifting_parameter, smoothing_parameter, savepath)[source]¶
Bases:
CountModel
Positive Pointwise Mutual Information (PPMI) model that transforms a co-occurrence matrix.
- Parameters:
count_model (CountModel)
shifting_parameter (int)
smoothing_parameter (int)
savepath (str)
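A common formulation of shifted, smoothed PPMI subtracts the log of a shift value from the PMI score and raises context counts to a smoothing exponent before normalizing. The NumPy sketch below follows that textbook formulation; whether `shifting_parameter` and `smoothing_parameter` are applied in exactly this way is an assumption, as are the names `ppmi`, `shift`, and `alpha`. It also assumes every row and column has at least one non-zero count.

```python
import numpy as np

def ppmi(counts, shift=1, alpha=1.0):
    """Shifted PPMI: max(PMI(w, c) - log(shift), 0) with smoothed context counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    # Context distribution smoothed by raising column sums to the power alpha.
    ctx = counts.sum(axis=0, keepdims=True) ** alpha
    p_c = ctx / ctx.sum()
    with np.errstate(divide="ignore"):  # log(0) becomes -inf, clipped to 0 below
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi - np.log(shift), 0.0)

matrix = ppmi([[2, 0], [0, 2]])
```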
- count_model¶
The count-based model to transform.
- Type:
- class languagechange.models.representation.static.SVD(count_model, dimensionality, gamma, savepath)[source]¶
Bases:
StaticModel
Singular Value Decomposition (SVD) model that reduces the dimensionality of a matrix.
- Parameters:
count_model (CountModel)
dimensionality (int)
gamma (float)
savepath (str)
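In count-based embedding pipelines, a truncated SVD is often combined with an exponent on the singular values, so that the reduced vectors are the leading left singular vectors scaled by the singular values raised to that exponent. The sketch below assumes `gamma` plays this role and uses a dense toy matrix; it is not the library's `SVD` class, and `svd_embed` is a hypothetical name.

```python
import numpy as np

def svd_embed(matrix, dimensionality, gamma):
    """Return the top-`dimensionality` left singular vectors scaled by s ** gamma."""
    u, s, _ = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    # gamma = 1 keeps the full singular-value weighting; gamma = 0 removes it.
    return u[:, :dimensionality] * s[:dimensionality] ** gamma

matrix = np.array([[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
vectors = svd_embed(matrix, dimensionality=2, gamma=0.0)
weighted = svd_embed(matrix, dimensionality=2, gamma=1.0)
```

For large sparse co-occurrence matrices a sparse solver would be the practical choice; the dense call here only keeps the example short.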
- count_model¶
The input count-based model.
- Type:
- class languagechange.models.representation.static.RandomIndexing[source]¶
Bases:
StaticModel
Random Indexing model that creates low-dimensional vector spaces from a co-occurrence matrix.