TrillionDollarWords
Documentation for TrillionDollarWords.
TrillionDollarWords.BaselineModel
TrillionDollarWords.get_embeddings
TrillionDollarWords.layerwise_activations
TrillionDollarWords.load_all_data
TrillionDollarWords.load_all_sentences
TrillionDollarWords.load_cpi_data
TrillionDollarWords.load_market_data
TrillionDollarWords.load_model
TrillionDollarWords.load_ppi_data
TrillionDollarWords.load_training_sentences
TrillionDollarWords.load_ust_data
TrillionDollarWords.prepare_probe
TrillionDollarWords.BaselineModel (Type)
Struct for the baseline model (i.e. the model presented in the paper).
TrillionDollarWords.BaselineModel (Method)
`(mod::BaselineModel)(atomic_model::HGFRobertaForSequenceClassification, queries::Vector{String})`
Computes a forward pass of the model on the given queries and returns the logits.
TrillionDollarWords.BaselineModel (Method)
`(mod::BaselineModel)(atomic_model::HGFRobertaModel, queries::Vector{String})`
Computes a forward pass of the model on the given queries and returns the embeddings.
TrillionDollarWords.BaselineModel (Method)
`(mod::BaselineModel)(queries::Vector{String})`
Computes a forward pass of the model on the given queries. Returns the logits if the model was loaded with the classification head, and the embeddings otherwise.
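As a minimal sketch of the callable form above, the snippet below loads the model with its classification head and calls it on a vector of strings. It assumes `load_head` is passed as a keyword argument (as described under `load_model`); the example sentences are made up for illustration and are not taken from the dataset.

```julia
using TrillionDollarWords

# Load the baseline model with the classification head attached.
# The model is fetched from HuggingFace, so this needs network access.
model = load_model(; load_head=true)

# Hypothetical example sentences (illustrative only).
queries = [
    "Inflation remains elevated and further tightening may be appropriate.",
    "The Committee decided to lower the target range for the federal funds rate.",
]

# With the head loaded, calling the model returns the logits.
logits = model(queries)
```

The variable is named `model` rather than `mod` only to avoid shadowing `Base.mod` in longer sessions.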
TrillionDollarWords.get_embeddings (Method)
`get_embeddings(mod::BaselineModel, queries::Vector{String})`
Computes a forward pass of the model on the given queries and returns the embeddings.
TrillionDollarWords.get_embeddings (Method)
`get_embeddings(atomic_model::HGFRobertaForSequenceClassification, tokens::NamedTuple)`
Extends the `embeddings` function to `HGFRobertaForSequenceClassification`. Performs a forward pass through the model and returns the embeddings. Then performs a forward pass through the classification head and returns the activations going into the final linear layer.
TrillionDollarWords.get_embeddings (Method)
`get_embeddings(atomic_model::HGFRobertaModel, tokens::NamedTuple)`
Extends the `embeddings` function to `HGFRobertaModel`.
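A short sketch of the high-level `get_embeddings(mod, queries)` method, assuming the model is loaded without the classification head via the `load_head` keyword. The query sentence is hypothetical.

```julia
using TrillionDollarWords

# Load the model without the classification head (see `load_model`).
model = load_model(; load_head=false)

# Hypothetical example sentence (illustrative only).
queries = ["The labor market remains tight."]

# Forward pass that returns the embeddings for the given queries.
emb = get_embeddings(model, queries)
```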
TrillionDollarWords.layerwise_activations (Method)
`layerwise_activations(mod::BaselineModel, queries::DataFrame)`
Computes a forward pass of the model on the given queries and returns the layerwise activations in a `DataFrame`, where activations are uniquely identified by the `sentence_id`. If `output_hidden_states=false` was passed to `load_model` (default), only the last layer is returned. If `output_hidden_states=true` was passed to `load_model`, all layers are returned. The `layer` column indicates the layer number.
Each single activation receives its own cell to make it possible to save the output to a CSV file.
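The following sketch shows one way the `DataFrame` method might be used. It assumes that the frame returned by `load_all_sentences()` can be passed directly as `queries`, since it carries both the `sentence_id` and `sentence` columns; only a few rows are used to keep the forward pass cheap.

```julia
using TrillionDollarWords

# Request all hidden states so that every layer is returned.
model = load_model(; load_head=false, output_hidden_states=true)

# Take the first three sentences of the full dataset as queries.
df = load_all_sentences()
queries = df[1:3, :]

# Long-format DataFrame: one cell per activation, keyed by sentence_id and layer.
acts = layerwise_activations(model, queries)
```

Because each activation sits in its own cell, `acts` grows quickly with the number of sentences and layers, so a small batch is a sensible starting point.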
TrillionDollarWords.layerwise_activations (Method)
`layerwise_activations(mod::BaselineModel, queries::Vector{String})`
Computes a forward pass of the model on the given queries and returns the layerwise activations for the `HGFRobertaModel`. If `output_hidden_states=false` was passed to `load_model` (default), only the last layer is returned. If `output_hidden_states=true` was passed to `load_model`, all layers are returned. If the model is loaded with the head for classification, the activations going into the final linear layer are returned.
TrillionDollarWords.load_all_data (Method)
`load_all_data()`
Load the combined dataset from the artifact. This dataset combines all sentences and the market data used in the paper.
- The `sentence_id` column is the unique identifier of the sentence.
- The `doc_id` column is the unique identifier of the document.
- The `date` column is the date of the event.
- The `event_type` column is the type of event (meeting minutes, speech, or press conference).
- The labels in `label` are predicted by the model proposed in the paper: the RoBERTa-large model fine-tuned on the combined data is used to label all the filtered sentences in the meeting minutes, speeches, and press conferences.
- The `sentence` column is the sentence itself.
- The `score` column is the softmax probability of the label.
- The `speaker` column is the speaker of the sentence (if applicable).
- The `value` column is the value of the market indicator (CPI, PPI, or UST).
- The `indicator` column is the market indicator (CPI, PPI, or UST).
- The `maturity` column is the maturity of the UST (if applicable).
TrillionDollarWords.load_all_sentences (Method)
`load_all_sentences()`
Load the dataset with all sentences from the artifact. This is the complete dataset with sentences from press conferences, meeting minutes, and speeches.
- The `sentence_id` column is the unique identifier of the sentence.
- The `doc_id` column is the unique identifier of the document.
- The `date` column is the date of the event.
- The `event_type` column is the type of event (meeting minutes, speech, or press conference).
- The labels in `label` are predicted by the model proposed in the paper: the RoBERTa-large model fine-tuned on the combined data is used to label all the filtered sentences in the meeting minutes, speeches, and press conferences.
- The `sentence` column is the sentence itself.
- The `score` column is the softmax probability of the label.
- The `speaker` column is the speaker of the sentence (if applicable).
TrillionDollarWords.load_cpi_data (Method)
`load_cpi_data()`
Load the CPI data from the artifact. This is the CPI data used in the paper.
- The `date` column is the date of the event.
- The `value` column is the value of the market indicator (CPI, PPI, or UST).
- The `indicator` column is the market indicator (CPI, PPI, or UST).
TrillionDollarWords.load_market_data (Method)
`load_market_data()`
Load the combined market data from the artifact. This dataset combines the CPI, PPI and UST data used in the paper.
- The `date` column is the date of the event.
- The `value` column is the value of the market indicator (CPI, PPI, or UST).
- The `indicator` column is the market indicator (CPI, PPI, or UST).
- The `maturity` column is the maturity of the UST (if applicable).
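Since the combined market data stacks several indicators in one long table, a quick summary helps to see what it contains. The sketch below counts observations per indicator and maturity with DataFrames.jl. It assumes DataFrames.jl is available in the active environment and makes no assumption about how the `indicator` or `maturity` values are encoded.

```julia
using TrillionDollarWords
using DataFrames

mkt = load_market_data()

# One row per (indicator, maturity) pair with the number of observations.
# Rows without an applicable maturity end up in their own group.
counts = combine(groupby(mkt, [:indicator, :maturity]), nrow => :n)
```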
TrillionDollarWords.load_model (Method)
`load_model`
Loads the model presented in the paper from HuggingFace. If `load_head` is `true`, the model is loaded with the head (i.e. the final layer) for classification. If `load_head` is `false`, the model is loaded without the head. The latter is useful for fine-tuning the model on a different task or when the classification head is not needed. Accepts any additional keyword arguments that are accepted by `Transformers.HuggingFace.HGFConfig`.
TrillionDollarWords.load_ppi_data (Method)
`load_ppi_data()`
Load the PPI data from the artifact. This is the PPI data used in the paper.
- The `date` column is the date of the event.
- The `value` column is the value of the market indicator (CPI, PPI, or UST).
- The `indicator` column is the market indicator (CPI, PPI, or UST).
TrillionDollarWords.load_training_sentences (Method)
`load_training_sentences()`
Load the dataset with training sentences from the artifact. This is a combined dataset containing sentences from press conferences, meeting minutes, and speeches.
- The `sentence` column is the sentence itself.
- The `year` column is the year of the event.
- The labels in `label` are the manually annotated labels from the paper.
- The `seed` column is the seed that was used to split the data into train and test set in the paper.
- The `sentence_splitting` column indicates if the sentence was split or not (see the paper for details).
- The `event_type` column is the type of event (meeting minutes, speech, or press conference).
- The `split` column indicates if the sentence is in the train or test set.
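Because the `seed` and `split` columns together define a train/test partition, it can be useful to restrict the data to a single seed before working with it. The sketch below does this without assuming how seeds or split labels are encoded: it simply picks the first seed that occurs in the data.

```julia
using TrillionDollarWords

df = load_training_sentences()

# Seeds present in the data; each one corresponds to a train/test split.
seeds = unique(df.seed)

# Keep only the rows that belong to the first seed.
one_seed = df[isequal.(df.seed, first(seeds)), :]

# Inspect how the train and test sets are labelled in the `split` column.
split_values = unique(one_seed.split)
```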
TrillionDollarWords.load_ust_data (Method)
`load_ust_data()`
Load the UST (treasury yields) data from the artifact. This is the UST data used in the paper.
- The `date` column is the date of the event.
- The `value` column is the value of the market indicator (CPI, PPI, or UST).
- The `indicator` column is the market indicator (CPI, PPI, or UST).
- The `maturity` column is the maturity of the UST (if applicable).
TrillionDollarWords.prepare_probe (Method)
`prepare_probe(outcome_data::DataFrame; layer::Int=24, value_var::Symbol=:value)`
Prepare data for a linear probe. The `outcome_data` should be a `DataFrame` with a `sentence_id` column, which should contain unique values. There should also be a column containing the outcome variable. By default, this column is assumed to be called `value`, but this can be changed with the `value_var` argument. The `layer` argument indicates which layer to use for the probe. The default is the last layer (24).
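As a rough sketch of how the requirements above might be met, the snippet below builds an `outcome_data` frame from the combined dataset. Two things are assumptions and not stated in this reference: that CPI rows carry the string `"CPI"` in the `indicator` column, and that keeping the first row per sentence is an acceptable way to make `sentence_id` unique. The structure of the returned object is not described here, so it is kept in a single variable.

```julia
using TrillionDollarWords

df = load_all_data()

# ASSUMPTION: CPI rows are tagged with the string "CPI" in `indicator`.
cpi = df[isequal.(df.indicator, "CPI"), :]

# prepare_probe expects unique sentence_id values, so keep one row per sentence.
# ASSUMPTION: keeping the first occurrence is adequate for illustration.
cpi = unique(cpi, :sentence_id)

# Use the last layer (24) and the default outcome column `value`.
probe_data = prepare_probe(cpi; layer=24, value_var=:value)
```

Check the encoding of `indicator` (for example with `unique(df.indicator)`) before relying on the filter.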