TrillionDollarWords
Documentation for TrillionDollarWords.
TrillionDollarWords.BaselineModel — Type

Struct for the baseline model (i.e. the model presented in the paper).
TrillionDollarWords.BaselineModel — Method

(mod::BaselineModel)(atomic_model::HGFRobertaForSequenceClassification, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the logits.
TrillionDollarWords.BaselineModel — Method

(mod::BaselineModel)(atomic_model::HGFRobertaModel, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the embeddings.
TrillionDollarWords.BaselineModel — Method

(mod::BaselineModel)(queries::Vector{String})

Computes a forward pass of the model on the given queries and returns either the logits or the embeddings, depending on whether or not the model was loaded with the head for classification.
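A minimal usage sketch, not taken verbatim from the package: the query sentences are made up, and the load_head keyword syntax is assumed to follow the load_model description further down.

```julia
using TrillionDollarWords

# Assumed keyword syntax; see `load_model` below.
mod = load_model(; load_head=true)

queries = [
    "Inflation is expected to rise in the coming months.",
    "The committee decided to lower the target range for the federal funds rate.",
]

# With the classification head loaded, calling the model on queries returns logits.
logits = mod(queries)
```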
TrillionDollarWords.get_embeddings — Method

get_embeddings(mod::BaselineModel, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the embeddings.
TrillionDollarWords.get_embeddings — Method

get_embeddings(atomic_model::HGFRobertaForSequenceClassification, tokens::NamedTuple)

Extends the get_embeddings function to HGFRobertaForSequenceClassification. Performs a forward pass through the model to obtain the embeddings, then passes them through the classification head and returns the activations going into the final linear layer.
TrillionDollarWords.get_embeddings — Method

get_embeddings(atomic_model::HGFRobertaModel, tokens::NamedTuple)

Extends the get_embeddings function to HGFRobertaModel.
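A short sketch of get_embeddings on a BaselineModel; the query sentence is made up and the keyword syntax for load_model is assumed.

```julia
using TrillionDollarWords

mod = load_model(; load_head=false)  # assumed keyword syntax; see `load_model` below

queries = ["The labor market remains very tight."]
emb = get_embeddings(mod, queries)
size(emb)  # inspect the dimensions of the returned embeddings
```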
TrillionDollarWords.layerwise_activations — Method

layerwise_activations(mod::BaselineModel, queries::DataFrame)

Computes a forward pass of the model on the given queries and returns the layerwise activations in a DataFrame, where activations are uniquely identified by the sentence_id. If output_hidden_states=false was passed to load_model (the default), only the last layer is returned. If output_hidden_states=true was passed to load_model, all layers are returned. The layer column indicates the layer number. Each single activation receives its own cell, which makes it possible to save the output to a CSV file.
TrillionDollarWords.layerwise_activations — Method

layerwise_activations(mod::BaselineModel, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the layerwise activations of the HGFRobertaModel. If output_hidden_states=false was passed to load_model (the default), only the last layer is returned. If output_hidden_states=true was passed to load_model, all layers are returned. If the model is loaded with the head for classification, the activations going into the final linear layer are returned.
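A hedged sketch of the DataFrame method: it assumes the queries frame carries the sentence_id and sentence columns described under load_all_sentences, and that output_hidden_states is accepted as a keyword by load_model (as stated above).

```julia
using TrillionDollarWords
using DataFrames

# Request all hidden states so that every layer is returned.
mod = load_model(; load_head=false, output_hidden_states=true)

queries = first(load_all_sentences(), 5)   # small sample of sentences
acts = layerwise_activations(mod, queries)

# Activations of the last layer only, via the `layer` column described above.
last_layer = subset(acts, :layer => ByRow(==(24)))
```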
TrillionDollarWords.load_all_data — Method

load_all_data()

Load the combined dataset from the artifact. This dataset combines all sentences and the market data used in the paper.
- The sentence_id column is the unique identifier of the sentence.
- The doc_id column is the unique identifier of the document.
- The date column is the date of the event.
- The event_type column is the type of event (meeting minutes, speech, or press conference).
- The labels in label are predicted by the model proposed in the paper. We use the RoBERTa-large model finetuned on the combined data to label all the filtered sentences in the meeting minutes, speeches, and press conferences.
- The sentence column is the sentence itself.
- The score column is the softmax probability of the label.
- The speaker column is the speaker of the sentence (if applicable).
- The value column is the value of the market indicator (CPI, PPI, or UST).
- The indicator column is the market indicator (CPI, PPI, or UST).
- The maturity column is the maturity of the UST (if applicable).
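A brief, illustrative sketch of working with the combined dataset; the string value "CPI" for the indicator column is an assumption based on the abbreviations listed above.

```julia
using TrillionDollarWords
using DataFrames

df = load_all_data()

names(df)      # column names as documented above
first(df, 5)   # peek at the first few rows

# Rows matched to the CPI indicator (assumes the indicator is stored as the string "CPI").
cpi_rows = subset(df, :indicator => ByRow(==("CPI")))
```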
TrillionDollarWords.load_all_sentences — Method

load_all_sentences()

Load the dataset with all sentences from the artifact. This is the complete dataset with sentences from press conferences, meeting minutes, and speeches.
- The sentence_id column is the unique identifier of the sentence.
- The doc_id column is the unique identifier of the document.
- The date column is the date of the event.
- The event_type column is the type of event (meeting minutes, speech, or press conference).
- The labels in label are predicted by the model proposed in the paper. We use the RoBERTa-large model finetuned on the combined data to label all the filtered sentences in the meeting minutes, speeches, and press conferences.
- The sentence column is the sentence itself.
- The score column is the softmax probability of the label.
- The speaker column is the speaker of the sentence (if applicable).
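For illustration, one way to summarise the sentence data with DataFrames.jl, using only the columns documented above:

```julia
using TrillionDollarWords
using DataFrames

sentences = load_all_sentences()

# Number of sentences per event type and per predicted label.
combine(groupby(sentences, [:event_type, :label]), nrow => :n)
```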
TrillionDollarWords.load_cpi_data — Method

load_cpi_data()

Load the CPI data from the artifact. This is the CPI data used in the paper.
- The date column is the date of the event.
- The value column is the value of the market indicator (CPI, PPI, or UST).
- The indicator column is the market indicator (CPI, PPI, or UST).
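A minimal sketch, assuming the date column is parsed as a Date:

```julia
using TrillionDollarWords
using DataFrames, Dates

cpi = load_cpi_data()
first(cpi, 5)

# Observations from 2020 onwards (assumes `date` is a `Date` column).
recent = subset(cpi, :date => ByRow(d -> year(d) >= 2020))
```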
TrillionDollarWords.load_market_data — Method

load_market_data()

Load the combined market data from the artifact. This dataset combines the CPI, PPI and UST data used in the paper.
- The date column is the date of the event.
- The value column is the value of the market indicator (CPI, PPI, or UST).
- The indicator column is the market indicator (CPI, PPI, or UST).
- The maturity column is the maturity of the UST (if applicable).
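A sketch of splitting the combined market data by indicator, using only the columns documented above:

```julia
using TrillionDollarWords
using DataFrames

mkt = load_market_data()

# Number of observations per market indicator.
combine(groupby(mkt, :indicator), nrow => :n_obs)
```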
TrillionDollarWords.load_model — Method

load_model

Loads the model presented in the paper from HuggingFace. If load_head is true, the model is loaded with the head (i.e. the final layer) for classification. If load_head is false, the model is loaded without the head. The latter is useful for fine-tuning the model on a different task or when the classification head is not needed. Accepts any additional keyword arguments that are accepted by Transformers.HuggingFace.HGFConfig.
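Two typical ways to load the model, assuming load_head and output_hidden_states are passed as keyword arguments (the latter is forwarded to the HuggingFace config, as described above and under layerwise_activations):

```julia
using TrillionDollarWords

# With the classification head: calling the model on queries returns logits.
clf = load_model(; load_head=true)

# Without the head and with all hidden states exposed, e.g. for probing activations.
enc = load_model(; load_head=false, output_hidden_states=true)
```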
TrillionDollarWords.load_ppi_data — Method

load_ppi_data()

Load the PPI data from the artifact. This is the PPI data used in the paper.
- The date column is the date of the event.
- The value column is the value of the market indicator (CPI, PPI, or UST).
- The indicator column is the market indicator (CPI, PPI, or UST).
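Analogous to the CPI sketch above:

```julia
using TrillionDollarWords

ppi = load_ppi_data()
first(ppi, 5)   # inspect the documented columns: date, value, indicator
```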
TrillionDollarWords.load_training_sentences — Method

load_training_sentences()

Load the dataset with training sentences from the artifact. This is a combined dataset containing sentences from press conferences, meeting minutes, and speeches.
- The sentence column is the sentence itself.
- The year column is the year of the event.
- The labels in label are the manually annotated labels from the paper.
- The seed column is the seed that was used to split the data into train and test set in the paper.
- The sentence_splitting column indicates if the sentence was split or not (see the paper for details).
- The event_type column is the type of event (meeting minutes, speech, or press conference).
- The split column indicates if the sentence is in the train or test set.
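A hedged sketch of recovering one train/test fold; the concrete values stored in the seed and split columns are not shown here, so they are inspected rather than assumed:

```julia
using TrillionDollarWords
using DataFrames

train_df = load_training_sentences()

# Inspect the available seeds, then look at one fold's train/test split sizes.
seeds = unique(train_df.seed)
fold = subset(train_df, :seed => ByRow(==(first(seeds))))
combine(groupby(fold, :split), nrow => :n)
```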
TrillionDollarWords.load_ust_data — Method

load_ust_data()

Load the UST (treasury yields) data from the artifact. This is the UST data used in the paper.
- The date column is the date of the event.
- The value column is the value of the market indicator (CPI, PPI, or UST).
- The indicator column is the market indicator (CPI, PPI, or UST).
- The maturity column is the maturity of the UST (if applicable).
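A sketch for the treasury-yield data; the encoding of the maturity column is not documented here, so it is inspected before filtering:

```julia
using TrillionDollarWords
using DataFrames

ust = load_ust_data()

maturities = unique(ust.maturity)   # see how maturities are encoded
one_curve = subset(ust, :maturity => ByRow(==(first(maturities))))
```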
TrillionDollarWords.prepare_probe — Method

prepare_probe(outcome_data::DataFrame; layer::Int=24, value_var::Symbol=:value)

Prepare data for a linear probe. The outcome_data should be a DataFrame with a sentence_id column, which should contain unique values. There should also be a column containing the outcome variable. By default, this column is assumed to be called value, but this can be changed with the value_var argument. The layer argument indicates which layer to use for the probe. The default is the last layer (24).
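A hedged end-to-end sketch: the CPI rows of the combined dataset are used as outcome data purely for illustration, the indicator string "CPI" is an assumption, and the structure of the returned object is not documented here.

```julia
using TrillionDollarWords
using DataFrames

# Build outcome data with one row per sentence_id and a `value` column.
df = load_all_data()
cpi_df = subset(df, :indicator => ByRow(==("CPI")))   # assumed indicator encoding
cpi_df = unique(cpi_df, :sentence_id)                 # unique sentence_id values required

# Prepare probe data for the last layer; the outcome column is `value` by default.
probe_data = prepare_probe(cpi_df; layer=24, value_var=:value)
```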