When incorporating text data into econometric models, a fundamental question is how to represent and select text features. Dictionaries and other unsupervised feature selection approaches have been widely used in financial and economic modelling to provide interpretable measures of interest from text data. However, these methods are often not optimised for the research question at hand, since feature selection is performed separately from subsequent econometric analysis, thereby potentially discarding information relevant to the inference task. We combine a supervised latent Dirichlet allocation topic model with a multivariate Bayesian regression framework, allowing us to simultaneously perform feature extraction and parameter estimation. This has several advantages over existing supervised feature selection methods. With our Bayesian topic regression (BTR) model we aim to make three key contributions:

1. Inform economic theory: As an example for our supervised topic modelling framework, take a classic multi-factor model for predicting asset price characteristics. Our regression model takes those classic factors into account and adds factors of textual information by estimating a probabilistic topic model over a set of textual inputs.
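To make the setup concrete, the following is a minimal sketch of such a design: classic factors and per-document topic proportions enter one regression jointly. All dimensions, coefficient values, and the use of synthetic Dirichlet draws in place of estimated topic shares are illustrative assumptions, not the paper's actual estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n observations, p classic factors, K topics.
n, p, K = 500, 3, 4

X = rng.normal(size=(n, p))            # classic multi-factor regressors
Z = rng.dirichlet(np.ones(K), size=n)  # stand-in for topic proportions (rows sum to 1)

beta_true = np.array([0.5, -1.0, 0.2])        # factor loadings (made up)
gamma_true = np.array([2.0, 0.0, -1.5, 0.5])  # topic coefficients (made up)

# Response depends on both the classic factors and the textual topic features.
y = X @ beta_true + Z @ gamma_true + rng.normal(scale=0.1, size=n)

# Joint fit over the stacked design [X, Z]; a plain least-squares stand-in
# for the Bayesian estimation described in the text.
D = np.column_stack([X, Z])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
```

Because both blocks of regressors sit in one design matrix, the fitted coefficients recover the factor loadings and topic effects together rather than in separate steps.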

We can learn topics that directly explain the effect of interest and conduct inference jointly over the textual topics and the classic factors. Estimating the coefficients on the topics and the classic factors jointly respects the Frisch-Waugh-Lovell theorem, which implies that when a multivariate regression is carried out in multiple steps, the variables must be residualised on the remaining regressors. This allows us to recover unbiased parameter estimates of the model, which are important for interpretability in economic modelling frameworks. Moreover, the interpretability of the textual topics can provide informative guidance about which presumably relevant factors have been neglected by economic theory so far. For example, if a relevant topic covers the occurrence of natural catastrophes, this can inform the researcher to either augment the factor model with existing weather or natural-catastrophe indices, or to use the textual catastrophe topic itself to model its effect on the response variable. The latter option is particularly helpful if no non-textual information on this effect exists.
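The Frisch-Waugh-Lovell point can be illustrated with a toy simulation: when a "topic" feature is correlated with a classic factor, a naive two-step fit without residualisation is biased, while joint estimation (or proper residualisation of both sides) recovers the true coefficients. All numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# A classic factor x and a correlated "topic" feature z (illustrative).
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(scale=0.8, size=n)
y = 1.0 * x + 2.0 * z + rng.normal(scale=0.1, size=n)

def ols(A, b):
    """Least-squares coefficients of b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Joint multivariate regression recovers both true coefficients (1.0 and 2.0).
b_joint = ols(np.column_stack([x, z]), y)

# Naive two-step: y on x alone. The coefficient on x is biased upwards
# because the correlated z was omitted.
b_x_naive = ols(x[:, None], y)[0]

# FWL-consistent two-step: residualise BOTH y and z on x before the
# second-stage regression; this matches the joint coefficient on z.
y_res = y - x * ols(x[:, None], y)[0]
z_res = z - x * ols(x[:, None], z)[0]
b_z_fwl = ols(z_res[:, None], y_res)[0]
```

The contrast between `b_x_naive` and `b_joint` is exactly the bias that joint estimation over topics and classic factors avoids.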

Figure 2 Recovery of true data-generating parameters – Bayesian Topic Regression (BTR) vs standard supervised LDA (sLDA). Bottom panel: Bayesian posteriors over model parameters (mean and two standard deviations) of our BTR model vs supervised LDA. The ground-truth parameter values are depicted in red, illustrating the improved accuracy of our approach.

2. Improve decision making based on predictions: Since we introduce a fully Bayesian inference procedure for a supervised topic model, we obtain uncertainty estimates not only for all parameter estimates but, more importantly, a predictive distribution for our response variable of interest. Decisions based on these predictions can therefore take the model's estimation uncertainty into account. For example, a trading strategy based on asset price predictions from our model not only benefits from the inclusion of additional textual information, but can also condition trade execution on how confident the model is about its predictions. In environments that include trading costs, such information can be useful to outperform strategies that simply trade on point estimates of the prediction for the response variable.
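A minimal sketch of such an uncertainty-aware rule, assuming Gaussian predictive distributions: trade only when the predictive distribution puts high probability on the return clearing the trading cost. The means, standard deviations, cost, and 90% threshold are all hypothetical choices, not values from the paper.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Hypothetical predictive distributions for three assets: posterior mean
# return and predictive standard deviation, as a BTR-style model supplies.
pred_mean = np.array([0.012, 0.004, -0.009])
pred_sd = np.array([0.005, 0.015, 0.004])

cost = 0.002  # assumed round-trip trading cost

# Probability that the return exceeds the cost (long) or falls below
# minus the cost (short), under each asset's predictive distribution.
p_up = np.array([1 - norm_cdf((cost - m) / s) for m, s in zip(pred_mean, pred_sd)])
p_down = np.array([norm_cdf((-cost - m) / s) for m, s in zip(pred_mean, pred_sd)])

# Execute only when the model is at least 90% confident either way:
# +1 = long, -1 = short, 0 = stay out.
signal = np.where(p_up >= 0.9, 1, np.where(p_down >= 0.9, -1, 0))
```

Here the second asset has a positive point estimate but a wide predictive distribution, so the rule stays out of the trade, which is precisely the behaviour a pure point-estimate strategy cannot express.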

3. On-line learning with stochastic variational inference: We implement a stochastic variational inference approach to approximate the inherently intractable posterior distributions of such Bayesian generative models, and show that it still allows recovery of the true underlying data-generating process while requiring a fraction of the computation time of sampling methods. Finally, we show that our model allows meaningful inference when the feature space is very high-dimensional but the number of observations is relatively small, as good estimates of the true data-generating parameters are rapidly discovered in the on-line learning setup of the model. We demonstrate this on synthetic data and on central bank communication data.
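The flavour of such stochastic variational updates can be sketched on a deliberately simple conjugate problem: estimating the posterior mean of a Gaussian with known unit variance from minibatches, using noisy rescaled natural-parameter estimates and a Robbins-Monro step-size schedule. This is a toy stand-in, not the BTR updates themselves; the data size, batch size, and schedule constants are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "corpus": a large data stream with unknown mean mu_true.
mu_true, N, batch = 2.5, 100_000, 100
data = rng.normal(loc=mu_true, size=N)

# Natural parameters of the Gaussian posterior over the mean
# (precision-weighted sum and total precision); prior: mean 0, precision 1.
lam1, lam2 = 0.0, 1.0
tau, kappa = 1.0, 0.7  # Robbins-Monro step-size schedule rho_t = (t + tau)^(-kappa)

for t in range(500):
    x = rng.choice(data, size=batch, replace=False)
    # Noisy estimate of the full-data natural parameters, rescaling the
    # minibatch as if the entire dataset looked like it.
    lam1_hat = 0.0 + (N / batch) * x.sum()
    lam2_hat = 1.0 + N
    rho = (t + tau) ** (-kappa)
    # Stochastic update: convex combination of old and minibatch estimates.
    lam1 = (1 - rho) * lam1 + rho * lam1_hat
    lam2 = (1 - rho) * lam2 + rho * lam2_hat

post_mean = lam1 / lam2  # converges to the true mean after a few hundred cheap updates
```

Each iteration touches only 100 of the 100,000 observations, which is the source of the speed advantage over samplers that must sweep the full dataset per draw.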