This line of research is known as “financial statement analysis” and originates from the task of predicting the direction of future earnings from financial statements in order to inform investment decisions. Investors focus on earnings because earnings are predictors of future cash flows. This type of strategy is referred to as a “fundamental valuation” of a stock, which derives a firm’s intrinsic value by discounting future cash flows to their present value.

Figure 1: (a) shows how the respective training and test sets are generated for subsequent periods. Moving forward from the past, the data of each period is first used as a test set to evaluate the model trained on the preceding data, and is then added to the training data for the next period. Consequently, the amount of training data grows as training progresses. (b) shows how the samples that make up the training data are constructed with a rolling window. A sample of a particular firm, with its dependent variable in a particular (orange) quarter, has the balance sheets of the past four quarters as inputs. These quarterly data points constitute the training and test sets shown in (a).
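The expanding train/test scheme of panel (a) can be illustrated with a minimal sketch. The function name `expanding_window_splits`, the parameter `min_train_periods`, and the quarter labels are assumptions made for illustration only and are not taken from our implementation.

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_window_splits(
    periods: Sequence[str], min_train_periods: int = 1
) -> Iterator[Tuple[List[str], str]]:
    """Yield (training periods, test period) pairs in chronological order.

    `periods` must already be sorted from oldest to newest.
    """
    for i in range(min_train_periods, len(periods)):
        # Everything before period i is training data; period i is the test set.
        yield list(periods[:i]), periods[i]


quarters = ["1988Q1", "1988Q2", "1988Q3", "1988Q4"]  # illustrative labels
splits = list(expanding_window_splits(quarters))
for train, test in splits:
    print(f"train on {train} -> test on {test}")
```

Each period appears exactly once as a test set and afterwards only as training data, so no model is evaluated on data it was trained on.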

Our models, including Deep Neural Nets, Gated Recurrent Units, and Random Forests (CART), are trained on approximately 400,000 past financial statements of companies listed in the United States in the period between 1988 and 2006. We use a rolling window to construct a history of the four previously published financial statements as the independent variables, and use this history as input to predict the market reaction to the next (unknown) financial statement.
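The rolling-window sample construction can be sketched as follows for a single firm. This is a simplified illustration under assumed names (`build_samples`, `Statement`, and the toy data are hypothetical); it assumes one statement per quarter with no gaps, which real data does not guarantee.

```python
from typing import List, Tuple

Statement = List[float]  # one quarter's financial statement items


def build_samples(
    statements: List[Statement], reactions: List[float], history: int = 4
) -> List[Tuple[List[Statement], float]]:
    """Pair each quarter's market reaction with the `history` preceding statements.

    Both lists belong to one firm and are sorted from oldest to newest quarter.
    """
    samples = []
    for t in range(history, len(statements)):
        # Quarters t-history .. t-1; the statement of quarter t itself is excluded.
        window = statements[t - history:t]
        samples.append((window, reactions[t]))
    return samples


# Toy firm with six quarters and two statement items per quarter.
firm_statements = [[float(q), float(q) * 10.0] for q in range(6)]
firm_reactions = [0.00, 0.02, -0.01, 0.07, -0.12, 0.03]
samples = build_samples(firm_statements, firm_reactions)
```

With six quarters and a history of four, the toy firm yields two samples: quarters 0 to 3 predict the reaction in quarter 4, and quarters 1 to 4 predict the reaction in quarter 5.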

The source of our financial statement data is the Compustat FUNDQ file, while we take the stock prices for the market reaction from the Center for Research in Security Prices (CRSP). Both sources are linked by the “rdq” column in the FUNDQ file, which contains the announcement date, and by the company identifiers “gvkey” and “LPERMNO” from the CRSP link table. As about 23% of the values in our selected financial statement items are missing, we impute them using matrix completion via iterative soft-thresholded singular value decomposition.
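To illustrate the imputation idea, the following is a minimal NumPy sketch of iterative soft-thresholded SVD. It is not our production code: the function name `soft_impute`, the `shrinkage` value, the stopping rule, and the synthetic rank-one test matrix are all assumptions, and real statement items would typically need to be scaled first.

```python
import numpy as np


def soft_impute(X, shrinkage=0.1, max_iter=200, tol=1e-5):
    """Fill NaN entries of X with a low-rank estimate (Soft-Impute style)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    Z = np.where(missing, 0.0, X)  # start with zeros in the missing cells
    for _ in range(max_iter):
        # Keep observed entries, use the current estimate for the missing ones.
        filled = np.where(missing, Z, X)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)  # soft-threshold the singular values
        Z_new = (U * s) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return np.where(missing, Z, X)  # observed values are returned unchanged


# Synthetic rank-one matrix with roughly 23% of the entries removed.
rng = np.random.default_rng(0)
truth = np.outer(rng.normal(size=20), rng.normal(size=5))
mask = rng.random(truth.shape) < 0.23
data = truth.copy()
data[mask] = np.nan
completed = soft_impute(data)
```

Only the missing cells are replaced; observed values pass through untouched, so the imputation never overwrites reported figures.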


To prevent global market movements from affecting the predictions, our dependent variable represents the abnormal return of the stock, calculated as Buy-And-Hold Returns (BAHRs). The BAHRs are the product of abnormal daily returns, each controlled for the market movement of that day, and are thereby a good proxy for the abnormal return an investor would actually earn over the holding period from one day before the announcement to 30 days after it. We find the BAHRs to be normally distributed, with a mean of zero and a standard deviation of 25%. Because the information content of minor negative and minor positive reactions may be ambiguous, we create an epsilon environment of ± 5% and ± 10%, in which we set predictions and actual reactions to zero.

Figure 2: Abnormal (BAHR) profitability simulation at an epsilon of ± 5%. (a) Quarterly abnormal profit (b) Accumulated abnormal profitability.
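The construction of the dependent variable and the epsilon environment can be sketched as below. The sketch assumes one particular reading of the description, namely that the daily abnormal return is the stock return minus the market return and that these are compounded over the window; a common alternative compounds stock and market returns separately and takes the difference. The function names and the inclusive band boundary are likewise assumptions.

```python
from math import prod
from typing import Sequence


def buy_and_hold_abnormal_return(
    stock_returns: Sequence[float], market_returns: Sequence[float]
) -> float:
    """Compound daily market-adjusted returns over the holding window."""
    abnormal = [r - m for r, m in zip(stock_returns, market_returns)]
    return prod(1.0 + a for a in abnormal) - 1.0


def apply_epsilon(value: float, epsilon: float = 0.05) -> float:
    """Set reactions inside the band [-epsilon, +epsilon] to zero."""
    return 0.0 if abs(value) <= epsilon else value


# Three illustrative trading days; the real window runs from day -1 to day +30.
stock = [0.02, -0.01, 0.03]
market = [0.01, 0.00, 0.01]
bahr = buy_and_hold_abnormal_return(stock, market)
```

In this toy case the BAHR is about 2%, which lies inside the ± 5% band and is therefore set to zero by `apply_epsilon`, while a reaction of 8% would be kept at ± 5% and zeroed at ± 10%.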

The Deep Neural Net (DNN) model is trained for 10 epochs with 5 layers (1000, 2000, 1000, 500, and 10 units), batch normalisation, and an exponential linear unit as the activation function, minimising the mean squared error (MSE). Similarly, the Gated Recurrent Unit (RNN) model is trained for 10 epochs. Both neural-net-based models are trained with a batch size of 10 samples. The Random Forest (RF) consists of 100 regression trees with a maximum depth of 20.
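As a rough illustration of the DNN layout, the following NumPy sketch runs a single forward pass and evaluates the MSE on one batch. Only the layer widths, the batch size, the ELU activation, batch normalisation, and the MSE loss come from the description above. The input width `N_FEATURES`, the ordering dense, batch normalisation, ELU, the scalar linear output head, the initialisation, and the random stand-in data are assumptions, and no training step is shown.

```python
import numpy as np

HIDDEN_UNITS = [1000, 2000, 1000, 500, 10]  # layer widths stated in the text
BATCH_SIZE = 10                             # batch size stated in the text
N_FEATURES = 80                             # hypothetical input width


def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def batch_norm(x, eps=1e-5):
    """Normalise each unit over the batch (batch statistics, no learned scale or shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def init_params(rng, n_in, widths):
    """Random weights for each hidden layer plus an assumed scalar output head."""
    params = []
    for n_out in list(widths) + [1]:
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        params.append((W, np.zeros(n_out)))
        n_in = n_out
    return params


def forward(params, x):
    """Dense -> batch norm -> ELU for every hidden layer, then a linear output."""
    for W, b in params[:-1]:
        x = elu(batch_norm(x @ W + b))
    W, b = params[-1]
    return x @ W + b


def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
params = init_params(rng, N_FEATURES, HIDDEN_UNITS)
x_batch = rng.normal(size=(BATCH_SIZE, N_FEATURES))    # random stand-in inputs
y_batch = rng.normal(0.0, 0.25, size=(BATCH_SIZE, 1))  # random stand-in targets
prediction = forward(params, x_batch)
loss = mse(prediction, y_batch)
```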

We evaluate the models on the accuracy of the predicted direction of the market reaction and on the profitability derived from taking long (short) positions based on the predictions. Our profitability simulation does not take transaction costs into account because they vary between brokers, position sizes, and stocks. To calculate the quarterly return, we assume that a fixed-size nominal portfolio is spread among all trades, so that the calculated percentage gain (loss) represents the average profit (loss) that would have been made over all these trades.
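Both evaluation measures can be sketched in a few lines. The sketch assumes that a positive prediction opens a long position, a negative prediction a short position, and a zero prediction (inside the epsilon environment) no trade; the function names and the toy numbers are illustrative only.

```python
from typing import Sequence


def sign(x: float) -> int:
    """Return -1, 0, or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def directional_accuracy(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Share of samples whose predicted direction matches the realised one."""
    hits = sum(sign(p) == sign(a) for p, a in zip(predicted, actual))
    return hits / len(predicted)


def quarterly_return(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average abnormal return over all trades taken in one quarter.

    Long when the prediction is positive, short when negative, no trade at zero.
    Transaction costs are ignored.
    """
    trade_returns = [sign(p) * a for p, a in zip(predicted, actual) if sign(p) != 0]
    if not trade_returns:
        return 0.0
    return sum(trade_returns) / len(trade_returns)


predicted = [0.12, -0.08, 0.0, 0.20]  # model outputs after the epsilon step
actual = [0.10, -0.06, 0.03, -0.04]   # realised BAHRs
accuracy = directional_accuracy(predicted, actual)
profit = quarterly_return(predicted, actual)
```

In this toy quarter three trades are taken (long, short, long) with abnormal returns of +10%, +6%, and -4%, giving an average of 4% per trade, while the direction is predicted correctly for two of the four samples.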