The Firm Life Cycle Forecasting Model Using Machine Learning Based on News Articles

The determination of the firm life cycle has been carried out in relation to the establishment of corporate strategy in the field of accounting or management. The life cycle prediction based on financial information is long because it is determined based on the financial performance of the entity over a year. This study sought to lay the foundation for overcoming this by using news articles to predict the life cycle of a company. In the process of quantifying news article data and predicting the firm life cycle, the method of selecting keywords that can represent the firm life cycle is presented, and the life cycle prediction model is verified with four machine learning techniques using selected candidate keywords. In this study, all four machine learning techniques showed a predicted static classification rate of nearly 60%, demonstrating the availability of news articles, which are unstructured text data, in predicting the corporate life cycle.


INTRODUCTION
The term corporate life cycle (or firm life cycle) is a concept that extends the theory on the life cycle of a product to the level of a company.The life cycle of an enterprise is sometimes defined by internal factors, financial factors and markets, and external circumstances as five phases of establishment, growth, maturity, decline, dissolution, or three phases that excludes establishment and decline.It is important because it can be used as an indicator to determine the strategic direction, market response, investment, and direction depending on the phase of the entity's life cycle.Existing research on corporate life cycles mainly consists of analysis methods using financial and market value information.Financial information means information produced with annual, quarterly, or semi-annual cycles.For this reason, there is an inherent limitation of lack of immediateness.In this study, based on the financial and market value information of the enterprise obtained from prior research, the company's life cycle is predicted.Then, the company's life cycle can be tested by utilizing the news article data.In this study, we demonstrate how to select candidates for keywords that can represent a company's life cycle and how to use selected candidate keywords to implement a machine learning model that uses the corporate life cycle phase defined by financial information as a target variable.

Firm Life Cycle Research
The life cycle of an enterprise is generally divided into four phases: founding, growing, maturing, and declining [1].However, depending on the researcher, it can also be divided into five phases (setup, growth, maturity, renewal, and decline) [2].Anthony and Ramesh [3] used dividend propensity, sales growth and corporate age to distinguish the entity's life cycle, and Black [4] separated the entity's life cycle by analyzing its accounting and operating activities, financial activities and investment activities at each phase of its life cycle.Dickinson [5] classified the entity's life cycle according to the direction of cash flows from operating activities, cash flows from investing activities and cash flows from financial activities.In Korea, Kwon [6] studied the effects of components of the book value of equity on stock prices according to the life cycle of the entity.The lifetime cycle predictors used sales growth rates, tangible fixed asset growth rates, and corporate age.The life cycle phases were classified into three phases: growth, maturity and decline [7].It also argued that rational decision-making requires analysis of the components of book value and the lifecycle of the entity [8].Choi et al. [9] used Korean data to analyze the relevance of accounting information to the value of the company's life cycle and to classify the entity's life cycle using variables such as sales growth rate, property, plant and equipment growth rate, and employee growth rate [9].Kwon and Moon [10] noted that the ratio of R&D and capital expenditures claimed by Bens et al. [11] as underlying assets, Market-to-Book value (M/B), and sales growth rates measured the life cycle [10].It also reviewed claims made by DeAngelo et al. [12] that the ratio of retained earnings to capital was used as a life cycle measure [10].In this study, Park and Park [13] measures the firm life cycle using the firm life cycle measurement.Kwon and Moon [10] studied Anthony and Ramesh's corporate life cycle measurements as alternatives, and Park and Park [13] used Kwon's method of measuring the firm life cycle, while Kwon and Moon [10] defined the life cycle based on the tercile, whereas Choi et al. 's [9] measurements were divided into the quintile.

Text Analysis
Among the text analysis areas, general analysis procedures in the field of classification are carried out as procedures for quantifying text, characterization, and application of machine learning analysis techniques [14].Although there are different methods for each procedure, one-hot encoding [15] and the frequency of keyword emergence (TF: Term Frequency) [16] were used in this study.There are two ways to quantify and represent keywords to classify or analyze specific documents, which are one-hot presentation and distributed presentation [14].Local representation is a method of expressing only elements corresponding to the word in an N-dimensional vector with a value of 1 when there is a collection of N words [14].Vectors generated by these methods of expression are called one-hot vectors and are called one-hot retention or one-hot encoding.Local representation is simple and easy to implement, but it has the disadvantage of failing to express the semantic relationship between words in the overall context of the set of words [14].Although there are these shortcomings, news articles contain practical content in the title, the length of the sentences is short and no overlapping words, and carry a single message.For this reason, it was judged that the one-hot presentation method would be more appropriate than the keyword frequency-based TF-Inverse Document Frequency (TF-IDF) method [17] or word2vec [18], which focuses on word order and surrounding words.
On the other hand, from the perspective of the analytical model, the study predicts the life cycle of the enterprise by using the year-business equivalent news article title as a document (case) and extracted keywords as independent variables.The number of keywords extracted at this time is so large that it is essential to reduce the high-dimensional word vector matrix to computable dimensions [19].In this study, the number of variables was reduced using unbalanced indices rather than matrix-based dimensional reduction techniques such as Principal Component Analysis (PCA) [20] and Non-negative Matrix Factorization (NMF) [21] because the number of variables should be reduced around keyword candidates that can represent the life cycle phases of an enterprise.On the other hand, the imbalance index is an index used to determine the degree of competition in the industry when measuring market structure in the field of economics, with the typical imbalance index being the higher enterprise concentration rate, the Hershman-Huffindal index and the entropy index [22].In this study, the candidates for keywords that can represent the firm life cycle were selected first, reduced in scope, and then again, the universal keywords were selected on a subjective basis.This is because even if any candidate keyword is selected, it can be verified with a model implemented through machine learning to demonstrate the representation of the candidate keyword.Among the mentioned unbalance indices, the unbalance index applied to this study is the entropy index.In the case of higher enterprise concentration techniques, the life cycle was divided into three phases and could not be utilized because it was not suitable for selecting the top k.

Machine Learning
Since the life cycle of a company that wants to be classified must be classified into three or more life cycle phases, among the machine learning algorithms based on guidance learning, an algorithm that can also be learned about three or more classes is appropriate.In addition, since a set of keywords affecting each life cycle should be composed of independent variables, an algorithm that can explain the effects of independent variables on three or more classes is appropriate.The decision tree, Support Vector Machine (SVM), and Artificial Neural Network (ANN) were selected as these criteria.Among the decision tree algorithms, the two algorithms, C5.0 [23] and Chi-square Automatic Interaction Detector (CHAID) [24], are considered suitable.For SVM, performance comparisons with other algorithms and reliability reviews of final selection algorithms were chosen because they were generally rated as the best classification among machine learning algorithms [25,26].For the same reason, Multi-Layer Perceptron (MLP) was chosen from the ANN [25,26].

Introduction
The experimental design of this study is as shown in Figure 1.The method of collecting news articles was based on HTTP.The next phase was defined as growth period, maturity period, and decline based on prior research conducted on domestic companies as the phase of defining the firm life cycle based on financial information.At the phase of predicting the firm life cycle based on the news articles to assert in this study, three processes are taken to extract keywords that predict the firm life cycle from news articles.The first process is to create two news articles, Corpus A and Corpus B. Corpus A consists of the full title of news articles corresponding to the target year-enterprise selected through financial information-based corporate life cycle forecasts.Corpus B organizes the titles of news articles grouped by life cycle, separating news articles from selected target year-company.Corpus A is needed to analyze the entire data in order to discover keywords representing the life cycle, and Corpus B is the necessary coppers to detect keyword candidates who can represent the life cycle.
In the second process of the News article-based corporate life cycle prediction phase, only nouns and noun phrase keywords are extracted from Corpus A using a morpheme analyzer to form a list of independent variable candidates for corporate life cycle 1 https://www.kisvalue.com/web/index.jsp 2 https://news.naver.com/main/main.nhn?mode=LSD&mid=shm&sid1=101 forecasting.Then, search the configured keyword variable candidate keywords in Corpus B and calculate the frequency of their appearance in each life cycle to form the 'Keyword Frequency Matrix by Life Cycle' .'Keyword Frequency Matrix by Life Cycle' was used to select keywords that appeared frequently at certain life cycle phases by utilizing unbalanced indices.Each selected keyword is then searched again on Corpus B to indicate whether it appeared, and the year-company and whether the keyword appeared constitutes a matrix of rows and columns, respectively.In this paper, this matrix is referred to as the Year-Firm × Keyword matrix (hereinafter YFK matrix).In the third stage, after applying the machine learning algorithm with the YFK matrix as input data, the most influential keywords in each life cycle are selected and divided into the learning and verification sets to verify the reliability of the model.
In the final step, the keywords detected to represent the firm life cycle will be applied to the model implemented at the previous process in contrast to the results of the 2018 financial information-based firm life cycle forecast, thereby validating the validity of the experiment and demonstrating the significance of the firm life cycle prediction through the news article that is intended to be argued in this paper.

Financial Data and News Articles Collection
Financial information-based firm life cycle forecasts were targeted at companies in the midsize corporation categories of KOSPI and KODAQ300.The data used for the determination was '191060/ Sales Growth (SG)' , '191020/Property, Plant, and Equipment (PP&E) Growth Rate (CE)' , '191090/Change in the Number of Employees (EXP)' , '118300/Retained earnings' , '261000/Capital at the beginning' , '0D2121/PBR(H)' and '0D2122/PBR(L)' .Among them, '118300/Retained earnings' and '261000/Capital at the beginning' were transformed into SE using a formula for dividing the Retained earnings into the Capital at the beginning according to the preceding study, and the values of '0D2121/PBR(H)' and '0D2122/PBR(L)' were also transformed from the Price to Book Ratio (PBR) of the high point and the PBR of the low point to M/B, the value of the market/book value.
The collection of news articles only extracted and stored the title and body of news articles, the time of article creation, URLs, and media companies in order to focus only on the information to be analyzed.

Firm Life Cycle Defined by Financial Data
The firm's life cycle can be determined by the sum of the values of each financial variable, and the method of determination is the same as Table 1 [13].After measuring the five indicators in Table 1 by median of the total 5-year values, including the current period and the past 4 years, each indicator value is divided into five quantile per each year, and points from 1 to 5 in order from the higher quantile.For example, a sample entity is classified as a quantile on a total basis [13] by giving 1 point to the top-five sales growth rate and 5 points to all indicators, and then by aggregating the scores on an enterprise-year basis and sorting the combined scores in ascending order.In this study, the sum of the values granted to each indicator is a minimum of 5 to a maximum of 25 points, and the year-to-enterprise summing up and then, the year-to-year groups of 1st, 3rd, and 5th quantile are defined as growth periods, maturity, and decadence periods.
According to a study by Park and Park [13], companies between 5 and 25 are divided into the quintile, and the 2nd and 4th quantiles are boundary zones and cannot have the discriminative power of the life cycle.Because the reason for using median for values for 5 years including the current period for each metric is that if only current period data are used, the previous study also utilizes median for 5 years, including that year [13].The reference scale and the reference measure for the five quantile classification are as shown in Table 2 [13].
The determined lifecycle of a company is verified by an analysis of basic statistics, with sales growth (SG), capital expenditure growth (CE), and EXP showing relatively higher growth rates in the growth phase than in other stages, and then slowing down in the maturity period and decreasing in the decline.Market value/book value (M/B) is expected to be high in growth and slow in maturity and then decline in decline in decline due to expectations of corporate growth.The proportion of retained earnings (SE) is verified for a tendency to decrease as the period progresses toward decline [13].The appropriate scope and confidence criteria for the underlying statistics of each discriminant scale were unknown and the methods validated in the preceding study were applied.The median profit surplus ratio from the current term to the past 4 years.

News articles preprocessing
Prior to screening the firm life cycle identification keywords from collected news articles, the main content of the preprocessing work is to match news articles to the year-to-year life cycle obtained from financial information.The year-to-year life cycle obtained from financial information informs the phase of the entity's life cycle in the year, in which the collected news data are based on the year and the company's name, and then only the news titles are performed, and the search results are stored together with the year-to-business information and life cycle phases to generate basic data for analysis.When storing, the company's name is stored except for the name of the entity included in the news title, because the company name itself cannot be considered a universal word representing the firm life cycle.The reason for searching only in news titles is that the body of news in the economic category often refers to the entity as well as to its peer group or similar stock themes, and because news related to a particular company refers most clearly to that entity in the title, the scope of the search was limited to titles only.
Large companies and distribution-related companies have a lot of basic news, but they distribute news articles to their respective media outlets in the form of press releases, which can lead to biased results in selecting candidates for the next stage due to the absolute high volume of articles from certain companies.For this reason, the top 1% of news data searched and stored as year-to-business were excluded from the analysis by grouping news data by company and arranging them in descending order of news volume.
The morpheme analyzer was then used to extract words with nouns and noun phrases from the title of the news where the ideal was removed, but only words with more than two syllables were extracted.The reason for extracting nouns and nouns is that other parts (adjectives, verbs, adverbs, etc.) are words used to describe or supplement the situation, and individual entities believe that the message they are trying to convey may contain relatively exaggerated or distorted expressions, which would be inappropriate to use as a keyword representing the life cycle they intend to demonstrate in this study.Extracting only words with two or more syllables has been ruled out because the inclusion of one syllable word may widen the range of keywords and obscure the meaning of the one syllable word, but the exclusion of one syllable is unlikely to affect the demonstration, given the purpose of this study.

Firm life cycle prediction reserve keywords selection
In this study, each extracted keyword is searched on the news again to calculate the frequency of appearance for each life cycle, and since keywords with equal frequency distribution for each life cycle cannot represent a particular life cycle, discriminating candidate keywords with biased frequencies in a particular life cycle are selected using the entropy index.The formula for the entropy index to select discriminating keywords is as follows: In this case, S i means the ratio of keyword frequency for each life cycle and N = 3, since it consists of three life cycle phases.The result of the calculation using entropy index is to sort the value of entropy index in ascending order, review the keywords with the lowest entropy index first, and then select the keyword to exclude.

Training and verification
For scientific verification of keyword candidates for selected life cycle identification, selected keyword candidates are verified through machine learning algorithms.To this end, a matrix in YFK format was created and the text was quantified in one-hot presentation.The generated YFK matrix was taught and verified by applying C5.0 of decision tree algorithm, which is a map learning-based machine learning algorithm, and MLP among CHAID, SVM, and ANN techniques.The target variable for learning is the life cycle phase, and the independent variable is the selected keyword candidates.When applying machine learning techniques, it can be said that the model is stable if the static rate in the verification set does not differ significantly from that in the learning set when the results learned with the learning set are applied to the verification set.In this study, the learning set and verification set are also constructed in a ratio of 70-30 to verify the stability of the model.In addition, among the four machine learning techniques tested in this study, C5.0 of the decision tree technique provides rules among each independent variable by analyzing the effects of independent variables, so the C5.0 algorithm secures keyword rules for lifetime cycle prediction and other techniques are used to compare the classification performance of the C5.0 algorithm.

Data preparation
The financial information collected for learning and verification is 1007 based on the source data, and 11,082 based on the year-to-enterprise basis.The number of year-to-year records for enterprises in their growth, maturity and decline, which eliminate unnecessary or meaningless records, is finally 2269; in the same

Defining firm life cycle through financial data
Securing financial information and defining an entity's life cycle was tested in accordance with the methodology of prior research.The number of financial information year-to-year entities to construct a learning/verification set is 4,498 and the number of year-to-year entities to construct a test set is 825.Based on financial information, the ratio of each component from phase 1 to 5 is within 2% of the learning.It was determined that there would be no significant problem in the composition of the life cycle phase of the test sets (Table 4).
The composition of the underlying data for the definition of the entity's life cycle in the financial information basis determines the distribution of each indicator by the actual values of each indicator in SG, CE, M/B, EXP and SE as shown in Table 5, and then calculates

News articles preprocessing result
The results of the search were retrieved only from the title section of the news data and stored together with the firm life cycle phases based on year-to-year and financial information criteria.The yearfor-business search excluded the names of entities with less than two characters.In addition, only the news of the year was required to search for articles related to the company, and news of department stores and large companies that produced relatively large amounts of articles was excluded from the top 1% list.

Model training and verification
The total number of nouns and noun phrases extracted was 6411 and 6411 keywords were divided and selected into 49 keywords by organizing the frequency of appearing in each life cycle in order to select candidates for the life cycle prediction keyword.Where possible, the 49 selected keywords were randomly selected around management-related keywords of the entity or keywords related to the external environment, and were included in Table 7.In addition, the values in the 'selection' column, expressed in Table 7, are given a life cycle label for judging keyword candidates and are not used to construct a YFK matrix for the actual learning model and are not relevant to the learning model.
Next, the YFK matrix was created for machine learning using selected keyword candidates for predicting the firm life cycle.The learning verification set consists of 4373 rows and 49 independent variables.In other words, a matrix with keyword candidates is made, where each keyword candidate is configured to have a value of 1 or 0 only, whether or not they appear in the year-business news article title. Figure 2 is an example of the composition of the YFK matrix.
Year-firm × keyword matrix files were entered as source data and analyzed using IBM Modeler 18.1.The machine learning analysis result is as shown in Table 8.
As Table 8 shows, C5.0 was the highest at 66.88%, and CHAID was the lowest at 50.43%.The static rate of the verification set was the the quintile by summing the values of each indicator.For the convenience of calculation, the final phase was expressed in numbers, and the values of 1, 3 and 5 in the final phase column correspond to the growing, maturing and declining periods, respectively.The difference in the static rate between the learning set and the verification set of each algorithm is the most stable at 1.18%, and the three algorithms except CHAID show a difference of <2%, so it can be said that the model has been implemented stably.

Prediction test
This study further confirmed the feasibility of the implemented model by constructing a separate set of tests to ensure the validity of the experiment.10,211 news data already collected for testing in 2018 were utilized.
Table 9 is the result of measuring the static rate by applying four different machine learning algorithms.As shown in Table 9, mechanical learning algorithms excluding CHAIDs show a slightly lower static rate than the static rate in the learning and verification sets, but generally around 63%.In the case of CHAID, the static rate of the test set was found to be higher than that of the learning and verification sets.As shown in the categorization rate of three machine learning algorithms other than CHAID, any keyword extracted from 2011 to 2017 was proven to be stable through learning and verification sets.Moreover, the results of testing the learning model with 2018 news data not included in learning show that the group of candidates for corporate life cycle prediction keywords used in the learning verification set can predict the firm life cycle with a static rate of about 63% for new news data.However, since the three algorithms except the C5.0 algorithm do not know which of the keyword candidates for the firm life cycle predictive, this study argues that it is the rule of independent variables for each phase of life cycle calculated in C5.0-the rule of the life cycle prediction keyword.

Prediction keyword rules
The rules for keywords that affected the static rate through the C5.0 algorithm according to the results of the previous experiment are listed in Table 10.The basic life cycle phase was calculated in three phases and the life cycle was predicted with 17 rules, one rule, and 15 rules respectively during the growth, maturity and decline phase.In Table 10, the 'number of cases' column refers to the number of cases determined by the keyword regular expression, and the probability is the probability that the rule will predict its life cycle.For example, a news article with the keyword "preemptive right" in the news article, and without the keyword "court, " can predict the life cycle during the growth period with a probability of about 86.7%, and the corresponding number of news articles is 13.

CONCLUSION
In this study, after determining the financial information based firm life cycle presented in the preceding study, we verified whether the firm life cycle can be predicted by keywords extracted from news data based on these discriminations.Consequently, in predicting an entity's life cycle, the entity's life cycle could be predicted at a static rate of about 63% over the life cycle determined on a financial information basis.All four machine learning algorithms showed a slight difference in the static rate between the learning set and the verification set, proving that the composition of the model was stable.Moreover, the experiment showed that the learned model was applied to a separate set of tests other than the learning and verification set, resulting in a correction rate of about 63%.News-based firm life cycle models are also expected to be useful in the actual business environment or in the assessment of the enterprise.This means that the entity's life cycle can be determined in a faster time than the existing financial information method, as it is characterized by daily production of news data, compared to that of the financial information method which is typically one year long or at a three-month cycle.

Future Challenges and Discussion Topics
The topics that can be studied in the future are as follows: • A study on the correlation of financial information with keywords in news article.
• Application of various machine learning algorithms to increase predictability.
• A preprocessing algorithm that selects the characteristics of text to determine the correlation between keywords and the firm life cycle.
• A study on how to determine data with different phases of time flow or development, such as life cycle, through text data.

Figure 1 |
Figure 1 | Design of the experiment.
Financial information was collected through National Information & Credit Evaluation Inc. (NICE's) KIS-Value 1 from 2007 to 2018.The collection of news articles was based on articles in the economic category of Korean portal site Naver 2 from 2011 to 2018.

Figure 2 |
Figure 2 | The composition of the YFK matrix.

Table 1 |
Methods for calculating financial information for predicting the firm life cycle

Table 2 |
Measures of financial information for measuring the firm life cycle

Table 3 |
Summary of financial information and news article data , records for test sets of financial information criteria are 1007, 5893, and 506, respectively.The study verification and test set were based on the enterprise category based on KIS-Value and the same number of test sets were defined using financial information for 2018.Meanwhile, for news data, the total target news data for learning and verification sets is 2,865,667 and the final year-tobusiness number is 34,261.The number of keywords extracted from the nouns and noun phrases is 6441, and the number of keywords randomly selected for this study is 49, and the number of records when the YFK matrix was composed of 4373.The composition of test sets for news articles is 488,748 overall news articles, which is the number of articles for 2018.Year-The number of corporate news is 10,253 and the number of records comprising the YFK matrix is 1709.Since the test is conducted by applying the machine learning model generated from the learning and verification set, the test set does not extract keywords or randomly select keyword candidates.Details of the data configuration are shown in Table3.The excluded data were for records missing years from the source data, records with data entered as zero prior to 2010, records containing data that were broken down into zero and generated errors, and records with values for each metric being upper or lower 1%. way

Table 4 |
Firm life cycle basic statistics based on financial information

Table 5 |
Example of life cycle discrimination based on financial information

Table 7 |
Reserve keyword for firm lifecycle prediction based on news articles G, Growth phase; M, Maturity phase; D, Decline phase.

Table 6 |
Year by corporate life cycle phase number of corporate news Table 6 lists the results of each life cycle phase news article obtained through search results in a table, divided into learning, verification, and test sets.

Table 9 |
Final results of machine learning algorithm learning, verification, and testing

Table 8 |
Comparison of classification rate of learning and verification by machine learning algorithms