Article

Which Influencers Can Maximize PCR of E-Commerce?

1
College of Computing & Informatics, Sungkyunkwan University, Seoul 03063, Republic of Korea
2
Department of Artificial Intelligence Convergence, Sungkyunkwan University, Seoul 03063, Republic of Korea
3
MechaSolution, Daegu 42715, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(12), 2626; https://doi.org/10.3390/electronics12122626
Submission received: 13 May 2023 / Revised: 6 June 2023 / Accepted: 7 June 2023 / Published: 11 June 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract

The Web is increasingly used as a medium for e-commerce, alongside various recommender systems. Explicit feedback collected by recommender systems (e.g., a form in which users enter their preferences for various items as numerical values) can be analyzed to estimate customer interest; in addition, multi-modal feedback (e.g., product purchase traces, inquiry lists, inquiry times, and comments) can be analyzed with deep learning to estimate user interest. As many companies around the world promote their products through micro-influencers on the Web, research on predicting the purchase conversion rate of an influencer with a variety of technologies has continued. In this work, we present a multi-modal micro-influencer analysis scheme for a marketing maximization strategy. Our scheme uses the multi-modal data stored in Mecha Solution’s own shopping mall in Korea, together with data such as article postings, comments, and statistics from well-known Korean Internet platforms (Coupang, Naver, and Olive Young). We extract the main characteristics of genuine postings and comments by real users, as opposed to those produced artificially for promotion, and identify articles that are not advertisements; influencer scores are then obtained under the assumption that non-advertisement articles can further increase the purchase conversion rate. Based on the influencer score, the proposed scheme recommends influencers for items that they have not yet reviewed, using content-based collaborative filtering and user-based collaborative filtering. Experiments were conducted to verify that the proposed scheme achieves this goal.

1. Introduction

1.1. Social Influencers as a Marketing Medium

Social influencers are emerging as a major factor in marketing. Influencers are individuals with high influence and a ripple effect on social media sites such as Facebook (i.e., Meta) and Instagram worldwide. By definition, they share and communicate with digital neighbors about daily life and emerge as “stars”.
As the impact of influencers expands, they are frequently used as a medium for marketing because they have clear economic advantages over traditional TV advertising and sufficient marketing effects. With the increase in the number of influencers, a classification based on follower count has emerged as a marketing strategy. The criteria for follower classification may vary depending on the purpose and context, but generally, influencers with 1–10 k followers are defined as nano-influencers, while those with 10–100 k followers are defined as micro-influencers. Influencers with 100,000 to one million followers are defined as macro-influencers, and those with over one million followers are defined as mega-influencers, often referred to as online or offline celebrities.
This study focused on micro-influencers, which are gaining attention in the field of marketing in Korea. Micro-influencers have fewer followers compared to mega-influencers, but they offer advantages in terms of cost-effectiveness and having a closer relationship with their followers. As a result, many companies are utilizing them as marketing mediums. However, from the perspective of companies that would hire an influencer, it is not easy to identify an influencer that can create the most profit while considering the cost of hiring them. Therefore, research has been conducted to determine ways to find the most effective influencer.

1.2. Related Works

Previously, a person directly searched for and hired an influencer by examining their number of followers, views, and written information. However, this method is not objective, economical, or time-efficient, as there may be deviations from person to person [1,2].
In the study conducted by Tran et al. [1], the authors constructed a graph illustrating the connection between users who exhibit interest in a particular brand and an influencer group. This graph was generated using passion points and measures of information propagation. By implementing this algorithm on real Vietnamese social networks, they demonstrated that the marketing effectiveness was significantly improved compared to a group of influencers selected using previous methods.
Another study by Tran et al. [2] described ways to identify influencers by category and a method to identify a key opinion leader (KOL). Likewise, the authors also emphasized the importance of detecting categorical influencers.
Therefore, much of the previous related research [3,4] proposed influencer detection strategies using graph theory and transformer-based deep learning to measure time-sensitive and topic-specific influence on social networks.
In the study by Wang et al. [3], a new model, Time-Sensitive and Topic-Specific Influence Measurement (TTIM), was proposed to find influencers on social networks by simultaneously considering the period and topic of advertisement. SeededLDA was used to calculate the user’s interest in a topic from text data and identify the distribution of users and topics after influence propagation using the influence attention network of an artificial neural network. After that, the LSTM artificial neural network was used to derive the influencer scores of users and topics considering time.
The purpose of the study by Oh et al. [4] was to create a system using Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Dirichlet Allocation (LDA) algorithms so users can enjoy successful promotional effects when marketing new items by extracting and modeling influencer characteristics. On the basis of the video data of the influencer who made the promotional video for the “Swipe” app channel, the system analyzed the data provided by YouTube to obtain the correlation coefficient and drew a conclusion. The system then extracted words through the TF-IDF analysis and created a system through LDA modeling. The system was able to recommend influencers to companies that wanted to conduct marketing activities according to subject.
| No. | Name of Thesis | Published in | Research Topics | Characteristics | Dataset |
|---|---|---|---|---|---|
| 1 | CID: Categorical Influencer Detection on microtext-based social media | Emerald Insight, Vol. 44 No. 5, pp. 1027–1055. | Proposed a system, Categorical Influencer Detection (CID), to identify categorical influencers on social media channels | Uses a Variational Autoencoder (VAE) to simulate the LDA process | Facebook, Twitter, YouTube, Instagram datasets |
| 2 | A Graph Based Approach for Effective Influencer Marketing | International Journal of Data Mining and Knowledge Management Process (IJDKP) Vol. 9, No. 4, 2019 | Proposed an algorithm to select the minimum number of influencers that can reach the desired audience | Uses a greedy algorithm to rationalize the minimum number of influencers | 500 influencers from Pakistan with 5000 to 300,000 followers |
| 3 | Measuring Time-Sensitive and Topic-Specific Influence on Social Networks with LSTM and Self-Attention | 2020, IEEE Access, Vol. 8, pp. 82,481–82,492 | Proposed an advanced framework to measure social network influence in terms of both time and topic | Uses SeededLDA, advanced GAT, and matrix-based LSTM to measure time and topic simultaneously | Three labeled datasets from Twitter and Reddit |
| 4 | Influencer Attribute Analysis-based Recommendation System | Journal of the Korea Institute of Information and Communication Engineering, Vol. 23, No. 11: 1321~1329, Nov. 2019 | Proposed a system to recommend the best influencer for marketing or categorical users | Uses a TF-IDF, LDA algorithm to make recommendations | Numerical data from YouTube videos which advertise the application “Swipe”. |
| 5 | Analyzing Impacts of Country, Product, and Influencer on Purchase Intention in the Chinese Cosmetics Market | International Commerce and Information Review, Volume 22, Number 2, June, 2020: pp. 309~329 | Proposed the advertisement effect which can influence customer buying intentions | Uses a structural equation model to analyze relationships and uses an SPSS program to check the validity of data for comparison | Survey of online customers that buy cosmetic products |
| 6 | Metrological Analysis of Online Consumption Evaluation Influence Commodity Marketing Decision Based on Data Mining | Mathematical Problems in Engineering, Volume 2020 | Proposed a model suitable for any platform and any commodity to mine the potential information behind ratings and reviews of online products | Creates a prediction model using an intelligent algorithm BP neural network and a fuzzy comprehensive evaluation model based on principal component analysis | Scores and comments on Amazon Market |
However, these studies do not apply a quantitative and qualitative analysis of customers’ purchase interest to marketing.
In the study by Cho et al. [5], the publicity effect according to the characteristics of influencers was analyzed along with the influence of country and product image on consumer purchase intention in the Chinese cosmetics market. On the basis of the product evaluation data, the relationship between variables was analyzed using structural equation modeling (SEM). Using the SPSS program, the reliability and validity of the variables were checked, and the variables were compared by quantifying the influencer’s expertise, reliability, interactivity, and attractiveness. The reliability of the influencers and the level of consumer interactions were confirmed to affect purchase intention.
Consumer reviews, ratings, and influencer blog posts have become important factors in optimizing marketing strategy. A research study on effective influencer detection has been conducted. The study of [6] analyzed whether ratings and reviews affect consumer psychology based on quantitative information analysis such as a rating for each product, and qualitative data analysis such as text evaluation. The authors used the backpropagation neural network prediction model to identify effective reviews. Effective reviews were identified based on the sum of the rating and the length of the review; reviews were considered effective when the error between the actual evaluation and the predicted evaluation was small. In addition, a fuzzy comprehensive evaluation model was used to analyze the product’s competitiveness by analyzing the evaluation data for each product. As a result, the fluctuation range was not stabilized and trended downward.
In addition, when analyzing the ratings according to the use of a specific word with a high frequency of use, there was no association with negative words. However, there was a correlation with high ratings and positive words. For example, in reviews with high ratings, the frequency of words such as “how many” and “love” was high, confirming that consumers directly expressed their liking of and interest in recommending the products. This was implemented through word segmentation and analysis using Python’s word segmentation module.
Influencer posts often contain multiple user comments, and many consumers are influenced by these comments when making purchasing decisions. User comments may include meaningless spam comments, formal comments without sincerity, and marketing comments for artificial promotion. Meaningless reviews can negatively affect purchase intentions [7]. The authors of [8,9,10,11] tried to detect the spam comments on social networks.
The study by Francesco Cauteruccio et al. [12] described ways to detect not safe for work (NSFW) content on Reddit. In this study, content detection was conducted by selecting patterns that have high frequency and utility in posts and comments. What this paper emphasizes is that not only frequency but also utility measurements were conducted.
Based on [7,12], it is important to classify utility. We used meaningful comments as the basis for the utility of articles.

1.3. Our Works

Many studies use deep learning methods to detect influencers based on topic-specific impact and make efforts to consider consumer interest. However, these schemes have not quantified an influencer score for the purchase conversion rate. The purchase conversion rate (PCR) measures the proportion of people exposed to an influencer’s article who purchased the product introduced in the article. As internal company data are required to obtain accurate purchase conversion rate data, this study presents a model that can indirectly obtain a value close to the purchase conversion rate using influencer scores.
Additionally, this study specifically considers fake comments. By extracting various characteristics of each influencer post, using the comments and the portions of the article that do not contain advertisements, an influencer score reflecting the purchase conversion rate is obtained. Among these characteristics, the comments deserve a closer look because they reveal customer opinion at the time of posting. A careful look at blog comments shows that some have nothing to do with the article or product. Typical examples are comments that promote the commenter’s own blog or that only say hello without mentioning the product. Such comments can hardly be regarded as showing interest in the article or product, so problems arise when the raw number of comments is used as an evaluation factor for the completeness of articles. If all such count-inflating comments are included, it becomes difficult to count real consumer reactions; therefore, we count the real reactions to blog posts by classifying real and fake comments using Long Short-Term Memory (LSTM) deep learning. This classification is performed in the preprocessing step of the proposed scheme, before the influencer score regarding purchase conversion rate is quantified.
Finally, we propose a technique to detect the top influencers who can maximize the purchase conversion rate in various fields. For each influencer, we consider whether the number of comments on all posts written by the influencer over five years is above a certain level (e.g., higher than the average number of comments), or whether the proportion of meaningful and positive comments is higher than the overall ratio (i.e., the number of positive comments/total number of comments). After selecting the top influencers based on comment analysis with an LSTM model, we check the final score for each influencer using the recommendation system based on the PCR data analysis of the posted articles. Company marketers can therefore be provided with a list of influencers selected by the proposed algorithm, making it much more convenient to select influencers.
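The selection rule described above can be sketched as follows. This is a minimal illustration, assuming that the per-influencer comment counts have already been produced by the comment classifier; the class name `InfluencerStats`, its fields, and the function `select_top_influencers` are hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass
class InfluencerStats:
    name: str
    total_comments: int     # comments on all posts in the observation window
    positive_comments: int  # comments judged meaningful and positive


def select_top_influencers(stats):
    """Keep influencers whose comment count exceeds the mean, or whose
    positive-comment ratio exceeds the overall ratio."""
    if not stats:
        return []
    all_comments = sum(s.total_comments for s in stats)
    all_positive = sum(s.positive_comments for s in stats)
    mean_comments = all_comments / len(stats)
    overall_ratio = all_positive / all_comments if all_comments else 0.0

    selected = []
    for s in stats:
        ratio = s.positive_comments / s.total_comments if s.total_comments else 0.0
        if s.total_comments > mean_comments or ratio > overall_ratio:
            selected.append(s.name)
    return selected
```

Using "greater than the mean" as the threshold is only one possible reading of "above a certain level"; a different threshold can be substituted without changing the structure.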

2. Background

2.1. RNN and LSTM Background

2.1.1. Recurrent Neural Network (RNN)

An RNN, unlike a feed-forward artificial neural network (ANN), forms a loop so that information persists throughout the network. As illustrated in Figure 1, given a neural network A with input Xt and output Ot, the loop passes information from each step of the network to its successor. Therefore, RNNs are well suited to processing data in the form of chains or lists. Due to this trait, RNNs are frequently used in speech recognition and language modeling.

2.1.2. Long Short-Term Memory (LSTM)

An LSTM can be considered a sub-category of RNNs. Although RNNs make use of the previous output Ot−1, as the gap between a relevant past input and the current step grows, RNNs struggle to connect current information with information input in the distant past.
To deal with this problem, LSTMs have a more complex structure than simple RNNs and involve several interacting layers. Figure 2 shows the computation of Ht in an LSTM. The upper arrow carries the cell state Ct−1. It runs more directly than the arrows below, which shows that Ct−1 undergoes only minor interactions. The cell state is modified through gates, structures that control whether information is removed from or added to the cell state. Each gate consists of a sigmoid layer and a pointwise multiplication operation; the sigmoid outputs values between 0 and 1, and the higher the value, the more of the corresponding component is let through. An LSTM has three such gates. The first sigmoid layer, known as the “forget gate layer”, receives Ht−1 (the output of the previous LSTM step) and Xt (the t-th input) and returns values ft between 0 and 1 that determine how much of Ct−1 should pass through. Next, another sigmoid layer, known as the “input gate layer”, determines which values will be updated. A tanh layer then produces candidate values that can be added to the state. To update the state, the forget gate output ft is multiplied by the old cell state Ct−1, the output of the input gate layer is multiplied by the output of the tanh layer, and the two products are added to give the new cell state Ct. Finally, the last sigmoid layer (the output gate) and a tanh applied to Ct decide which part of the cell state becomes the output Ht.
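The gate computations for one time step can be written out as a short pure-Python sketch. It follows the standard LSTM formulation; the parameter dictionary layout (`Wf`, `bf`, and so on) and the helper names are assumptions made for illustration, not the configuration used in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns (H_t, C_t) from X_t, H_{t-1}, C_{t-1}."""
    v = h_prev + x_t  # list concatenation [H_{t-1}, X_t]
    f_t = [sigmoid(z) for z in affine(params["Wf"], v, params["bf"])]      # forget gate
    i_t = [sigmoid(z) for z in affine(params["Wi"], v, params["bi"])]      # input gate
    c_hat = [math.tanh(z) for z in affine(params["Wc"], v, params["bc"])]  # candidates
    o_t = [sigmoid(z) for z in affine(params["Wo"], v, params["bo"])]      # output gate
    # New cell state: keep part of the old state, add part of the candidates.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # Output: gated, squashed view of the new cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate outputs 0.5 and the candidate values are 0, so the new cell state is exactly half of the old one, which is a convenient sanity check.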

2.2. Transformer Model

While transformer models have made significant advancements in NLP tasks, this study mainly used LSTM for the following reasons. First, the hidden state of an LSTM model can be interpreted to determine which words or contexts are important. As described in Section 3, Research Method, our study analyzed comments and posts according to criteria such as advertising versus non-advertising and IT versus non-IT, and this interpretability may provide new insights for future work. Second, LSTM is efficient to train: its architecture is simple and requires less training time and fewer computational resources. Third, it is well suited to small training sets, requiring less labeled training data than large pre-trained models such as BERT or GPT. However, as the transformer model has many advantages, we expect that it could further improve our research, and we will consider applying it in future studies.

2.3. TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a value obtained by multiplying the term frequency of a word by its inverse document frequency. If a specific word is t and the set of all documents is D, then TF(t,d) represents the frequency of occurrence of t in document d. DF(t) is the number of documents d in D in which t appears at least once. IDF(t) is inversely related to DF(t):
$$\mathrm{IDF}(t) = \log\left(\frac{|D|}{|\{d \in D : t \in d\}| + 1}\right).$$
The value of 1 is added to prevent the denominator from being 0 when t does not appear in any document in D. The logarithm is applied to prevent the IDF value from becoming too large when the document frequency of t is small. TF-IDF is the product of TF and IDF, and it can be used to identify the importance of a specific word in a specific document.

2.4. Collaborative Filtering

Collaborative filtering is a technique that collects preference information from many users and uses it to recommend items that a user is likely to be interested in. Item-based collaborative filtering predicts customer preferences using the similarity between items; cosine similarity and Pearson similarity are used to calculate this similarity. Similar items are grouped, and if a user has not used an item but has used another item in the same group, the unused item is recommended.
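As an illustration of item-based collaborative filtering with cosine similarity, the sketch below predicts a score for an item that an influencer has not yet reviewed. The nested-dictionary layout `scores[influencer][item]`, the function names, and the choice to treat missing scores as 0 are simplifying assumptions for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_score(scores, influencer, item):
    """Similarity-weighted average of the influencer's scores on other items."""
    people = sorted(scores)

    def column(it):
        # Item vector over all influencers; missing scores are treated as 0.
        return [scores[p].get(it, 0.0) for p in people]

    target = column(item)
    num = den = 0.0
    for other, value in scores[influencer].items():
        if other == item:
            continue
        sim = cosine(target, column(other))
        num += sim * value
        den += abs(sim)
    return num / den if den else 0.0
```

Pearson similarity could be substituted for `cosine` by centering each item vector on its mean before computing the same ratio.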

3. Research Method

3.1. Post Data Crawling of a Naver Blog

A web crawler is “an internet bot (i.e., spiders, web agent, worm) that systematically browses the World Wide Web, typically for the purpose of Web indexing.” It supports universal search engines such as Google, Yahoo, MSN, Windows Live, ask, Bing, etc. There are various types of web crawlers and their architectures are shown in Figure 3. They are used to fetch relevant pages from the web. One type of crawler, the preferential crawler, is selectively biased towards the most relevant pages, or the pages with the largest PageRank.
Preferential crawlers can generally be divided into two parts: focused crawlers and topical crawlers. A focused crawler is a web crawler that collects Web pages that satisfy some specific property by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Topical crawlers are increasingly used to address the scalability limitations of universal search engines by distributing the crawling process across users, queries, or even client computers.
Naver is one of Korea’s leading search services, and it can be used to manage online cafes and personal blogs to form a community. Many influencers create personal blogs on Naver and run blogs containing promotional articles and marketing articles for each item. As a result, in this paper, Naver blog post data crawling was carried out in addition to the preferential crawler to collect influencer advertisements and general reviews. The crawling tool used Python’s selenium, requests, and beautifulsoup, and crawled the information contained in Naver blog posts about a specific item. The crawled data included the Naver blog body text, post title, number of likes, number of comments, number of images, number of videos, average daily number of visitors, presence of advertisements, and nickname. To determine the presence of advertisements, the crawler checked whether the image at the bottom of the post had an advertisement URL; if it did not, the text in the image was extracted by optical character recognition (OCR) and searched for keywords such as “sponsorship”, “provided”, “support”, and “manuscript fee”. Naver blog stipulates that, if a post is sponsored, a notification should be placed at the bottom of the sponsored post. As a result, it was possible to identify advertisements with high accuracy. We also developed a crawler that efficiently fetches all blog posts from an influencer, as shown in Algorithm 1.
Algorithm 1 Naver Blog Crawler
INPUT: product name
save search url as baseurl + product name
WHILE NOT end of the page
 page scroll down
END WHILE
get url link from each block
FOR url IN url list
 open each url
 make dataframe
 set column ‘Title’, ‘Blogger’, ‘Post URL’, ‘Post’, ‘Image_num’, ‘Paragraph num’, ‘Comment num’, ‘Video num’, ‘weekly viewer mean’, ‘Sympathy num’, ‘AD’, ‘Posting Date’
 scrape each element according to column
 make keyword list that implies AD
 use Tesseract OCR to recognize characters in images
 IF keyword in Post or in Tesseract output
  post is AD
 END IF
END FOR
save dataframe into csv file
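The advertisement check at the end of Algorithm 1 can be sketched in a few lines. The keyword tuple below uses the English renderings given in the text; the actual keyword list applied to the Korean posts is not specified here, so both `AD_KEYWORDS` and the function name `is_advertisement` are assumptions. The OCR text is taken as a plain string argument, so no OCR library is required to run the example.

```python
# English renderings of the keywords named in the text (assumed, not the real list).
AD_KEYWORDS = ("sponsorship", "provided", "support", "manuscript fee")


def is_advertisement(post_text, ocr_text="", keywords=AD_KEYWORDS):
    """Flag a post as an ad if any keyword occurs in its body or in the
    text recognized from its images."""
    combined = f"{post_text}\n{ocr_text}".lower()
    return any(keyword.lower() in combined for keyword in keywords)
```

Plain substring matching is deliberately simple and can over-match (for example, "support" also matches "supportive"); a tokenized comparison would be stricter.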

3.2. Survey (i.e., Human Evaluation)

To determine the characteristics of articles featuring products that people want to purchase, a survey was conducted on 100 blog posts. The survey had 33 participants, including 30 of Mecha Solution’s employees. The 100 blog posts were divided into 10 groups, and each person read and responded to the 10 articles in one group. The response to each post was either yes or no, and a post earned 1 point for each “yes” response. Because 33 participants were spread over 10 groups, each post was read by three or four people, so the 100 blog posts were scored on a scale of 0–3 or 0–4. As a result of the survey, 7 of the 100 blog posts received a perfect score, and 32 received 0 points. In Table 1, the basic characteristics of the articles were compared and analyzed for articles with a positive response rate of 66% or greater and those with less than 25%. Images, stickers, and gifs, which are considered elements of readability, tended to be used more in well-received posts, and the number of characters per image was lower. To analyze the causes of the above results, we labeled the use of videos rather than the number of videos, as shown in Table 1. It is true that the use of videos improves readability; however, excessive use can decrease readability.
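The grouping of posts by positive response rate can be illustrated with a short sketch. The thresholds 0.66 and 0.25 come from the text; the data layout (a dictionary from post identifier to a list of yes/no answers) and the function names are hypothetical.

```python
def positive_rate(responses):
    """Share of "yes" answers (True) among the readers of one post."""
    return sum(responses) / len(responses) if responses else 0.0


def split_by_rate(post_responses, high=0.66, low=0.25):
    """Return (well-received posts, poorly received posts) by response rate."""
    good, poor = [], []
    for post_id, responses in post_responses.items():
        rate = positive_rate(responses)
        if rate >= high:
            good.append(post_id)
        elif rate < low:
            poor.append(post_id)
    return good, poor
```

Posts whose rate falls between the two thresholds belong to neither group and are left out of the comparison.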

3.3. Click Data

Links were provided to influencers hired by Mecha Solution to collect data on the number of clicks that a specific link from a specific influencer received. There was a significant correlation among the number of clicks, number of comments, number of likes (“sympathy” count), total number of visitors, number of neighbors, and daily average number of visitors. Therefore, we also use the click data to verify the most efficient influencer, together with the data stored in Mecha Solution’s own shopping mall in Korea and influencer-related information on Internet platforms such as Naver and Coupang regarding purchase conversion rates (PCR).

3.4. Review Data Crawling and Labeling of Web Shopping Mall (i.e., Coupang and Naver Smartstore)

Coupang, one of Korea’s leading shopping malls, was founded in 2010 as an e-commerce company based in Seoul, South Korea, and incorporated in Delaware, United States. It has a large number of users and uses policies such as the lowest price challenge, early morning delivery, and a personalized recommendation system. Naver SmartStore, which is part of the Naver site, is likewise a representative shopping mall and the largest online marketplace in South Korea. As a result, in this paper, we collected and utilized the review (comment) data of Coupang and Naver SmartStore at a large scale to indirectly analyze user satisfaction with items, average users’ opinions, and average real-time trends. Each review assigns a rating from 1 to 5. However, only reviews with 1 and 5 points were crawled, excluding those with 2, 3, and 4 points. By comparing, analyzing, and classifying the reviews at the extremes of 1 and 5 points, we tried to identify the differences in people’s writing when they were satisfied or disappointed with the product. The ratings can also be used for automatic labeling (i.e., assigning a target to each sample in the big data) for supervised machine learning algorithms.
Reviews were also analyzed with LSTM deep learning as a form of language modeling. If we look closely at the reviews on the blog, a few comments contain content that has nothing to do with the text or the product. Typically, we can find comments aiming to promote the commenter’s own blog or comments that only say hello, without mentioning the product. It is difficult to say that these types of comments express interest in the products, and problems arise when the number of comments is applied as an evaluation factor to measure the completeness of the text. If comments are not classified, it becomes difficult to count the number of real consumer responses. Therefore, by classifying real and fake comments, the number of real responses to blog posts can be counted. However, since the collected reviews are not labeled, the labeling process was constructed and carried out as shown in Figure 4.
First, the comments were labeled using a positive-negative classifier. However, to distinguish real comments from fake comments as pursued in this study, additional labeling work was required beyond positive and negative classification. Therefore, new labeling criteria were defined for this purpose. Finally, labeling was carried out using code that implements the new criteria, followed by a manual correction of the labels.
The positive-negative classifier was an LSTM model trained on movie reviews from the same portal site (Naver). With this classifier, comments were first classified as positive or negative. Next, to extend the classification to real and fake comments, criteria for each class were established, taking the relationship with the product as the fundamental criterion. Real comments were defined as product-related comments: comments containing product-related content, such as price inquiries and product questions, or comments whose focus on the product could be inferred from the text were classified as real. Fake comments were defined as non-product-related comments: it is difficult to infer interest in the product from event comments, promotional comments, expressions of dissatisfaction with price, friendly exchanges between commenters, simple greetings, and the like. If a comment had no content about the product, we categorized it as fake; examples of the labeling are shown in Figure 5 and Figure 6. After keywords matching each criterion were selected and comments were labeled by code, manual labeling was carried out. For keyword selection, the main words that appeared for each characteristic were chosen, and data left unlabeled by the code were labeled after additional review during the manual labeling process. Table 2 presents the labeled training data obtained once all labeling processes had been completed, for the case study of the jewelry cross-stitch product line.
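The code-based labeling step can be sketched as a keyword lookup that leaves unmatched comments for manual review. The keyword tuples below are hypothetical English placeholders chosen to mirror the criteria in the text; the actual keyword lists were selected from the Korean data and are not reproduced here.

```python
# Hypothetical keywords; the real lists were chosen from the collected comments.
REAL_KEYWORDS = ("how much", "what size", "where can i buy", "does it come in")
FAKE_KEYWORDS = ("visit my blog", "hello", "event", "follow me")


def label_comment(text):
    """Return "real", "fake", or None when no keyword matches
    (None means the comment is left for manual labeling)."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in REAL_KEYWORDS):
        return "real"
    if any(keyword in lowered for keyword in FAKE_KEYWORDS):
        return "fake"
    return None
```

Checking product-related keywords first means that a greeting followed by a product question is still labeled as real, which matches the criterion that any product-related content makes a comment real.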

3.5. Review Data Crawling (Oliveyoung)

Olive Young is a health and beauty store operated by CJ Group. It is the largest health and beauty brand in Korea, selling cosmetics, snacks, health foods, and household goods through both online and offline channels. Olive Young conducts promotional and experience group review events with various brands, and has an influencer system on its online platform to promote its review community. The product is provided free of charge to the experience group for the purpose of writing a review. In this paper, we collected a large amount of publicly available review data from Olive Young and analyzed it based on the presence or absence of advertisements (experience groups) and the rating of reviews. As a result, we were able to identify the characteristics of reviews based on each criterion.
Before conducting a comparative analysis of advertisement reviews (the experience group) and non-ad reviews (direct reviews), we performed data selection. Not all products underwent review events, and the number of reviews differs for each product; therefore, we only selected reviews for the same products. Accordingly, out of 1,709,273 reviews covering a total of 22,157 products, review data for 1535 products were selected; as a result, 27,898 ad reviews (the experience group) and 39,089 non-ad reviews (direct reviews) were used.
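One way to read this selection step is to keep only products that have both ad reviews and direct reviews. The sketch below implements that reading; the tuple layout `(product_id, is_ad, text)` and the function name are assumptions, and the exact selection rule used for the 1535 products may differ.

```python
def select_comparable_reviews(reviews):
    """Keep reviews of products that have both ad and direct reviews.

    reviews: iterable of (product_id, is_ad, text) tuples.
    Returns (ad_reviews, direct_reviews).
    """
    reviews = list(reviews)
    ad_products = {pid for pid, is_ad, _ in reviews if is_ad}
    direct_products = {pid for pid, is_ad, _ in reviews if not is_ad}
    shared = ad_products & direct_products
    ad = [r for r in reviews if r[0] in shared and r[1]]
    direct = [r for r in reviews if r[0] in shared and not r[1]]
    return ad, direct
```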

3.6. TF-IDF Analysis

The word frequency and TF-IDF were analyzed for cases A, B, C, and D, as shown in Algorithm 2. In A, the TF-IDF was obtained under the assumption that advertising and non-advertising posts were written by influencers and general users, respectively. Comparing the TF-IDF of articles written by influencers with that of articles written by general users revealed one characteristic: the posts of influencers had higher TF-IDF scores for Korean sentence endings such as ‘더라구요’, ‘라구요’, and ‘네요’, whereas the writings of general users had higher scores for endings such as ‘습니다’ and ‘-다’. The influencers’ writing had a friendlier tone, and general users’ writing had a more rigid tone. In B, the TF-IDF of blog posts with a perfect score in the Mecha Solution survey was compared with that of blog posts with a score of 0. Positive words such as “pretty”, “interesting”, and “awesome” showed high TF-IDF scores in the posts with perfect scores. The top 25 words by TF-IDF are shown as bar charts in Figure 7 and Figure 8.
TF-IDF analysis was also conducted on 52,507 review articles with a rating of 1 and on 29,288 review articles with a rating of 5. Excluding words that appeared on both sides, negative words such as “no” and “not” appeared in articles with a rating of 1, whereas articles with a rating of 5 showed positive words such as “pretty” and “good”. The top 25 words by TF-IDF are shown in the bar charts of Figure 9 and Figure 10. Unlike in the blog analysis, customers’ negative opinions and their rationales, such as “불량” (“faulty” in English), are distinct in the review articles.
Algorithm 2 TF-IDF
INPUT: D ← blog posts
   TF ← term frequency dictionary, initially empty
   DF ← document frequency dictionary, initialized to all 0
for each post in D do
 tokenize post
 for token in tokenized post do
  if length of token < 1 then
   delete token
  end if
 end for
 stem the remaining tokens
end for
for tokenized post in D do
 for token in tokenized post do
  if token not in TF then
   TF[token] ← 1
  else
   TF[token] ← TF[token] + 1
  end if
 end for
end for
for term in TF do
 for tokenized post in D do
  if term in tokenized post then
   DF[term] ← DF[term] + 1
  end if
 end for
end for
get TF-IDF from TF and DF
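The steps of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in the study: it assumes the posts are already tokenized and stemmed, and it assumes the common weighting tf × log(N/df), since the exact TF-IDF variant is not stated. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(tokenized_posts):
    """Return one {term: tf-idf} dictionary per post.

    Assumption: each post is a list of tokens. Empty tokens are dropped,
    mirroring the length filter in Algorithm 2.
    """
    posts = [[t for t in post if len(t) >= 1] for post in tokenized_posts]
    n_posts = len(posts)

    # Document frequency: number of posts that contain each term.
    df = Counter()
    for post in posts:
        df.update(set(post))

    scores = []
    for post in posts:
        tf = Counter(post)       # raw term counts within this post
        total = len(post) or 1   # guard against empty posts
        scores.append({
            term: (count / total) * math.log(n_posts / df[term])
            for term, count in tf.items()
        })
    return scores


posts = [["pretty", "gift", "gift"], ["gift", "faulty"]]
scores = tf_idf(posts)
```

With this weighting, a term that occurs in every post (here "gift") receives a score of zero, which is consistent with the step in the analysis that discards words appearing on both sides of a comparison.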
In the case of the Olive Young review data, TF-IDF analysis also showed that the positive and negative ratings as well as the evidence for each rating were clear. A TF-IDF analysis was conducted for 8819 reviews with a rating of 1 and 902,339 reviews with a rating of 5. After extracting the top 100 words based on TF-IDF scores and removing words that appeared on both sides, the results were as shown in Figure 11 and Figure 12. In 1-point reviews, there were many negative opinions such as “별로 (don’t like)”, “모르겠어요 (don’t know)” and “다시는 (ever again)”; however, in 5-point reviews, there were many positive opinions expressed by words such as “좋아요 (like)” and “가성비 (cost-effectiveness)”.

3.7. Word Frequency Analysis

Based on the data collected by the crawler for each site, the word frequency was also investigated using MeCab, and a word cloud was created and analyzed for comparison, as shown in Algorithm 3. The MeCab morpheme analyzer was used to recognize slang words in the review that appeared in large numbers. MeCab can add a custom dictionary to recognize words that are not in the official dictionary.
In the Naver Smart Store and Coupang review articles about “jewel cross stitch”, words such as “shipping”, “purchase”, “order”, and “gift” were characteristic of internet shopping malls, as well as “beads”, “frames”, and “pictures”. Many words were highly related, as shown in Figure 13 and Figure 14.
Overall, texts without advertisements had fewer words. People who write direct purchase reviews write only what they want to say without a format, because the purpose is not to promote. Alternatively, people who write publicity articles through advertisements write articles for the purpose of promoting products. Therefore, the number of characters used to explain the product in the format of a promotional article is relatively large. However, in the Naver blog, there was not much difference in word frequency ranking between blogs with and without advertisements, as shown in Figure 15 and Figure 16.
Therefore, the difference in word frequency was calculated by counting the number of occurrences of each word per blog, as shown in Figure 17 and Figure 18. In the direct purchase articles, as shown in Figure 18, many words were directly related to the product, as in the smart store and Coupang reviews (e.g., Daiso, order, Disney, purchase, picture frame, cubic, sunflower, gift, cross stitch, jewelry, size, picture), or to real usage and reviews. In contrast, in advertisements, as shown in Figure 17, many words did not specify the product, such as emotional words that express feelings or words not directly related to the product (e.g., love, life, children, use, hobbies, recommendations, methods, product). The direct purchase articles were concise, and they contained more product-related words than emotional words. The words shown in Figure 18 can be attributed to Mecha Solution’s influencers writing blogs that aim to resemble direct purchase reviews (i.e., real reviews or direct reviews) as much as possible.
Similarly, when analyzing Olive Young’s review data, advert reviews (experience group) used an average of 178 words, while non-ad reviews (direct reviews) used an average of only 86 words. As in the Naver blog, there was no large difference in word frequency ranking, as shown in Figure 19 and Figure 20. However, keywords related to “purchase”, such as “구입 (buy)”, “구매 (purchase)”, “세일 (discount)”, and “선물 (gift)”, had a higher frequency in non-ad reviews, because advert reviewers received the products for free.
Algorithm 3 Text Word Frequency
read csv file
FOR post IN posts of csv file
 use konlpy to extract the words of the post
 count each word’s frequency
 keep the most common words
 get font of each post
END FOR
make a histogram of word frequency
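The counting part of Algorithm 3 can be sketched as follows. This is an illustrative sketch under stated assumptions: the CSV column name `post` is hypothetical, and whitespace splitting stands in for the MeCab morpheme analysis used in the study, so that the example runs without a Korean tokenizer installed. A different tokenizer can be passed through the `tokenize` parameter.

```python
import csv
import io
from collections import Counter


def word_frequencies(csv_text, column="post", top_n=3, tokenize=str.split):
    """Count words over all posts in a CSV and return the top_n most common.

    Assumptions: `column` names the text column, and `tokenize` maps a
    string to a list of words (whitespace split by default).
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts.update(tokenize(row[column]))
    return counts.most_common(top_n)


sample = "post\nfast shipping nice gift\nnice gift for kids\ngift arrived fast\n"
top_words = word_frequencies(sample)
```

The returned list of (word, count) pairs is the input a histogram or word cloud would be drawn from.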

3.8. Noun-Josa Ratio and Word Count Analysis

There was a significant difference in the noun/josa ratio between direct reviews and advertisements. When the noun/josa ratio was 2.6 or greater, the ratio of direct reviews was 83%. In Figure 21, the red dots are direct reviews, and the blue dots are advertisements. It was possible to confirm that there were many direct review articles in the areas marked in yellow. Therefore, the noun/josa ratio was used as a standard to measure the degree of similarity with the direct review article. It was also possible to confirm the difference between the advertisement article and direct review article using the total number of words, as shown in Figure 22. In the advertisement article, the length of the article was, on average, long. However, the article length was short in direct review articles. When writing advertisements, influencers maintain the format of an article and write reviews by inflating the content to emphasize the strengths of the product, whereas direct reviews describe only the author’s honest feelings about the product and contain the required content of the article. This means that direct review articles are shorter because of their lack of format. There was a difference in the total number of words in advertisements and direct reviews; therefore, this was used as a standard to measure the degree of similarity between advertisements and direct reviews, as shown in Figure 22.
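The noun/josa criterion can be illustrated with a short sketch. It assumes that each post has already been part-of-speech tagged with a Sejong-style tag set, in which noun tags start with "NN" and josa (particle) tags start with "J"; the function names are hypothetical. The 2.6 threshold is the value reported above and acts only as a heuristic, since 83% rather than all of the posts above it were direct reviews.

```python
def noun_josa_ratio(tagged_tokens):
    """Ratio of noun count to josa (particle) count in a POS-tagged post.

    Assumption: tags follow a Sejong-style tag set (NNG, NNP, ... for
    nouns; JKS, JKO, JX, ... for josa).
    """
    nouns = sum(1 for _, tag in tagged_tokens if tag.startswith("NN"))
    josa = sum(1 for _, tag in tagged_tokens if tag.startswith("J"))
    # A post without any josa is treated as having an unbounded ratio.
    return nouns / josa if josa else float("inf")


def looks_like_direct_review(tagged_tokens, threshold=2.6):
    """Heuristic check based on the 2.6 threshold reported in the text."""
    return noun_josa_ratio(tagged_tokens) >= threshold


tagged = [("액자", "NNG"), ("선물", "NNG"), ("주문", "NNG"), ("을", "JKO"), ("하", "VV")]
ratio = noun_josa_ratio(tagged)
```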

4. The Proposed Scheme

4.1. Framework

This section describes the methods used to implement an influencer recommender system. The structure of the methods is given in Figure 23.
First, we conducted quantitative analysis and qualitative analysis. In the quantitative analysis, we analyzed influencer posts or articles to obtain a text score and statistical information score. In qualitative analysis, we used LSTM to detect reliable comments. One of the detection criteria differs based on whether the product is a non-IT product or an IT product. In the case of non-IT products, subjective adjectives are used more than in that of IT, and in the case of IT products, objective adjectives are used more than in that of non-IT. Taking this into consideration, we detected meaningful comments.
After obtaining influencer scores based on quantitative and qualitative analysis, hybrid filtering was used with content-based filtering and collaborative filtering.

4.2. Quantitative Analysis

4.2.1. Text Score (Readability)

To measure how well an influencer writes, the concept of a text score was introduced. Two factors were used to calculate the text score: whether the article resembled a direct review and whether the article had good readability. First, to measure the resemblance to direct reviews, the word frequency for each item, TF-IDF, noun/josa ratio, and total number of letters were considered. To obtain the word frequency score, the frequency with which each word occurred in direct reviews was divided by its frequency in advertisements. The words were then divided into the top 20 words, which appeared more frequently in direct reviews than in advertisements, and the bottom 20 words, which appeared less frequently. The word frequency score was obtained by dividing the frequency of occurrence of the top 20 words in the blog post by the frequency of occurrence of the bottom 20 words. For the TF-IDF score, the average TF-IDF of the advertisement posts and that of the direct review posts were calculated separately, and the score was obtained by subtracting the corresponding TF-IDF values. For the noun/josa ratio, direct reviews tended to have a greater noun ratio than advertisements. To exclude outliers, standardization was used rather than normalization, and a good score was given when the standardized value was between 0 and 1.96 inclusive. The total number of letters was higher in advertisements than in direct reviews; therefore, the total character score was defined as 1 minus the normalized number of words.
For articles with good readability, the score was calculated by considering the number of images, videos, GIFs, stickers, and characters per image based on the survey of Table 3. Images, videos, GIFs, and stickers were assigned a score according to this number, and the number of characters per image was standardized. The lower the score, the better.
Text score was finally obtained by adding all the features of articles that were close to the direct reviews and the features of articles with good readability, normalized from 0 to 1, respectively, as shown in Table 3. The formula for the text score is as follows:
text score = word frequency score × 5 + TF-IDF score × 15 + noun/josa ratio score × 5 + total character score × 10 + image score × 10 + video score × 10 + GIF score × 20 + sticker score × 15 + characters per image score × 10.
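The weighted sum can be sketched in a few lines. The sketch below uses the ratios of Table 3, which are the weights of the formula divided by 100, so the result lies between 0 and 1 when every feature score is already normalized to that range. The dictionary keys and the function name are illustrative, not names used in the study.

```python
# Ratios from Table 3; the keys are illustrative feature names.
TEXT_SCORE_WEIGHTS = {
    "word_frequency": 0.05,
    "tf_idf": 0.15,
    "noun_josa": 0.05,
    "word_count": 0.10,
    "image": 0.10,
    "video": 0.10,
    "gif": 0.20,
    "sticker": 0.15,
    "chars_per_image": 0.10,
}


def text_score(features):
    """Weighted sum of feature scores, each assumed to lie in [0, 1]."""
    missing = set(TEXT_SCORE_WEIGHTS) - set(features)
    if missing:
        raise ValueError(f"missing feature scores: {sorted(missing)}")
    return sum(w * features[name] for name, w in TEXT_SCORE_WEIGHTS.items())


example = dict.fromkeys(TEXT_SCORE_WEIGHTS, 0.5)
example["gif"] = 1.0  # a post that scores fully on the GIF feature
score = text_score(example)
```

Multiplying the result by 100 recovers the scale of the formula above.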

4.2.2. Statistical Information Score (Influencer Power)

Regardless of how well a person writes, a blogger with little name recognition will generate few actual purchases even with a high conversion rate, because not many visitors view the post. To address this problem, each blog post was scored by adding an index representing the impact of the influencer to the text score of the post. As candidates for the statistical information score, the impact of the influencer included the number of comments, number of likes, average daily number of visitors, and number of neighbors. The average daily number of visitors was selected as the indicator of the influencer’s impact.

4.3. Qualitative Analysis

To detect the optimal influencer for each IT-related and non-IT-related category, a qualitative analysis was performed, in addition to the quantitative analysis, on the representative expressions mainly used for IT-related and non-IT-related products, as shown in Figure 24, Figure 25 and Figure 26. In this paper, it was intuitively assumed that IT texts contain many objective adjectives and non-IT texts contain many subjective adjectives, as shown in Figure 26.
Comment data consist of a sequence of words with a relatively long sequence length; important words that determine output values may be included at the beginning of the sequence. When using RNN, input values at the beginning of the sequence are lost, making accurate prediction difficult. Therefore, instead of RNN, which easily forgets the information of the initial input value, an LSTM model that solves the long-term dependency problem was used to detect influencers with highly reliable comments that are related to or empathize with the body of the post. As a result, the proposed technique calculates the influencer score from all of these factors together, so that, to maximize the purchase conversion rate, the posts written by the influencer are as readable as possible, the influencer is active and highly influential, and each post has more sincere comments than formulaic ones.

4.4. Influencer Recommender System

If the influencer score of quantitative and qualitative analysis is given for each item for all of the influencer’s posts, comments, and statistical information, and the score is averaged, the influencer system can obtain an influencer score for each item. If we repeat the same operation for all supported influencers, influencer–item matrix data can be created for the influencer score, as shown in Figure 27. Because this matrix has a very high probability of being a sparse matrix, a hybrid filtering combining content-based filtering and collaborative filtering is applied, as shown in Figure 28. In content-based filtering, approximately 20 properties are created for each item, and if the cosine similarity of each property is high, it is judged to be a similar product. The products judged to be similar are sorted in descending order of similarity, and for the five most similar products, if the corresponding column is empty, the value obtained by multiplying the similarity by the score of the existing product is added. However, if the values are duplicates, the average value is returned. In this way, collaborative filtering was used with content-based filtering after reducing the sparsity. Collaborative filtering is a user-based collaborative filtering method designed to predict the score of the blanks based on five influencers with high similarity to a specific influencer.
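The two filtering stages can be sketched as follows. This is one possible reading of the description above rather than the study's implementation: item properties are plain numeric vectors, a blank cell is filled with similarity × score taken from the influencer's already scored items, several candidate values for the same cell are averaged, and remaining blanks are predicted as a similarity-weighted average over the most similar influencers. The weighting in the second stage and all names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_fill(scores, item_props, k=5):
    """Content-based stage: fill blanks from the k items most similar to each scored item."""
    filled = {inf: dict(row) for inf, row in scores.items()}
    for inf, row in scores.items():
        candidates = {}
        for item, score in row.items():
            sims = sorted(
                ((cosine(item_props[item], props), other)
                 for other, props in item_props.items() if other != item),
                reverse=True,
            )[:k]
            for sim, other in sims:
                if other not in row and sim > 0:
                    candidates.setdefault(other, []).append(sim * score)
        for other, values in candidates.items():
            filled[inf][other] = sum(values) / len(values)  # average duplicates
    return filled


def user_based_predict(scores, target, item, k=5):
    """User-based stage: predict a blank from the k most similar influencers."""
    items = sorted({i for row in scores.values() for i in row})

    def vec(row):
        return [row.get(i, 0.0) for i in items]

    neighbours = sorted(
        ((cosine(vec(scores[target]), vec(row)), row[item])
         for inf, row in scores.items() if inf != target and item in row),
        reverse=True,
    )[:k]
    weight = sum(sim for sim, _ in neighbours)
    return sum(sim * s for sim, s in neighbours) / weight if weight else None


item_props = {"beads": [1.0, 0.0], "frame": [1.0, 0.0], "lotion": [0.0, 1.0]}
scores = {"inf_a": {"beads": 0.8}, "inf_b": {"beads": 0.6, "lotion": 0.4}}
filled = content_fill(scores, item_props)
predicted = user_based_predict(filled, "inf_a", "lotion")
```

In this toy matrix, the content-based stage fills "frame" for both influencers because it shares its properties with "beads", and the user-based stage then predicts the remaining blank for `inf_a` from `inf_b`.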
Finally, using the purchase conversion rate (PCR) of Equation (3) with the terms defined in Table 4, we can obtain the top 10 influencer recommendations for each item, as shown in Figure 29.
PCR = x ∗ α + y ∗ β (α + β = 1),
y = f1(x) ∗ I1(x) + f2(x) ∗ I2(x)
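As a worked example of Equation (3), the sketch below evaluates the PCR for a single post using the terms of Table 4. The values of α and β are not reported, so the default `alpha=0.5` is purely an assumption, and the function and parameter names are illustrative.

```python
def purchase_conversion_rate(x, objective_ratio, subjective_ratio, is_it, alpha=0.5):
    """PCR = x * alpha + y * beta, with alpha + beta = 1.

    x                 quantitative score (0 to 1)
    objective_ratio   f1(x), score from the ratio of objective sentences
    subjective_ratio  f2(x), score from the ratio of subjective sentences
    is_it             True for an IT-related post (I1 = 1, I2 = 0)
    alpha             weight of x; 0.5 is an assumed value
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    i1, i2 = (1, 0) if is_it else (0, 1)
    y = objective_ratio * i1 + subjective_ratio * i2
    return x * alpha + y * beta


pcr_it = purchase_conversion_rate(0.8, 0.6, 0.3, is_it=True)
pcr_non_it = purchase_conversion_rate(0.8, 0.6, 0.3, is_it=False)
```

Because exactly one of the indicators I1 and I2 equals 1, only the objective score contributes to y for an IT-related post and only the subjective score for a non-IT-related post.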

5. Results

The designed system can be used to predict the score of each item for a new influencer, and through this, the influencer with the highest score for the item can be selected. For example, if we plan a campaign to promote a product and receive support from influencers, we can create an influencer list for the identified influencers. After calculating the influencer score for the new influencer, the item–influencer matrix is completed. After that, the hybrid recommendation system that combines content-based filtering and collaborative filtering methods is applied to fill the matrix and identify the influencer with the highest score regarding the item to be promoted. However, no matter how high the influencer score, if the influencer is a person without cumulative influence based on their statistical information, the probability of exposure to consumers may decrease. In this case, the influencer score for the items of the influencer and the popularity of the influencer can be checked in the matrix through the average number of daily visits. No matter how high the influencer score, if the daily average number of visitors is significantly lower than that of other influencers, their impact is low and they can be excluded from the selection.

6. Conclusions

6.1. Contribution

In this study, we proposed an influencer recommendation system model that considers the PCR for a specific item by quantitatively and qualitatively analyzing posts, comments, and reviews from various e-commerce platforms. The strength of the proposed model is its consideration of both the domain and product category while also considering PCR. In addition, we reviewed the literature on the elements of the text that affect purchase, verified whether they can be applied to actual blog posts and reviews of e-commerce, and reflected these data in the calculation formula. Lastly, we addressed the issue of fake comments by using an LSTM-based deep learning model to detect reliable comments instead of simply relying on all comments.

6.2. Limitations and Ethical Considerations

In this study, we mainly focused on influencers who use text, such as blogs and reviews. However, this has a limitation in that the study could not include influencers who use photos and videos on platforms such as YouTube or TikTok.
There are also limitations due to the available data. As domestic data were used, our model may have limited usability in overseas markets. Additionally, there may be differences in payment between micro-influencers, which is an important factor to consider when choosing an influencer, but we could not consider this as there were no data on this aspect.
Furthermore, there is a limitation regarding the micro-influencer system itself. When there is an excessive number of advertising posts for a particular product written by influencers, this may have negative effects on consumers. Moreover, if a specific influencer encounters trouble or controversy unrelated to the product, such as privacy problems, managing the risks associated with that influencer may be difficult. These aspects can be considered limitations to influencer marketing.

6.3. Future Plan

Despite these limitations and considerations, micro-influencers are currently considered an important tool in the field of marketing, and proposing influencer recommendation systems is meaningful.
In a future study, we plan to propose recommendation systems that utilize a wider range of multi-modal data sources. In this study, only text was analyzed, and only the numbers of images, GIFs, and videos were used to determine text scores. However, in the future, scores based on the specific characteristics of each data type could be established in order to obtain a more detailed understanding of consumer preferences.

Author Contributions

Methodology, J.-S.L.; Formal analysis, J.L., S.-M.K. and S.L.; Resources, D.J.; Supervision, H.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2022R1F1A1074696).

Data Availability Statement

The data presented in this study are available from the corresponding author, upon reasonable request. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Huynh, T.; Nguyen, H.; Zelinka, I.; Dinh, D.; Pham, X.H. Detecting the Influencer on Social Networks Using Passion Point and Measures of Information Propagation. Sustainability 2020, 12, 3064. [Google Scholar] [CrossRef] [Green Version]
  2. Quan, T.T.; Mai, D.T.; Tran, T.D. CID: Categorical Influencer Detection on microtext-based social media. Online Inf. Rev. 2020, 44, 1027–1055. [Google Scholar] [CrossRef]
  3. Zheng, C.; Zhang, Q.; Long, G.; Zhang, C.; Young, S.D.; Wang, W. Measuring Time-Sensitive and Topic-Specific Influence in Social Networks with LSTM and Self-Attention. IEEE Access 2020, 8, 82481–82492. [Google Scholar] [CrossRef] [PubMed]
  4. Park, J.R.; Park, J.; Kim, M.; Oh, H. Influencer Attribute Analysis based Recommendation System. J. Korea Inst. Inf. Commun. Eng. 2019, 23, 1321–1329. [Google Scholar] [CrossRef]
  5. CHEN Qing-Yu, CHO Hyuk-So Analyzing Impacts of Country, Product and Influencer on Purchase Intention in the Chinese Cosmetics Market. Int. Commer. Inf. Rev. 2020, 22, 309–332. [CrossRef] [Green Version]
  6. Xu, Y.-H.; Huang, L.-F.; Guo, R.-R.; Zhang, X.-Y.; Zhu, J.-M. Metrological Analysis of Online Consumption Evaluation Influence Commodity Marketing Decision Based on Data Mining. Hindawi Math. Probl. Eng. 2020, 2020, 9345901. [Google Scholar] [CrossRef]
  7. Wu, S.; Wingate, N.; Wang, Z.; Liu, Q. The Influence of Fake Reviews on Consumer Perceptions of Risks and Purchase Intentions. J. Mark. Dev. Compet. 2019, 13. [Google Scholar] [CrossRef]
  8. Das, R.K.; Dash, S.S.; Das, K.; Panda, M. Detection of Spam in YouTube Comments Using Different Classifiers. In Advanced Computing and Intelligent Engineering; Springer: Singapore, 2020; pp. 201–214. [Google Scholar]
  9. Samsudin, N.M.; Foozy, C.F.M.; Alias, N.; Shamala, P.; Othman, N.F.; Din, W. Youtube spam detection framework usingnaïve bayes and logistic regression. Indones. J. Electr. Eng. Comput. Sci. 2019, 14, 1508. [Google Scholar] [CrossRef]
  10. Ezpeleta, E.; Iturbe, M.; Garitano, I.; de Mendizabal, I.V.; Zurutuza, U. A mood analysis on youtube comments and a methodfor improved social spam detection. In Proceedings of the Hybrid Artificial Intelligent Systems: 13th International Conference, HAIS 2018, Oviedo, Spain, 20–22 June 2018. [Google Scholar] [CrossRef]
  11. Hussain, N.; Turab Mirza, H.; Rasool, G.; Hussain, I.; Kaleem, M. Predilection decoded: Spam Review Detection Techniques: A Systematic Literature Review. Appl. Sci. 2019, 9, 987. [Google Scholar] [CrossRef] [Green Version]
  12. Cauteruccio, F.; Corradini, E.; Terracina, G.; Ursino, D.; Virgili, L. Extraction and analysis of text patterns from NSFW adult content in Reddit. Data Knowl. Eng. 2022, 138, 101979. [Google Scholar] [CrossRef]
Figure 1. Structure of an RNN.
Figure 2. LSTM structure.
Figure 3. Types of Web crawler.
Figure 4. Data labeling flow chart.
Figure 5. Positive reviews criteria.
Figure 6. Negative reviews criteria.
Figure 7. Survey max score TF-IDF.
Figure 8. Survey zero score TF-IDF.
Figure 9. Coupang 5-rating review.
Figure 10. Coupang 1-rating review.
Figure 11. Olive Young 1-point reviews.
Figure 12. Olive Young 5-point reviews.
Figure 13. Naver Smart Store word cloud.
Figure 14. Coupang word cloud.
Figure 15. Ad blog.
Figure 16. Non-ad blog.
Figure 17. Ad blog calculation.
Figure 18. Non-ad blog calculation.
Figure 19. Ad review (experience group).
Figure 20. Non-ad review (direct review).
Figure 21. Noun/josa ratio.
Figure 22. Word count.
Figure 23. Framework.
Figure 24. IT adjective ratio.
Figure 25. Non-IT adjective ratio.
Figure 26. IT and Non-IT adjective ratio comparison.
Figure 27. A sparse influencer–item matrix. Part of the ID has been masked using ***.
Figure 28. Influencer–item matrix. Part of the ID has been masked using ***.
Figure 29. Top 10 Influencer Recommendations by each item. Part of the ID has been masked using ***.
Table 1. High/Low Comparison.
| Feature | High/Low (w/Video) |
| --- | --- |
| Rate | 46.167 |
| Gif num | 4.2 |
| Weekly viewer mean | 1.693 |
| Sticker num | 1.571 |
| Image num | 1.545 |
| Sympathy num | 1.094 |
| Word count | 0.961 |
| Video num | 0.980 |
| Buddy num | 0.733 |
Table 2. Labeled training data.
|  | Comments (Korean Translated to English) | Label |
| --- | --- | --- |
| 0 | Wow~ After putting it together like this, it really is the end of the year!! It’s so cool, teacher ^^ (smile face emoticon) I feel like I’ve seen the exhibition without giving money. | 1 |
| 1 | Each work is full of sincerity. These are wonderful works. | 1 |
| 2 | Every time I look at the pretty doll house very well. I think I should express my gratitude once a year, so I left a little comment. | 1 |
| 3 | Miniatures are overflowing with imagination! I’ve been looking for some good info. Have a relaxing evening~~!! | 1 |
| 4 | It’s really not easy to upload every single day, but what a great writer!! The works are also very healing and good:) | 1 |
| 5 | This is a good post ^^ (smile face emoticon) Like it~ I hope you always have a happy time^^ I would appreciate it if you could visit my blog too haha | 0 |
| 6 | What about China? When it comes to Korea, it’s kimchi. | 0 |
Table 3. Text Score Calculation Ratio.
| Feature | Ratio |
| --- | --- |
| Word frequency score | 0.05 |
| TF-IDF score | 0.15 |
| Noun/josa score | 0.05 |
| Word count score | 0.10 |
| Image score | 0.10 |
| Video score | 0.10 |
| GIF score | 0.20 |
| Sticker score | 0.15 |
| Img word score | 0.10 |
Table 4. Terms for purchase conversion rate (PCR).
| Terms | Meaning |
| --- | --- |
| x | Values obtained through quantitative analysis (0~1) |
| y | Values obtained through qualitative analysis (0~1) |
| α | Ratio of x (α + β = 1) |
| β | Ratio of y (α + β = 1) |
| I1(x) | 1 if x is an IT-related article; 0 if it is a non-IT-related article |
| I2(x) | 1 if x is a non-IT-related post; 0 if it is an IT-related post |
| f1(x) | Value obtained according to the ratio of objective sentences (0~1) |
| f2(x) | Value obtained according to the ratio of subjective sentences (0~1) |
| PCR | Purchase Conversion Rate |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Oh, H.; Lee, J.; Lee, J.-S.; Kim, S.-M.; Lim, S.; Jung, D. Which Influencers Can Maximize PCR of E-Commerce? Electronics 2023, 12, 2626. https://doi.org/10.3390/electronics12122626

AMA Style

Oh H, Lee J, Lee J-S, Kim S-M, Lim S, Jung D. Which Influencers Can Maximize PCR of E-Commerce? Electronics. 2023; 12(12):2626. https://doi.org/10.3390/electronics12122626

Chicago/Turabian Style

Oh, Hayoung, Jiyoon Lee, Joo-Sik Lee, Sung-Min Kim, Sechang Lim, and Dongha Jung. 2023. "Which Influencers Can Maximize PCR of E-Commerce?" Electronics 12, no. 12: 2626. https://doi.org/10.3390/electronics12122626
