International Journal of Networked and Distributed Computing

Volume 9, Issue 1, January 2021, Pages 25 - 32

Featured Hybrid Recommendation System Using Stochastic Gradient Descent

Si Thin Nguyen1, Hyun Young Kwak1, Si Young Lee1, Gwang Yong Gim2, *
1Department of IT Policy and Management, Graduate School, Soongsil University, Seoul, Korea
2Department of Business Administration, Graduate School, Soongsil University, Seoul, Korea
*Corresponding author. Email:
Received 9 October 2020, Accepted 18 November 2020, Available Online 5 January 2021.
Recommendation system, stochastic gradient descent, matrix factorization, content-based, collaborative filtering, incremental learning

Besides cold-start and sparsity, developing incremental algorithms has emerged as an interesting research topic for recommendation systems in real-data environments. Research on hybrid systems remains insufficient because of the complexity of combining the different data sources of single models such as content-based or collaborative filtering, and stochastic gradient descent shows limitations regarding the optimization process in incremental learning. Motivated by these disadvantages, this study proposes a novel incremental algorithm for a featured hybrid system that combines the features of the content-based method with the robustness of matrix factorization in collaborative filtering. To evaluate the experiments, the authors also design an incremental evaluation approach for real data. The experimental results show that the featured hybrid system is a feasible direction for future research and that the proposed model achieves better results in both learning time and accuracy.

© 2021 The Authors. Published by Atlantis Press B.V.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (


A recommendation system is a branch of information filtering systems that aims to predict the “preference” or “rating” that a user would give to an item (such as music, movies, or books) or a social object (e.g., a group or people) that the user has not yet seen, using a model built from the user’s social relationships (collaborative filtering model) or the features of an item (content-based model) [1,2].

The continuous growth of the web and its applications has made recommendation systems increasingly important [3]. However, a recommendation system must deal with several main research problems, including cold-start, data sparsity, accuracy, and scalability [4].

Generally, there are three groups of recommendation systems: knowledge-based, Content-based (CB), and Collaborative Filtering (CF) [2]. These models use different sources of input data and, because of this difference, have different strengths and weaknesses. For instance, knowledge-based models can overcome cold-start problems better than CF or CB because they do not use ratings. Conversely, they are weaker than CF and CB at personalizing from historical data [5]. These different characteristics show that no recommendation model is one-size-fits-all [6]. From another point of view, the recommendation algorithms themselves are only one part of a system that delivers recommendations to users [7]. Such algorithms are efficient and simple; however, their performance is limited on different data, and it is not easy to further improve the accuracy of the recommendation system [8].

For this reason, hybrid systems are a promising solution to these problems. However, studies of hybrid systems are still inadequate.

In this study, the authors design a hybrid system that combines CB and incremental matrix factorization for the following reasons. First, CB and CF use different sources of data. Combining the textual descriptions of items in CB with the community ratings in matrix factorization is expected to improve accuracy during training. Second, almost all recommendation systems use numeric ratings, and incremental matrix factorization algorithms have been considered in many studies for their benefits. Third, while most recent research focuses on algorithms for the basic models, few studies address hybrid systems, and the specific properties of evaluation on real data remain unexplored [2].

In addition, this study aims to overcome issues related to big data processing. As mentioned above, the continuous growth of information, especially through the Internet, has led to information overload. Netflix stores over 17,000 movies in its data warehouse [9], and Amazon offers over 410,000 titles in its Kindle store. However, computing speed has grown more slowly than data sizes. Thus, the capabilities of statistical machine learning algorithms are limited by computing time rather than by data size [10].

To deal with this information overload, incremental learning with the Stochastic Gradient Descent (SGD) algorithm is an effective solution [11]. However, the inflexible learning rate of SGD requires a better approach [2]. For this reason, this study combines two mechanisms, k groups and momentum, into a novel algorithm called k Momentum SGD (KMSGD).

The final issue is how to evaluate in a real-data environment. While offline evaluation is suitable for gaining insight into how different algorithms perform and for comparing their performance, it is limited in various ways [12,13]. With real data, evaluation must be performed directly at every data point. This is not easy, and comparing algorithms through online evaluation is costly and makes it difficult to build a sufficient understanding of these relationships [14]. In this study, the authors present a novel approach to real-time evaluation that processes every data point separately.

In summary, the study has the following aims:

  • To enhance the traditional recommendation system with better prediction and improved accuracy, we design a featured hybrid system that combines the textual features of CB with the robustness of matrix factorization in CF.

  • To create a new incremental algorithm, KMSGD, that overcomes the limitations of the SGD algorithm.

  • To evaluate on stream data, we design a simple approach that processes every data point.

A comparison with other recommendation algorithms on MovieLens data verifies that our model effectively improves on the traditional models in both accuracy and learning time. The featured hybrid system is suitable for further development in the future.

The remainder of this study is organized as follows. The incremental learning algorithms section gives a broad overview of incremental learning, details the SGD algorithm, and then describes the idea behind KMSGD. The development of models in recommendation systems section traces the progression from the basic models and CF with matrix factorization to hybrid systems. The proposed model section gives a specific description of the model. The evaluation issues and data section presents the data and the evaluation, including the prequential evaluation protocol. The results and discussion section discusses the results. Finally, the conclusion section provides a brief conclusion and outlines future work.


2.1. Incremental Learning

Traditional machine learning methods offer powerful ways to extract structural information from raw digital data, but almost all recent applications are restricted to the batch setting: training is applied to all given data at once, so model selection and evaluation can be computed on the full data set, and the whole training process assumes that the data are static [2]. In contrast, incremental learning refers to continuous model adaptation based on a constantly arriving data stream [15]. This setting arises whenever systems operate autonomously, such as in autonomous homes or driving [16–19]. Moreover, online learning is necessary in interactive scenarios where training data depend on human feedback over time [19]. Finally, even static digital data sets can be so massive that they are de facto handled as a data stream, i.e., with one incremental pass over the full data [2]. Incremental learning addresses how to learn from such streaming data. It comes with different definitions in the literature, and the terminology is not always used consistently. Consequently, we give a meaningful definition of the terms online learning, concept drift, and incremental learning, paying particular attention to supervised learning [2].

In the literature, there are a few studies on incremental CF. Incremental neighbourhood-based CF algorithms are applied in [20,21]. For incremental matrix factorization, a first approach is designed in [22], where the authors apply the fold-in method [23] to incrementally update the factor matrices. An incremental learning algorithm for ranking that uses a selective sampling approach is designed in [24]. An incremental algorithm to update user factors is presented in [25], using a simple variant of the batch process. Two incremental algorithms using SGD are presented in [26]. This study differs from the above works. Starting from SGD, we adjust the loop rate and momentum. Then, we design a novel method to evaluate the evolving accuracy of the algorithm.

2.2. SGD Algorithm

2.2.1. The mechanism of SGD

Figure 1 presents the mechanism of SGD [2]. Given a training dataset of rows in the form 〈user, item, rating〉, SGD makes several passes (iterations) through the data until a stopping criterion is satisfied, generally a convergence bound and/or a maximum number of iterations [2]. At each iteration, SGD visits every observed rating $R_{ui}$ and updates the corresponding rows $W_u$ and $X_i^T$, moving them in the direction opposite to the gradient of the error, scaled by a factor η ≤ 1 known as the learning rate or step size. For each observed rating, the error is $err_{ui} = R_{ui} - \hat{R}_{ui}$, and the following updates are executed:

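$$
\begin{aligned}
W_u &\leftarrow W_u + \eta\,(err_{ui} \cdot X_i - \lambda W_u) \\
X_i &\leftarrow X_i + \eta\,(err_{ui} \cdot W_u - \lambda X_i)
\end{aligned}
\tag{1}
$$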

Equivalently, with ϕ = (W; X), SGD performs a parameter update for every training example x(i) and label y(i):

$$
\phi = \phi - \eta \cdot \nabla_{\phi} L(\phi;\, x^{(i)};\, y^{(i)}) \tag{2}
$$

Figure 1

Algorithm 1-SGD.

One major benefit of SGD in training is that its complexity grows linearly with the number of observed ratings, which takes advantage of the high sparsity of R [2].

Notably, after selecting a random training data point (i, j) ∈ D, SGD updates only Wi and Xj and does not update the factors for other training data points. This saving follows from representing the global loss as a sum of local losses [27].

Thus, SGD is also referred to as online learning or incremental gradient descent. In batched variants, multiple local losses are averaged; these are also applicable but usually perform worse in experiments [2].
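The update rule in Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `sgd_mf`, the Gaussian initialization, and the default values of `k`, `lam`, and `n_iters` are assumptions made for the example; only the step size 0.005 and the update itself come from the text.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=8, eta=0.005, lam=0.02, n_iters=50, seed=0):
    """Multi-pass SGD for matrix factorization over (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, size=(n_users, k))  # user factors, one row W_u per user
    X = rng.normal(0.0, 0.1, size=(n_items, k))  # item factors, one row X_i per item
    for _ in range(n_iters):
        # Visit the observed ratings in a fresh random order on every pass.
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - W[u] @ X[i]                     # err_ui = R_ui - predicted R_ui
            w_old = W[u].copy()                       # keep W_u from before its update
            W[u] += eta * (err * X[i] - lam * W[u])   # Equation (1), user row
            X[i] += eta * (err * w_old - lam * X[i])  # Equation (1), item row
    return W, X

# Hypothetical toy data: (user index, item index, rating).
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
W, X = sgd_mf(toy, n_users=3, n_items=3, k=4, eta=0.05, lam=0.01, n_iters=300)
```

Each update touches only one row of W and one row of X, which is why the cost of a pass scales with the number of observed ratings rather than with the size of R.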

2.2.2. The limitation of SGD

In SGD, the learning rate (or step size) is set to a very small value, e.g., α = 0.005. A more suitable way to avoid local minima and speed up convergence is to use the bold driver algorithm [28,29] to determine α in each iteration. Theoretically, it is also feasible to use different step sizes for different factors [30].
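The bold driver heuristic can be illustrated with a very small helper: after each iteration the step size grows slightly if the loss went down and is cut sharply if it went up. The function name and the factors 1.05 and 0.5 are illustrative assumptions, not values taken from this study.

```python
def bold_driver(alpha, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Return the step size to use in the next iteration."""
    if curr_loss < prev_loss:
        return alpha * grow   # loss improved: take slightly bolder steps
    return alpha * shrink     # loss got worse: reduce the step sharply

# Starting from alpha = 0.005, one good and then one bad iteration:
alpha = 0.005
alpha = bold_driver(alpha, prev_loss=1.00, curr_loss=0.90)
alpha = bold_driver(alpha, prev_loss=0.90, curr_loss=0.95)
```

In a training loop, the helper would be called once per pass with the training loss of the previous and the current pass.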

A notable property of these models is that running all iterations until convergence can slightly worsen the solution quality on the unobserved values. Therefore, it is usually reasonable not to set the convergence criteria too strictly [2].

Another key issue with SGD is initialization. For example, the factor matrices can be initialized with small values in (−1, 1). However, the choice of initialization can affect the final solution quality.

2.3. Proposed Algorithm – KMSGD

2.3.1. Group in mini batch

Mini-batch gradient descent is a variant of SGD that divides the training data into small groups, each of which is used to compute the model error and update the model coefficients. An implementation may sum the gradient over the mini-batch or take the average of the gradient, which further reduces the variance of the gradient [2].

The idea of the method is illustrated in Figure 2. Mini-batch gradient descent aims to balance the robustness of SGD with the efficiency of batch gradient descent. It is the most common form of gradient descent applied in machine learning.

Figure 2

Algorithm for KMSGD.

Figure 3 presents the proposed Algorithm 2 [2]. It differs from Algorithm 1 (SGD) in two ways. First, the learning process makes a single pass over the observed data. Notably, in Algorithm 2, at each data point 〈u, i〉, the adjustments to the factor matrices W and X are made in a single step. An alternative is to perform several iterations over each new data point, which increases accuracy at the cost of the additional time required to re-iterate [2]. Second, no data shuffling or other pre-processing is performed. Thus, we compute the error $err_{ui} = R_{ui} - \hat{R}_{ui}$ and update the rows of W and XT using the update in (3).

$$
\phi = \phi - \eta \cdot \nabla_{\phi} L(\phi;\, x^{(i:i+k)};\, y^{(i:i+k)}) \tag{3}
$$

Figure 3

Algorithm 2 K-SGD.

Algorithm 2 has two advantages:

  • It decreases the time of the parameter updates, which can lead to better convergence.

  • It makes use of the matrix optimizations in machine learning libraries, which make computing the gradient over W, X, and R very efficient.
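A single in-order pass with one update per group of k ratings, in the spirit of Equation (3), might look like the sketch below. It is an assumption-laden illustration: the name `k_group_pass`, the choice to average (rather than sum) the gradient over the group, and the dense gradient buffers are choices made for brevity, not details confirmed by the text.

```python
import numpy as np

def k_group_pass(ratings, W, X, k=4, eta=0.01, lam=0.02):
    """One pass over the stream in arrival order, updating once per group of k ratings."""
    for start in range(0, len(ratings), k):
        group = ratings[start:start + k]      # no shuffling: keep the arrival order
        gW = np.zeros_like(W)                 # dense buffers keep the sketch short
        gX = np.zeros_like(X)
        for u, i, r in group:
            err = r - W[u] @ X[i]             # errors use the factors from before the group
            gW[u] += err * X[i] - lam * W[u]
            gX[i] += err * W[u] - lam * X[i]
        W += eta * gW / len(group)            # one averaged update per group (in place)
        X += eta * gX / len(group)
    return W, X

# Hypothetical toy stream of (user index, item index, rating).
stream = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(3, 4))
X = rng.normal(0.0, 0.1, size=(3, 4))
W, X = k_group_pass(stream, W, X, k=3, eta=0.1)
```

With k = 1 the function reduces to a plain single-pass SGD; larger k trades update frequency for smoother gradient estimates.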

2.3.2. Momentum

The name momentum comes from momentum in physics: the weight vector w, thought of as a particle moving through parameter space, is accelerated by the gradient of the loss function [2]. Unlike traditional SGD, it tends to keep moving in the same direction, which prevents oscillations. Momentum has therefore been used effectively in training artificial neural networks in recent years.

Momentum is a method that accelerates SGD in the relevant direction and dampens oscillations [2]. It adds a fraction γ of the update vector of the previous time step, $v_{t-1}$, to the current update vector $v_t$:

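$$
\begin{aligned}
v_t &= \gamma \cdot v_{t-1} + \eta \cdot \nabla_{\phi} L(\phi) \\
\phi &= \phi - v_t
\end{aligned}
\tag{4}
$$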

The momentum term γ is set to 0.9. When momentum is used, the process converges faster and oscillates less. The mechanism of the algorithm is illustrated in Figure 2, where vt is added in each iteration [2].
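The momentum update of Equation (4) is short enough to write out directly. In the sketch below, the function name `momentum_step` and the quadratic toy loss are assumptions chosen for illustration; γ = 0.9 follows the text.

```python
import numpy as np

def momentum_step(phi, velocity, grad, eta=0.01, gamma=0.9):
    """Apply one update of Equation (4) and return the new (phi, velocity)."""
    velocity = gamma * velocity + eta * grad  # v_t = gamma * v_{t-1} + eta * gradient
    return phi - velocity, velocity           # phi = phi - v_t

# Toy loss L(phi) = 0.5 * ||phi||^2, whose gradient is simply phi.
phi = np.array([1.0, -2.0])
v = np.zeros_like(phi)
for _ in range(200):
    phi, v = momentum_step(phi, v, grad=phi)
```

Because the velocity accumulates past gradients, consecutive steps in the same direction grow, while steps that alternate in sign partly cancel.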


3.1. Single Model

In a single model, the algorithm plays an important role and directly affects the performance of the recommendation system. The popular recommendation algorithms can be divided into two main groups: the CB model [31] and the CF model [32].

3.1.1. Content-based

The content-based model uses features of users and items obtained by analysing textual information, such as item features, user demographics, and document content, to make recommendations. However, such features are difficult to gather and may even be fake; thus, this model has considerable limitations.

Figure 4 shows that extracting the feature vector of an item is the most important phase. Therefore, the TF-IDF text mining algorithm is used in this study:

$$
\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{ f_{t',d} : t' \in d \}} \tag{5}
$$
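As a small worked example of Equation (5), the augmented term frequency of each token in one document can be computed as follows. The function name `augmented_tf` and the toy token list are assumptions for illustration; the inverse document frequency part of TF-IDF is not shown because the text gives no formula for it.

```python
from collections import Counter

def augmented_tf(tokens):
    """Map each term t of a non-empty document d to 0.5 + 0.5 * f_td / max f_t'd."""
    counts = Counter(tokens)           # raw counts f_{t,d}
    max_count = max(counts.values())   # count of the most frequent term in d
    return {t: 0.5 + 0.5 * c / max_count for t, c in counts.items()}

tf = augmented_tf(["action", "action", "comedy"])
# "action" is the most frequent term, so tf["action"] is 1.0 and tf["comedy"] is 0.75.
```

The most frequent term always receives 1.0 and every other present term falls in (0.5, 1.0], which prevents a bias toward longer documents.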

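The loss function of the content-based model is as follows:

$$
\tau_n(W_n, b_n) = \frac{1}{2 s_n} \left\lVert \hat{X}_n^{T} W_n + b_n e_n - \hat{y}_n \right\rVert_2^2 + \frac{\lambda}{2 s_n} \left\lVert W_n \right\rVert_2^2 \tag{6}
$$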

Figure 4


In the content-based model, the important step is to identify the feature vector of each item, a numeric vector that records the key features of the item. For example, the feature vector includes characteristics of the item that are easy to identify, such as the characteristics of a movie that are relevant to the system [2].

3.1.2. Matrix factorization

In matrix factorization, for a system with n users and m items, we build a matrix R of size (n × m) in which the (u, i)th entry is rui, the rating of user u for item i [2].

In practice, the matrix R has many empty cells. This sparsity problem is one that recommendation systems must overcome, and estimating the unobserved ratings is a necessary requirement [2].

The loss function of matrix factorization is as follows:

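$$
\tau(X, W, b, d) = \frac{1}{2s} \sum_{n=1}^{N} \sum_{m:\, r_{mn}=1} \left( X_m^{T} W_n + b_m + d_n - y_{mn} \right)^2 + \frac{\lambda}{2} \left( \lVert X \rVert_F^2 + \lVert W \rVert_F^2 \right) \tag{7}
$$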
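To make the terms of Equation (7) concrete, the sketch below evaluates a loss of this form with NumPy. Several details are assumptions, since the text does not define them: s is taken to be the number of observed ratings, the data term is squared, `mask` marks the observed entries (r_mn = 1), items index the rows of X, and users index the columns of W.

```python
import numpy as np

def mf_loss(X, W, b, d, Y, mask, lam):
    """Regularized squared-error loss over the observed entries only.

    X: (M, K) item factors, W: (K, N) user factors, b: (M,) item biases,
    d: (N,) user biases, Y: (M, N) ratings, mask: (M, N) boolean, True if observed.
    """
    pred = X @ W + b[:, None] + d[None, :]   # X_m^T W_n + b_m + d_n for all pairs
    s = mask.sum()                           # assumed meaning of s: number of ratings
    data_term = ((pred - Y) ** 2)[mask].sum() / (2 * s)
    reg_term = 0.5 * lam * (np.sum(X ** 2) + np.sum(W ** 2))  # Frobenius norms
    return data_term + reg_term

# Tiny hypothetical example: 2 items, 2 users, 1 latent factor, one observed rating.
X = np.array([[1.0], [0.0]])
W = np.array([[2.0, 0.0]])
Y = np.array([[1.0, 0.0], [0.0, 0.0]])
mask = np.array([[True, False], [False, False]])
loss = mf_loss(X, W, np.zeros(2), np.zeros(2), Y, mask, lam=0.1)
```

Here the single prediction is 2 against a rating of 1, so the data term is 0.5 and the regularization term is 0.05 × (1 + 4) = 0.25.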

3.2. Hybrid System

Some recommendation systems combine several sources to create hybrid systems. Hybrid systems can thus combine the strengths of different single models into a unified system. Figure 5 shows three main ways of building hybrid recommendation systems [5]:

  1. Ensemble design: the outputs of the single algorithms are combined into one robust output.

  2. Monolithic design: a combined recommendation algorithm is built by using various data types.

  3. Mixed systems: these systems use multiple recommendation algorithms, but the items recommended by the different systems are presented together side by side.

Figure 5

The taxonomy of hybrid systems.

Alternatively, hybrid recommender systems can be classified into seven categories: weighted, switching, cascade, feature augmentation, feature combination, meta-level, and mixed. This view is more detailed than the previous one [33].


4.1. CB and CF

As mentioned above, CB and CF use different sources of input and have different strengths in different scenarios. Table 1 summarizes the advantages and disadvantages of these two models.

Categories Collaboration Input source Advantage Disadvantage

User Item
Content-based Textual description of item • Item feature extract • Non-collaboration
• Sparsity • Cold-start
Collaborative filtering Neighborhood-Based Collaborative Filtering (NBCF) User–User Community ratings • Simple • Small Data
Item–Item • Easy to explain
Model • Cold-start • Sparsity
• More accuracy
Table 1

Advantages and disadvantages of the basic models in recommendation systems

Generally, the content-based model with textual features helps overcome the cold-start problem. However, it does not use the relationships among users or items. In contrast, collaborative filtering is more robust because it uses these relationships. For this reason, a hybrid system with several sources is expected to have a better chance of achieving the best results.

4.2. Feature Combine Hybrid System

Because CB and CF use different sources, textual data in CB and ratings in CF, the question is how to combine the two models into a unified model before training.

Figure 6 presents this idea.

Figure 6

Feature combine hybrid systems

The ratings matrix R of size (m × n) is extended with d columns for the features of items. Thus, the new ratings matrix has size (m × (n + d)), where d is the number of item features and n is the number of items. The objective function is then computed as follows with a parameter vector θ:

$$
J = \text{CollaborativeObjective}(\theta) + \beta \cdot \text{ContentObjective}(\theta) + \text{Regularization} \tag{8}
$$

Notably, the matrix R is an (m × n) latent ratings matrix, and C is a (d × n) content matrix in which each item is described by d features, for example short reviews or properties of the items. Since R is a latent ratings matrix, missing values are set to 0. W is an (n × n) item-item coefficient matrix with which the rating values are predicted as $\hat{R} = RW$. In this case, however, we can also predict the rating values as $\hat{R} = CW$. Thus, instead of optimizing only $\lVert R - RW \rVert^2$, we add a content-based term $\lVert R - CW \rVert^2$. Together with diagonal/non-negativity constraints and elastic-net regularization, the extended optimization model is computed as follows:

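$$
\text{Minimize } J = \lVert R - RW \rVert^2 + \beta \cdot \lVert R - CW \rVert^2 + \lambda \cdot \lVert W \rVert^2 + \lambda_1 \cdot \lVert W \rVert_1 \tag{9}
$$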

subject to:


The weight parameter β can be determined in a tuning phase. Although the rating values can be predicted either as $\hat{R} = CW$ or as $\hat{R} = RW$, only the latter prediction function is applied. Thus, the content-based term is used only as an additional regularizer in the objective function. The purpose of this additional term is to improve the generalization power of the algorithm for future predictions.

This approach can be used to combine any other kind of (optimization-based) CF model with CB models. For example, in the case of matrix factorization, one can use an (n × k) shared item factor matrix X, a (d × k) content factor matrix Z, and an (m × k) user factor matrix U to define the optimization model as follows:

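$$
\text{Minimize } J = \lVert R - U X^{T} \rVert^2 + \beta \cdot \lVert C - Z X^{T} \rVert^2 + \lambda \left( \lVert U \rVert^2 + \lVert X \rVert^2 + \lVert Z \rVert^2 \right) \tag{10}
$$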
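The objective in Equation (10) can be evaluated directly once the five matrices are given. The sketch below is only an illustration under stated assumptions: the function name `joint_objective` is hypothetical, the norms are taken as squared Frobenius norms, and every entry of R is treated as observed (missing ratings set to 0, as described above).

```python
import numpy as np

def joint_objective(R, C, U, X, Z, beta, lam):
    """Value of J for R (m, n), C (d, n), U (m, k), X (n, k), Z (d, k)."""
    fit_ratings = np.sum((R - U @ X.T) ** 2)   # ||R - U X^T||^2
    fit_content = np.sum((C - Z @ X.T) ** 2)   # ||C - Z X^T||^2
    reg = lam * (np.sum(U ** 2) + np.sum(X ** 2) + np.sum(Z ** 2))
    return fit_ratings + beta * fit_content + reg

# Hypothetical toy sizes: m = 2 users, n = 3 items, d = 2 features, k = 1 factor.
R = np.ones((2, 3))
C = np.ones((2, 3))
U = np.ones((2, 1))
X = np.ones((3, 1))
Z = np.zeros((2, 1))
J = joint_objective(R, C, U, X, Z, beta=0.5, lam=0.1)
```

Because the item factor matrix X appears in both fit terms, the content matrix C constrains the same item factors that explain the ratings, which is what ties the two sources together.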


5.1. Data

In the experiments, this study uses MovieLens-1M and MovieLens-100K, which are summarized in Table 2, to test the hypothesis [2]. The MovieLens-100K dataset includes 943 users and 1,682 items with 100,000 ratings, and the dataset is downloaded from the website. All ratings range from 1 to 5. The MovieLens-1M dataset is larger, with 1,000,209 ratings of 3,952 movies from 6,040 users. Notably, each user rated at least 20 movies [2].

| Features | MovieLens-100K | MovieLens-1M |
|---|---|---|
| Users | 943 | 6040 |
| Items | 1682 | 3952 |
| Ratings | 100,000 | 1,000,209 |
| Ratings per user | 106.4 | 165.6 |
| Ratings per item | 59.45 | 253.09 |
| Rating sparsity | 93.70% | 95.81% |
Table 2

The statistics of the two datasets

5.2. Compare Methods

5.2.1. Evaluation measurements

We use the Root Mean Squared Error (RMSE), which is commonly used in machine learning to evaluate the accuracy of a model. It is defined as follows:

$$
\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i,j} \left( R_{ij} - \hat{R}_{ij} \right)^2} \tag{11}
$$

Lower RMSE values indicate better predictions by the model.
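Equation (11) translates into a few lines of NumPy. The function name `rmse` and the toy numbers are assumptions for illustration; N is the number of rating pairs compared.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Two ratings with errors of 1 and -2: sqrt((1 + 4) / 2) = sqrt(2.5).
value = rmse([4.0, 2.0], [3.0, 4.0])
```

Because the errors are squared before averaging, a few large mistakes raise RMSE more than many small ones.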

The second measure used in the experiments is learning time, which evaluates the speed of the learning algorithms.

5.2.2. Proposed evaluation method

To evaluate the incremental model, we design a prequential approach. Figure 7 shows the steps applied to every observed data point 〈u, i〉, which represents a rating interaction between user u and item i:

Figure 7

Real-data evaluation approach.

This approach provides the following advantages:

  • It allows management of the incremental system.

  • Offline evaluation can be integrated into online evaluation.

  • Both new users and new items can be evaluated.
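Since the steps of Figure 7 are not reproduced in the text, the sketch below assumes the usual prequential (test-then-learn) order: each incoming rating is first predicted with the current model, the error is recorded, and only then is the rating used for one SGD update. The function name, the factor initialization for unseen users and items, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np

def prequential_rmse(stream, k=4, eta=0.05, lam=0.02, seed=0):
    """Test-then-learn evaluation over a stream of (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    W, X = {}, {}                        # factors are created lazily for new users/items
    squared_errors = []
    for u, i, r in stream:
        if u not in W:
            W[u] = rng.normal(0.0, 0.1, k)
        if i not in X:
            X[i] = rng.normal(0.0, 0.1, k)
        pred = W[u] @ X[i]               # 1) test: predict before seeing the rating
        err = r - pred
        squared_errors.append(err ** 2)
        w_old = W[u].copy()              # 2) learn: one SGD step on this data point
        W[u] += eta * (err * X[i] - lam * W[u])
        X[i] += eta * (err * w_old - lam * X[i])
    return float(np.sqrt(np.mean(squared_errors)))

# Hypothetical stream; user and item ids may be any hashable values.
score = prequential_rmse([(0, 0, 4.0), (1, 0, 3.0), (0, 1, 5.0)])
```

Every data point is used once for testing and once for training, so no separate hold-out set is needed and the error can be tracked as the stream evolves.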


The results for learning time and accuracy are presented in Tables 3 and 4 for the MovieLens-100K and MovieLens-1M data, respectively. Moreover, the average values of the algorithms are compared in Figure 8.

| Measures | CB | CF: NBCF, User–User | CF: NBCF, Item–Item | CF: MFCF, ALS | CF: MFCF, SGD | CF: MFCF, KMSGD | Hybrid: Ensemble, Sequential | Hybrid: Ensemble, Parallel | Hybrid: Monolithic, Featured hybrid system |
|---|---|---|---|---|---|---|---|---|---|
| RMSA | 0.914 | 0.976 | 0.968 | 0.951 | 0.948 | 0.949 | 0.894 | 0.896 | 0.890 |
| Learning time (ms) | 0.886 | 1.186 | 1.166 | 1.986 | 0.118 | 0.101 | 0.108 | 0.106 | 0.104 |
Table 3

Overall results of MovieLens-100K data

| Measures | CB | CF: NBCF, User–User | CF: NBCF, Item–Item | CF: MFCF, ALS | CF: MFCF, SGD | CF: MFCF, KMSGD | Hybrid: Ensemble, Sequential | Hybrid: Ensemble, Parallel | Hybrid: Monolithic, Featured hybrid system |
|---|---|---|---|---|---|---|---|---|---|
| RMSA | 0.916 | 0.975 | 0.966 | 0.944 | 0.940 | 0.936 | 0.912 | 0.918 | 0.910 |
| Learning time (ms) | 3.016 | 3.180 | 3.168 | 4.168 | 0.201 | 0.198 | 0.204 | 0.201 | 0.199 |
Table 4

Overall results of MovieLens-1M data

Figure 8

Comparison of average RMSA and update time among algorithms.

6.1. About the Accuracy

Impressively, the simple CB algorithm achieves the best result among the single models, with an RMSA of 0.914. Among the CF algorithms, the differences are not significant, and KMSGD, with 0.949, is approximately the same as SGD, with 0.948. However, the difference is pronounced in the hybrid system group, especially for the featured hybrid system. If we focus only on the CF algorithms [Neighborhood-Based Collaborative Filtering (NBCF) and Model-Based Collaborative Filtering (MFCF)], the algorithm with SGD shows the best result. This means that the more complex the model, the better the performance of the recommendation system, with the exception of the impressive CB result. To explain this exception, the authors look at the data. In the MovieLens data, the features of an item are 19 binary values (19 types of movies), with 1 marking each type the movie belongs to. The TF-IDF algorithm therefore contributes little to feature extraction in the CB model.

The same holds for MovieLens-1M, where CB has an RMSA of 0.916.

However, the featured hybrid system still achieves the best value, 0.910. This suggests that the SGD algorithm contributes little to predictive accuracy, whereas the hybrid system is a good choice for recommendation system research.

6.2. About the Learning Time

Learning time is the major concern of this study. With big data, a powerful prediction system becomes useless if it takes too much time to process. The hypothesis is that the more complex the model, the slower the training. This hypothesis is supported by the learning times of CB, NBCF (User–User and Item–Item), and ALS in MFCF, which are 0.886, 1.186, 1.166, and 1.986 ms, respectively, on MovieLens-100K [2]. Impressively, the algorithms with SGD, KMSGD, and the featured hybrid system are significantly faster, with 0.118, 0.101, and 0.104 ms, respectively.

The results are similar on the MovieLens-1M data. This means that the hypothesis that the more complex the model, the slower the training does not hold in general: complexity is no longer a big problem if the model can use an incremental algorithm. On MovieLens-1M, the learning time of ALS is significantly larger than that of CB (4.168 versus 3.016 ms), while the featured hybrid system again shows one of the lowest values, 0.199 ms.

Figure 8 gives a visual comparison of the effectiveness of the models. It leads to the conclusion that incremental algorithms and hybrid systems deserve in-depth research in recommendation systems.


Starting from the SGD algorithm, this study adjusts its mechanism with mini groups and momentum. The experimental results show that the proposed model, which combines the features of CB and the robustness of matrix factorization using an incremental algorithm, achieves better results in both learning time and accuracy than other state-of-the-art algorithms. The study also contributes a novel evaluation approach for real data. Although this approach is still simple, it presents a novel research direction in recommendation systems.

The limitations of this study relate to the variety of data. In a real system, the data on new users, new items, and rankings are continuously updated. Moreover, the evaluation approach in this study is only an initial idea for a more powerful evaluation approach in the future.

In the future, the authors will consider applying parallel and distributed data analysis. Furthermore, combining the hybrid recommendation system with deep learning algorithms such as convolutional neural networks is a promising research direction for further improving performance.


The authors declare they have no conflicts of interest.


[1] L Sharma and A Gera, A survey of recommendation system: research challenges, Int. J. Eng. Trends Technol., Vol. 4, 2013, pp. 1989-1992.
[5] CC Aggarwal, Recommender systems, Vol. 1, Springer International Publishing, Cham, 2016.
[6] MD Ekstrand, JT Riedl, and JA Konstan, Collaborative filtering recommender systems, Found. Trends Human–Comput. Interact., Vol. 4, 2010, pp. 81-173.
[9] J Bennett and S Lanning, The netflix prize, Proceedings of KDD Cup and Workshop, 2007, pp. 35.
[11] A Gepperth and B Hammer, Incremental learning algorithms and applications, European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, 2016.
[22] BM Sarwar, G Karypis, J Konstan, and J Riedl, Incremental SVD-based algorithms for highly scaleable recommender systems, in Proceedings of the Fifth International Conference on Computer and Information Technology, 2002, pp. 345-354.
[26] G Takács, I Pilászy, B Németh, and D Tikk, Scalable collaborative filtering approaches for large recommender systems, J. Mach. Learn. Res., Vol. 10, 2009, pp. 623-656.
[28] S Ruder, An overview of gradient descent optimization algorithms, 2016. arXiv preprint arXiv:1609.04747
[29] R Battiti, Accelerated backpropagation learning: two optimization methods, Complex Syst., Vol. 3, 1989, pp. 331-342.
[33] R Burke, Hybrid recommender systems: survey and experiments, User Model. User Adapt. Interact., Vol. 12, 2002, pp. 331-370.