The Challenge of Non-Technical Loss Detection using Artificial Intelligence: A Survey

Detection of non-technical losses (NTL), which include electricity theft, faulty meters and billing errors, has attracted increasing attention from researchers in electrical engineering and computer science. NTLs cause significant harm to the economy, as in some countries they may range up to 40% of the total electricity distributed. The predominant research direction is employing artificial intelligence (AI) to solve this problem. Promising approaches have been reported, falling into two categories: expert systems incorporating hand-crafted expert knowledge, and machine learning, also called pattern recognition or data mining, which learns fraudulent consumption patterns from examples without being explicitly programmed. This paper first provides an overview of how NTLs are defined and of their impact on economies. Next, it covers the fundamental pillars of AI relevant to this domain. It then surveys these research efforts in a comprehensive review of the algorithms, features and data sets used. Finally, it identifies the key scientific and engineering challenges in NTL detection and suggests how they could be solved. We believe that those challenges have not been sufficiently addressed in past contributions and that covering them is necessary in order to advance NTL detection.


I. INTRODUCTION
Our modern society and daily activities strongly depend on the availability of electricity. Electrical power grids distribute and deliver electricity from generation infrastructure such as power plants or solar cells to customers such as residences or factories. One frequently appearing problem is losses in power grids, which can be classified into two categories: technical and non-technical losses.
Technical losses occur mostly due to power dissipation. This is naturally caused by internal electrical resistance, and the affected components include generators, transformers and transmission lines.
The opposite class of losses are non-technical losses (NTL), which are primarily caused by electricity theft. In most countries, NTLs account for the predominant part of the overall losses [46]. Therefore, it is most beneficial to first reduce NTLs before reducing technical losses [3]. Nonetheless, reducing technical losses is challenging, too. In particular, NTLs include, but are not limited to, the following causes [15], [65]:
• Meter tampering in order to record lower consumptions
• Bypassing meters by rigging lines from the power source
• Arranged false meter readings by bribing meter readers
• Faulty or broken meters
• Un-metered supply
• Technical and human errors in meter readings, data processing and billing
NTLs cause significant harm to economies, including loss of revenue and profit of electricity providers, decrease of the stability and reliability of electrical power grids and extra use of limited natural resources, which in turn increases pollution.
P. Glauner, A. Boechat, L. Dolberg, J. Meira and R. State are with the Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg (email: {first.last}@uni.lu).
There are different estimates of the losses caused by NTLs. For example, in India, NTLs are estimated at US$ 4.5 billion [8]. NTLs are also reported to range up to 40% of the total electricity distributed in countries such as Brazil, India, Malaysia or Lebanon [22], [45]. They are also of relevance in developed countries; for example, estimates of NTLs in the UK and US range from US$ 1-6 billion [2], [46].
In order to detect NTLs, inspections of customers are carried out based on predictions. From an electrical engineering perspective, one method to detect losses is to calculate the energy balance [58], which requires topological information of the network. This does not work accurately for the following reasons: (i) the network topology undergoes continuous changes in order to satisfy the rapidly growing demand for electricity, (ii) infrastructure may break and lead to wrong energy balance calculations and (iii) it requires transformers, feeders and connected meters to be read at the same time. A more flexible and adaptable approach is to employ artificial intelligence (AI) [62]. AI makes it possible to analyze customer profiles, their data and known irregular behavior in order to trigger a possible inspection of a customer. However, carrying out inspections is costly, as it requires the physical presence of technicians. It is therefore important to make accurate predictions in order to reduce the number of false positives.
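The energy balance idea can be illustrated with a minimal sketch. It assumes a single transformer for which the supplied energy, the readings of all connected meters and an estimate of the technical losses are available for the same time window; the function name and all numbers are hypothetical and not taken from [58].

```python
def estimate_ntl_kwh(supplied_kwh, metered_kwh, technical_loss_kwh):
    """Energy balance for one transformer over one time window.

    All inputs must refer to the same window, which is requirement (iii)
    in the text. Names and values are illustrative assumptions.
    """
    total_loss = supplied_kwh - sum(metered_kwh)
    # Whatever is not explained by technical losses is attributed to NTL.
    return max(total_loss - technical_loss_kwh, 0.0)


# Hypothetical window: 1000 kWh supplied, three meters, 50 kWh technical loss.
ntl = estimate_ntl_kwh(1000.0, [300.0, 250.0, 280.0], 50.0)
print(ntl)  # 120.0
```

The sketch also makes the weaknesses visible: a changed topology alters which meters belong in `metered_kwh`, and readings taken at different times make the subtraction meaningless.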
The rest of this paper is organized as follows. Section II describes the field of AI. Section III provides a detailed review and critique of state-of-the-art NTL detection research employing AI methods. In Section IV, we identify the key challenges of this field that need to be accurately studied in order to enhance methods in the future. To the best of our knowledge, this topic has not been addressed yet in the literature on NTL detection. Section V summarizes this survey.

II. ARTIFICIAL INTELLIGENCE
The field of artificial intelligence (AI) attempts to both understand and build intelligent entities [62]. This name was coined in 1955 during the preparations for the first AI conference hosted at Dartmouth College [39]. While most people intuitively think about robotics, AI has more applications, such as learning patterns from data. This section provides an overview of the AI methods relevant to NTL detection.
arXiv:1606.00626v1 [cs.AI] 2 Jun 2016

A. Expert systems
Traditional AI systems were based on hand-crafted rules. Such systems are also called expert systems because they incorporate expert knowledge in their decision making process. While expert systems have initially been successful in tasks such as diagnosis and treatment of nuclear reactor accidents [48] or mission planning of autonomous underwater vehicles [33], they have the following shortcomings: (1) incorporating expert knowledge in rules is challenging, (2) many domains cannot accurately be described in rules and (3) domain knowledge may change over time, requiring amendments of the rules [31]. Nonetheless, expert systems are still being used nowadays.

B. Machine learning
To avoid the shortcomings of expert systems, a diametrically opposed approach is to learn patterns from data rather than hand-crafting rules. This branch of AI is called machine learning or pattern recognition. Both approaches have their justification and neither is generally better or worse than the other in artificial intelligence [25]. Machine learning gives computers the ability to learn from data without being explicitly programmed [50]. This property has significantly improved AI in various applications, such as handwritten digit recognition [35], facial expression recognition [71] or speech recognition [27]. Machine learning consists of three pillars: supervised, unsupervised and reinforcement learning. The term data mining is strongly related to machine learning, but has a wider scope that includes data cleaning, data preprocessing and concrete applications.
1) Supervised learning: Supervised learning algorithms learn patterns from labeled training examples $(x^{(i)}, y^{(i)})$, in which $x^{(i)}$ is a training data point and $y^{(i)}$ is the corresponding label. This is also called function induction, and typical applications include regression or classification [10]. This pillar is the best understood at present, and a wide variety of learning algorithms is available. The choice of which learning algorithm to apply to a concrete problem is challenging and often requires comparative experiments. However, having a lot of representative data is sometimes considered to be more relevant than the actual algorithm [7].
2) Unsupervised learning: Unsupervised learning uses only unlabeled data points $x^{(i)}$ in order to find hidden structure in the data [10]. Applications include dimensionality reduction methods such as Principal Component Analysis (PCA) or t-SNE [36] and clustering algorithms such as K-means.
3) Reinforcement learning: In many learning problems, there is no intuitively correct supervision. Reinforcement learning is a reward-based learning technique for choosing actions in order to reach a goal [67]. It has, for example, successfully been applied to humanoid robotics [61], autonomous helicopter flying [51] and playing the game of Go at super-human performance [64].

III. THE STATE OF THE ART
NTL detection can be treated as a special case of fraud detection, for which general surveys are provided in [11] and [32]. They highlight expert systems and machine learning as key methods to detect fraudulent behavior in applications such as credit card fraud, computer intrusion and telecommunications fraud. This section provides an overview of the existing AI methods for detecting NTLs. For other surveys of the past efforts in the field, readers are referred to [15] and [30]. Overviews of possible methods to manipulate a smart metering infrastructure are provided in [29] and [40].

A. Support Vector Machines
Support Vector Machines (SVM) [69] are a state-of-the-art classification algorithm that is less prone to overfitting. In [43], consumption data of fewer than 400 customers out of ~260K customers in Kuala Lumpur, Malaysia are used; the data are highly imbalanced and each customer has 25 monthly meter readings in the period from June 2006 to June 2008. From these meter readings, daily average consumption features per month are computed. Those features are then normalized and used for training a SVM with a Gaussian kernel. For this setting, a recall of 0.53 is achieved on the test set. In addition, a credit worthiness ranking (CWR) is used in [46]. It is computed from the electricity provider's billing system and reflects whether a customer delays or avoids payments of bills. CWR ranges from 0 to 5, where 5 represents the maximum score. CWR was observed to be a significant indicator of customers committing electricity theft. A test accuracy of 0.77 and a test recall of 0.64 are reported.
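The feature computation described for [43] can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and min-max scaling is an assumption, since the text only states that the features are normalized.

```python
def daily_average_features(monthly_kwh, days_in_period):
    """Daily average consumption (kWh/day) for each monthly reading."""
    return [kwh / days for kwh, days in zip(monthly_kwh, days_in_period)]


def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumption here."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical customer with three monthly readings.
features = daily_average_features([310.0, 280.0, 155.0], [31, 28, 31])
normalized = min_max_normalize(features)
print(features)    # [10.0, 10.0, 5.0]
print(normalized)  # [1.0, 1.0, 0.0]
```

Dividing by the number of days removes the effect of unequal month lengths, so a drop such as the one in the third period stands out. The normalized vectors would then be passed to a classifier such as a SVM with a Gaussian kernel, which is not shown here.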
SVMs are also applied to 1,350 Indian customer profiles in [21]. They are split into 135 different daily average consumption patterns, each having 10 customers. For each customer, meters are read every 15 minutes. A test accuracy of 0.984 is reported. This work is extended in [20] by encoding the 4×24 = 96-dimensional input in a lower dimension indicating possible irregularities. This encoding technique results in a simpler model that is faster to train while not losing the expressiveness of the data, and results in a test accuracy of 0.92. This work is extended in [22] by introducing high performance computing algorithms in order to enhance the performance of the algorithms previously developed in [20]. This faster model has a test accuracy of 0.89.
Profiles of 5K Brazilian industrial customers are analyzed in [57]. Each customer profile contains 10 features including the demand billed, maximum demand, installed power, etc. In this setting, a SVM slightly outperforms K-nearest neighbors (KNN) and a neural network, for which test accuracies of 0.9628, 0.9620 and 0.9448, respectively, are reported.

B. Neural networks
Neural networks [9] are loosely inspired by how the human brain works and can learn complex hypotheses from data. An ensemble of five neural networks (NN) is trained on samples of a data set containing ~20K customers in [41]. Each neural network uses features calculated from the consumption time series plus customer-specific pre-computed attributes. A precision of 0.626 and an accuracy of 0.686 are obtained on the test set.
A data set of ~22K customers is used in [17] for training a neural network. It uses the average consumption of the previous 12 months and other customer features such as location, type of customer, voltage and whether there are meter reading notes during that period. On the test set, an accuracy of 0.8717, a precision of 0.6503 and a recall of 0.2947 are reported.
Extreme learning machines (ELM) are one-hidden-layer neural networks in which the weights from the inputs to the hidden layer are randomly set and never updated. Only the weights from the hidden layer to the output layer are learned. The ELM algorithm is applied to NTL detection on 30-minute meter readings in [52], for which a test accuracy of 0.5461 is reported.
A self-organizing map (SOM) is a type of unsupervised neural network training algorithm that is used for clustering. SOMs are applied to weekly customer data of 2K customers consisting of 15-minute meter readings in [13]. This makes it possible to cluster customers' behavior into fraud or non-fraud. Inspections are only carried out if certain hand-crafted criteria are satisfied, including how well a week fits into a cluster and whether no contractual changes of the customer have taken place. A test accuracy of 0.9267, a test precision of 0.8526 and a test recall of 0.9779 are reported.

C. Expert systems and fuzzy systems
Profiles of 80K low-voltage and 6K high-voltage customers in Malaysia having meter readings every 30 minutes over a period of 30 days are used in [47] for electricity theft and abnormality detection. A test recall of 0.55 is reported. This work is related to the features of [45]; however, it relies entirely on fuzzy logic incorporating human expert knowledge for detection.
A database of ~700K Brazilian customers, ~31M monthly meter readings from January 2011 to January 2015 and ~400K inspection results is used in [24]. It employs an industrial Boolean expert system and its fuzzified extension, and optimizes the fuzzy system parameters on that database using stochastic gradient descent [6]. This fuzzy system outperforms the Boolean system. Inspired by [43], a SVM using daily average consumption features of the last 12 months performs better than the expert systems, too. The three algorithms are compared to each other on samples of varying fraud proportion containing ~100K customers. The comparison uses the area under the (receiver operating characteristic) curve (AUC), which is discussed in Section IV-A. For a NTL proportion of 5%, it reports AUC test scores of 0.465, 0.55 and 0.55 for the Boolean system, optimized fuzzy system and SVM, respectively. For a NTL proportion of 20%, it reports AUC test scores of 0.475, 0.545 and 0.55 for the Boolean system, optimized fuzzy system and SVM, respectively.
Five features of customers' consumption of the previous six months are derived in [4]: average consumption, maximum consumption, standard deviation, number of inspections and the average consumption of the residential area. These features are then used in a fuzzy c-means clustering algorithm to group the customers into c classes. Subsequently, the fuzzy membership values are used to classify customers into NTL and non-NTL using the Euclidean distance measure. On the test set, an average precision (called average assertiveness) of 0.745 is reported.
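To make the notion of fuzzy membership concrete, the following sketch computes the standard fuzzy c-means membership of one feature vector in each cluster, given fixed cluster centers. It only illustrates the membership step; the classification rule of [4] is not reproduced, and the function name, the fuzzifier value m = 2 and the example data are assumptions.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster.

    Standard update: u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)),
    where d_i is the Euclidean distance to center i and m > 1.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# Hypothetical 2-D feature vector and two cluster centers.
u = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
print([round(x, 3) for x in u])  # [0.9, 0.1]
```

Unlike K-means, each customer receives a degree of membership in every cluster, and the memberships of one customer sum to 1.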
The database of [41] is used in [42]. In the first step, an ensemble pre-filters the customers to select irregular and regular customers for training that represent the two classes well. This is done because of noise in the inspection labels. In the classification step, a neuro-fuzzy hierarchical system is used. In this setting, a neural network is used to optimize the fuzzy membership parameters, which is a different approach from the stochastic gradient descent method used in [24]. A precision of 0.512 and an accuracy of 0.682 on the test set are obtained.
The work in [46] is combined in [45] with a fuzzy logic expert system that postprocesses the output of the SVM, for ~100K customers. The motivation of that work is to integrate human expert knowledge into the decision making process in order to identify fraudulent behavior. A test recall of 0.72 is reported.

D. Genetic algorithms
The work in [43] and [46] is extended by using a genetic SVM in [44] for 1,171 customers. It uses a genetic algorithm in order to globally optimize the hyperparameters of the SVM. Each chromosome contains the Lagrangian multipliers $(\alpha_1, \ldots, \alpha_i)$, the regularization factor $C$ and the Gaussian kernel parameter $\gamma$. This model achieves a test recall of 0.62.
A data set of ~1.1M customers is used in [18]. The paper identifies the much smaller sample of inspected customers as the main challenge in NTL detection. It then proposes stratified sampling in order to increase the number of inspections and to minimize the statistical variance between them. The stratified sampling procedure is defined as a non-linear restricted optimization problem of minimizing the overall energy loss due to electricity theft. This minimization problem is solved using two methods: (1) a genetic algorithm and (2) simulated annealing. The first approach outperforms the second. Only the reduced variance is reported, which is not comparable to the other research and is therefore left out of this survey.

E. Other methods
Optimum path forest (OPF), a graph-based classifier, is used in [54]. It builds a graph in the feature space and uses so-called "prototypes", or training samples. Those become the roots of their optimum-path trees. Each graph node is classified based on its most strongly connected prototype. This approach is fundamentally different from most other learning algorithms such as SVMs or neural networks, which learn hyperplanes. Optimum path forests do not learn parameters, thus making training faster, but predicting slower compared to parametric methods. They are used in [55] for 736 customers and achieve a test accuracy of 0.9021, outperforming SVMs with Gaussian and linear kernels and a neural network, which achieve test accuracies of 0.8893, 0.4540 and 0.5301, respectively. Related results and differences between these classifiers are reported in [56].
Rough sets give lower and upper approximations of an original conventional or crisp set. Rough set analysis is applied to NTL detection in [66] on features related to [17]. This supervised learning technique makes it possible to approximate concepts that describe fraud and regular use. A test accuracy of 0.9322 is reported. The first application of rough set analysis to NTL detection is described in [12] on 40K customers, but it lacks details on the attributes used per customer; a test accuracy of 0.2 is achieved.
Different feature selection techniques for customer master data and consumption data are assessed in [53]. Those methods include complete search, best-first search, genetic search and greedy search algorithms for the master data. Other features called shape factors are derived from the consumption time series, including the impact of lunch times, nights and weekends on the consumption. These features are used in K-means for clustering similar consumption time series. In the classification step, a decision tree is used to predict whether a customer causes NTLs or not. An overall test accuracy of 0.9997 is reported.
A different method is to estimate NTLs by subtracting an estimate of the technical losses from the overall losses [63]. It models the resistance of the infrastructure in a temperature-dependent model using regression, which approximates the technical losses. It applies the model to a database of 30 customers for which the consumption was recorded for six days with meter readings every 30 minutes, for theft levels of 1, 2, 3, 4, 6, 8 and 10%. The test recalls in linear circuits are 0.2211, 0.7789, 0.9789, 1, 1, 1 and 1, respectively.

F. Summary
A summary and comparison of the performance measures of selected classifiers discussed in this review are reported in Table I. The most commonly used models comprise Boolean and fuzzy expert systems, SVMs and neural networks. In addition, genetic methods, OPF and regression methods are used. Data set sizes have a wide range from 30 up to 700K customers. However, the largest data set of 1.1M customers in [18] is not included in the table because only the variance is reduced and no other performance measure is provided. Accuracy and recall are the most popular performance measures in the literature, ranging from 0.45 to 0.99 and from 0.29 to 1, respectively. Only very few publications report the precision of their models, ranging from 0.51 to 0.85. The AUC is only reported in one publication. The challenges of finding representative performance measures and how to compare individual contributions are discussed in Sections IV-A and IV-F, respectively.

IV. CHALLENGES
The research reviewed in the previous section indicates multiple open challenges. These challenges do not apply

A. Class imbalance and evaluation metric
Imbalanced classes appear frequently in machine learning; however, this fact is mostly not addressed in the literature. This topic is well covered, for example, in [28] and [68]. The class imbalance also affects the choice of evaluation metrics. Most NTL detection research such as [17], [18], [43], [54] and [66] also ignores this topic and reports high accuracies or recalls.
The following examples demonstrate why those performance measures are not suitable for NTL detection in imbalanced data sets: for a test set containing 1K customers of which 999 have regular use, (1) a classifier always predicting non-NTL has an accuracy of 99.9%, whereas in contrast, (2) a classifier always predicting NTL has a recall of 100%. While the classifier of the first example has a very high accuracy and intuitively seems to perform very well, it will never predict any NTL. In contrast, the classifier of the second example will find all NTL, but trigger many costly and unnecessary physical inspections. This topic is addressed, for example, in [37] and [41], but these works do not use a proper single performance measure to describe the performance of a classifier on an imbalanced data set.
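The two degenerate classifiers from this example can be reproduced in a few lines. The sketch below uses plain Python and hypothetical helper names; label 1 denotes NTL and label 0 regular use.

```python
def accuracy(y_true, y_pred):
    """Share of customers whose label is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual NTL cases that are found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def precision(y_true, y_pred):
    """Share of predicted NTL cases that are actual NTL."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp else 0.0


# Test set of 1K customers: 1 NTL case, 999 regular customers.
y_true = [1] + [0] * 999
always_regular = [0] * 1000
always_ntl = [1] * 1000

print(accuracy(y_true, always_regular))  # 0.999
print(recall(y_true, always_regular))    # 0.0
print(recall(y_true, always_ntl))        # 1.0
print(precision(y_true, always_ntl))     # 0.001
```

The first classifier never finds the NTL case despite its accuracy of 99.9%, and the second one reaches full recall only by sending technicians to all 1,000 customers.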
For NTL detection, the goal is to reduce the false positive rate (FPR) to decrease the number of costly inspections, while increasing the true positive rate (TPR) to find as many NTL occurrences as possible. [24] proposes to use a receiver operating characteristic (ROC) curve, which plots the TPR against the FPR. The area under the curve (AUC) is a performance measure between 0 and 1, where any binary classifier with an AUC > 0.5 performs better than random guessing. In order to assess a NTL prediction model using a single performance measure, the AUC was picked as the most suitable one. In the preliminary work of [24], we noticed that the precision usually grows linearly with the NTL proportion in the data set. It is therefore not suitable for low NTL proportions. However, we did not notice this for the recall and made observations of non-linearity similar to the work of [63] summarized in Table I. Given the limitations of precision and recall as isolated performance measures, the $F_1$ score did not prove to work as a reliable performance measure. We believe that it is necessary to investigate this topic further in order to report reliable and imbalance-independent results that are valid for different levels of imbalance. The Matthews correlation coefficient (MCC), defined in [38] as
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
measures the accuracy of binary classifiers taking into account the imbalance of both classes, ranging from −1 to +1. We believe that this measure should be assessed further for NTL detection.
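Both measures are simple to compute. The following sketch is one possible implementation in plain Python, not the evaluation code of [24]: the AUC is computed through its equivalent pairwise form (the probability that a randomly chosen NTL customer receives a higher score than a randomly chosen regular customer, with ties counted as one half), and the MCC is defined as 0 when its denominator vanishes, which is a common convention assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# A classifier that flags every customer on a 1-vs-999 test set.
print(mcc(tp=1, tn=0, fp=999, fn=0))       # 0.0
print(auc([1, 0, 0], [0.5, 0.5, 0.5]))     # 0.5
# A perfect classifier.
print(mcc(tp=5, tn=5, fp=0, fn=0))         # 1.0
print(auc([1, 0], [0.9, 0.1]))             # 1.0
```

Both measures assign the trivial always-NTL classifier the score of random guessing, in contrast to its recall of 100%.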

B. Feature description
Different feature description methods have been reviewed in the previous section. Hand-crafting features from raw data is a long-standing issue in machine learning that has a significant impact on the performance of a NTL classifier [23]. Generally, it cannot easily be said whether a feature description is good or bad. Deep learning makes it possible to self-learn hidden correlations and increasingly more complex feature hierarchies from the raw data input [34]. This approach has led to breakthroughs in image analysis and speech recognition [27]. One possible way to overcome the challenge of feature description for NTL detection is to find a way to apply deep learning to it.

C. Incorrect inspection results
In the preliminary work of [24] we noticed that the inspection result labels in the training set are not always correct and that some fraudsters may be labelled as non-fraudulent. The reasons for this may include bribing, blackmailing or threatening of the technician performing the inspection. Also, the fraud may be done so well that it is not observable by technicians. Another reason may be incorrect processing of the data. It must be noted that the latter reason may, however, also label non-fraudulent behavior as fraudulent. Handling noise is a common challenge in machine learning. In supervised machine learning settings, most existing methods address handling noise in the input data. There are different regularization methods such as $L_1$ or $L_2$ regularization [49] or learning of invariances allowing learning algorithms to better handle noise in the input data [10], [35]. However, handling noise in the training labels is less commonly addressed in the machine learning literature. Most NTL detection research uses supervised methods, and this shortcoming of the training data and potential false positive labels in particular are rarely reported in the literature, except in [42].
Unsupervised methods such as clustering or dimensionality reduction [36] that totally ignore the labels can be used to overcome this limitation. Also, in many situations, most customers have never been inspected. In a purely supervised training strategy, the unlabeled data is discarded. However, using it may support training. This domain is called semi-supervised learning, for which semi-supervised clustering [5] has been proposed. Deep neural networks can be pre-trained using autoencoders or restricted Boltzmann machines [34] in order to take advantage of unlabeled data. Both unsupervised and semi-supervised learning should be further explored for NTL detection. Furthermore, we are not aware of research that has applied reinforcement learning to NTL detection. We believe that this direction is promising to explore, as reinforcement learning needs only very little supervision in the form of rewards.

D. Biased inspection results
In statistics, samples of data must represent the overall population in order to draw valid conclusions. This is a long-standing issue in statistics and therefore in machine learning, too, as discussed in [26]. In the preliminary work of [24] we noticed that the sample of previously inspected customers may not be representative of all customers. One reason is, for example, that electricity suppliers previously focused on certain neighborhoods for inspections. NTL classifiers trained on biased inspection data are likely to be biased, too. To the best of our knowledge, this topic has not been addressed in the literature on NTL detection. Bias correction was initially addressed in the field of computational learning theory [16]. We believe that this direction is promising to explore. For example, one promising approach may be resampling inspection data in order to be representative in terms of location and type of customer.
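One way such a resampling could look is sketched below. This is an illustrative assumption rather than an established method from the NTL literature: inspected customers are grouped into strata (for example, a location or a pair of location and customer type), and each stratum is resampled with replacement until its share matches the share of that stratum among all customers. All names and data are hypothetical.

```python
import random
from collections import defaultdict


def resample_to_population(inspected, population_share, n, seed=0):
    """Draw about n inspected records so that strata match population shares.

    inspected: list of (stratum, record) pairs.
    population_share: dict mapping stratum -> share among all customers.
    Sampling is with replacement, so under-inspected strata are up-weighted.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for stratum, record in inspected:
        by_stratum[stratum].append(record)
    sample = []
    for stratum, share in population_share.items():
        pool = by_stratum.get(stratum)
        if not pool:
            continue  # no inspections here: resampling cannot repair this gap
        k = round(share * n)
        sample.extend((stratum, rng.choice(pool)) for _ in range(k))
    return sample


# Inspections focused on the north, although customers are split evenly.
inspected = [("north", i) for i in range(8)] + [("south", i) for i in range(2)]
balanced = resample_to_population(inspected, {"north": 0.5, "south": 0.5}, n=10)
print(sum(1 for s, _ in balanced if s == "south"))  # 5
```

Resampling only reweights what has been observed. Strata without any inspections remain invisible, and heavily up-weighted strata rest on few distinct customers, so the approach reduces bias at the price of higher variance.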

E. Scalability
The number of customers used throughout the research reviewed varies significantly. For example, [43] and [63] use fewer than a few hundred customers for training. A SVM with a Gaussian kernel is used in [43], for which training is only feasible in a realistic amount of time for up to a few tens of thousands of customers in current implementations [14]. A regression model using the Moore-Penrose pseudoinverse [59] is used in [63]. This model is also only able to scale to a few tens of thousands of customers [50]. Models trained on up to a few tens of thousands of customers include [41] and [17], which use neural networks. The training methods used in those papers usually do not scale to significantly larger customer databases. Larger databases of hundreds of thousands or millions of customers are used in [18] and [24], using genetic algorithms and a SVM with a linear kernel, respectively.
We believe that a stronger investigation into the time complexity of learning algorithms, scalable computing models and technologies such as Apache Spark [70] or Google TensorFlow [1] will make it possible to efficiently handle Big Data sets for NTL detection. This will also make it possible to perform the computations in a cloud, requiring researchers to make significantly lower investments in hardware.

F. Comparison of different methods
Comparing the different methods reviewed in this paper is challenging because they are tested on different data sets, as summarized in Table I. In many cases, the description of the data lacks fundamental properties such as the number of meter readings per customer, NTL proportion, etc. In order to make results more comparable, joint efforts of different research groups are necessary to assess the performance of NTL detection systems on a comprehensive, freely available and sufficiently large data set.

V. CONCLUSION
Non-technical losses (NTL) are the predominant type of losses in electricity power grids. We have reviewed their impact on economies and potential losses of revenue and profit for electricity providers. In the literature, a vast variety of NTL detection methods employing artificial intelligence methods are reported. Expert systems and fuzzy systems are traditional detection models. Over the past years, machine learning methods have become more popular. The most commonly used methods are support vector machines and neural networks, which outperform expert systems in most settings. These models are typically applied to features computed from customer consumption profiles such as average consumption, maximum consumption and change of consumption, in addition to customer master data features such as type of customer and connection type. Sizes of databases used in the literature range widely from fewer than 100 to more than one million customers. In this survey, we have also identified the six main open challenges in NTL detection: handling imbalanced classes in the training data and choosing appropriate evaluation metrics, describing features from the data, handling incorrect inspection results, correcting the bias in the inspection results, building models scalable to Big Data sets and making results obtained through different methods comparable. We believe that these need to be accurately addressed in future research in order to advance NTL detection methods. This will make it possible to share sound, assessable, understandable, replicable and scalable results with the research community. In our current research we have started to create a database that we are planning to make available in the future. It will allow research groups to work on these challenges and to assess their advancement in NTL detection.