4.1. Datasets and Experiment Setting
For the experiments, three datasets were selected for analysis. The first, the Intel Lab dataset (Bodik P, Hong W, Guestrin C. Intel Lab data. http://db.csail.mit.edu/labdata/labdata.html, 2004), comprises data collected from 54 sensors deployed at the Intel Lab from 28 February 2004 to 5 April 2004. Data were collected at a frequency of one sample every 30 seconds and include temperature, humidity, light, and voltage attributes. Experimental comparisons were performed using 6000 temperature, humidity, and light readings from this dataset. The second, the Individual Household Electric Power Consumption dataset (Lichman M, UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2013), encompasses 2,075,259 measurements collected from December 2006 to November 2010 in residences in Sceaux, France. The data were acquired at one sample every 60 seconds and include attributes such as voltage, current, and power. Experimental comparisons were performed using 6000 voltage, current, and power readings from this dataset. The third is the Dodgers Loop Sensor dataset (Lichman M, UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2013), which contains data collected from 10 April 2005 to 1 October 2005 on the Glendale ramp of the Los Angeles 101 North Freeway. Experimental comparisons were performed using 6000 data points from this dataset.
Upon conducting ADF (Augmented Dickey–Fuller) unit root tests on the above attributes, we found p-values of 0.937 and 0.9024 for the temperature and humidity attributes of the Intel Lab data, and 0.7437 and 0.7598 for the current and power attributes of the Household Power Consumption data. These values are far greater than 0.05, leading to acceptance of the null hypothesis and indicating that these data exhibit stationary patterns. In contrast, the ADF test results for the illumination attribute of the Intel Lab data, the voltage attribute of the Household Power Consumption data, and the vehicle count attribute of the Dodgers Loop Sensor data yielded p-values close to 0 (0.0 for the vehicle count attribute), leading to rejection of the null hypothesis and indicating that these data exhibit nonstationary patterns, as shown in Table 1.
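The stationarity classification above rests on unit root testing. As an illustrative sketch (not the ADF statistic itself, which also accounts for higher-order lags and deterministic trends), the underlying intuition can be seen from a least-squares AR(1) coefficient estimate: a coefficient near 1 indicates a unit root (random-walk-like behaviour), while a coefficient near 0 indicates rapid mean reversion. All names and series below are hypothetical.

```python
import random

def ar1_coefficient(series):
    # Least-squares estimate of phi in x_t = phi * x_{t-1} + e_t (no intercept).
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(x * x for x in series[:-1])
    return num / den

random.seed(0)
shocks = [random.gauss(0.0, 1.0) for _ in range(2000)]

white_noise = shocks            # mean-reverting series: phi is near 0
random_walk = []                # unit-root series: phi is near 1
level = 0.0
for e in shocks:
    level += e
    random_walk.append(level)

print(ar1_coefficient(white_noise))   # close to 0
print(ar1_coefficient(random_walk))   # close to 1
```

A full ADF test (as used in the paper) augments this regression with lagged differences and compares the test statistic against Dickey–Fuller critical values to obtain the reported p-values.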
This paper compares two aspects to evaluate the effectiveness of the proposed method in data reduction: the data reduction rate (DRR) and the data reconstruction accuracy (DRA). The definition of the data reduction rate is given in Equation (10), where DRR denotes the data reduction rate, AD the total amount of data, and RD the total amount of data remaining after reduction. The data reconstruction accuracy is inspired by the Jaccard similarity between reconstructed and original data of the same length. Let X be the actual collected data and X′ the reconstructed data. The Jaccard similarity between X and X′ is calculated using Equation (11), where DRA denotes the data reconstruction accuracy and n the number of reconstructed data points. Meanwhile, the additional quantities that the proposed method must transmit are included when computing its transmission volume.
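Both metrics admit a direct computation. The sketch below assumes Equation (10) takes the form DRR = (AD − RD)/AD and that Equation (11) averages per-point min/max (Jaccard-style) ratios over the n reconstructed values; the exact forms are given in the equations referenced above, and the function names here are hypothetical.

```python
def data_reduction_rate(ad, rd):
    # Assumed form of Equation (10): fraction of the data removed by reduction.
    return (ad - rd) / ad

def reconstruction_accuracy(actual, reconstructed):
    # Assumed form of Equation (11): Jaccard-style min/max ratio per point,
    # averaged over the n reconstructed values (assumes non-negative readings).
    ratios = []
    for a, b in zip(actual, reconstructed):
        hi = max(a, b)
        ratios.append(1.0 if hi == 0 else min(a, b) / hi)
    return sum(ratios) / len(ratios)

print(data_reduction_rate(6000, 1200))                    # 0.8 -> 80% reduction
print(reconstruction_accuracy([20.0, 21.0], [20.0, 20.5]))
```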
4.2. Experiments on Adaptive Reduction Threshold
To determine the range within which the threshold varies, the algorithm is executed on the same dataset, and the resulting range of threshold variation is shown in Table 2 and Table 3. The calculated range of threshold variation is also used as a statistical reference in the subsequent data reduction comparison experiments.
Experiments were conducted using the proposed data reduction method on the temperature and humidity attributes of the Intel Lab dataset, as well as the current and power attributes of the Household Power Consumption dataset, both with stationary-type variations. The threshold range for the temperature attribute was set between 0.01 and 0.1 °C, with an average adaptive threshold of 0.0598 °C, a median threshold of 0.07 °C, a mode threshold of 0.09 °C, and a Pearson correlation coefficient of −0.432 between the threshold variation process and the temperature attribute. The threshold range for the humidity attribute was set to 0.01–0.14%, with an average adaptive threshold of 0.0654%, a median threshold of 0.06%, a mode threshold of 0.01%, and a Pearson correlation coefficient of 0.4882 between the threshold variation process and the humidity attribute. For the current attribute of the Household Power Consumption dataset, the threshold range was set between 0.2 A and 4 A, with an average adaptive threshold of 2.45 A, a median threshold of 3.4 A, a mode threshold of 4 A, and a Pearson correlation coefficient of −0.548 between the threshold variation process and the current attribute. The threshold range for the power attribute was set between 0.25 and 7.2 W, with an average adaptive threshold of 2.26 W, a median threshold of 0.85 W, a mode threshold of 0.25 W, and a Pearson correlation coefficient of 0.5003 between the threshold variation process and the power attribute. The threshold variation for all attributes showed a moderate correlation with the data, demonstrating the effectiveness of the proposed dynamic threshold adjustment mechanism in the data reduction process. The adaptive adjustment mechanism based on concept drift detection can adjust the reduction rate as the data change pattern evolves.
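The Pearson correlation coefficients reported above can be reproduced with the standard sample formula; the sketch below is a plain implementation for reference, applied to hypothetical threshold and sensor traces (not the paper's data).

```python
def pearson(xs, ys):
    # Sample Pearson correlation: covariance divided by the product of the
    # standard deviations (shared 1/(n-1) factors cancel and are omitted).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

thresholds = [0.09, 0.07, 0.05, 0.03, 0.01]   # hypothetical threshold trace
readings = [19.0, 20.5, 22.0, 23.5, 25.0]     # hypothetical sensor trace
print(pearson(thresholds, readings))          # perfectly anticorrelated
```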
The following two figures depict the threshold variation process of stationary data.
Figure 2 corresponds to the Intel Lab dataset, where
Figure 2a shows the temperature data change,
Figure 2b shows the temperature threshold change,
Figure 2c shows the humidity data change, and
Figure 2d shows the humidity threshold change.
Figure 3 corresponds to the Household Power Consumption dataset, where
Figure 3a shows the current data change,
Figure 3b shows the current threshold change,
Figure 3c shows the power data change, and
Figure 3d shows the power threshold change.
In this study, the proposed data reduction method was applied to nonstationary data: the light intensity attribute of the Intel Lab data, the voltage attribute of the Household Power Consumption dataset, and the vehicle count attribute of the Dodgers Loop Sensor dataset. The threshold range for the light intensity attribute was set from 0.1 to 0.9 Lux, with an adaptive threshold average of 0.5139 Lux, a median threshold of 0.5 Lux, and a mode threshold of 0.4 Lux. The threshold change process showed a weak negative correlation with the light intensity attribute, with a Pearson correlation coefficient of −0.398. For the voltage attribute, the threshold range was set from 0.21 to 1.5 V, with an adaptive threshold average of 1.02 V, a median threshold of 1 V, and a mode threshold of 1 V. The threshold change process showed a weak positive correlation with the voltage attribute, with a Pearson correlation coefficient of 0.344. For the Dodgers Loop Sensor, the vehicle count threshold range was set from 1 to 7, with an adaptive threshold average of 1.383, a median threshold of 1, and a mode threshold of 1. The threshold change process showed a weak negative correlation with the vehicle count attribute, with a Pearson correlation coefficient of −0.372. When dealing with nonstationary data, the threshold change process is only weakly correlated with the data attributes. The data fluctuation is relatively large, resulting in significant differences between adjacent data points. Consequently, the Kalman filter model may fail to predict the next data value accurately, and the error threshold continues to decrease. The error threshold is thus maintained at a relatively low level to ensure data accuracy, while the data reduction rate decreases.
The following figure depicts the threshold variation process for nonstationary data.
Figure 4a shows the change in light data,
Figure 4b represents the corresponding threshold variation,
Figure 4c shows the variation in voltage data, and
Figure 4d shows the voltage threshold variation.
Figure 4e illustrates the variation in count data, while
Figure 4f presents the corresponding threshold variation.
After the minimum and maximum thresholds are set, one tenth of the difference between them is used as the step size. Each time the threshold increases or decreases, it changes by one step.
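The stepping rule just described can be sketched as follows, using the temperature bounds from Section 4.2 (0.01–0.1 °C) as an example. The class name is hypothetical, and the concept drift detection logic that decides when to step is omitted.

```python
class AdaptiveThreshold:
    """Threshold clamped to [t_min, t_max]; each adjustment moves it by
    one tenth of the range, as described in the text."""

    def __init__(self, t_min, t_max, start=None):
        self.t_min, self.t_max = t_min, t_max
        self.step = (t_max - t_min) / 10.0       # one tenth of the range
        self.value = t_max if start is None else start

    def increase(self):
        # Called when the filter predicts well: allow coarser reduction.
        self.value = min(self.value + self.step, self.t_max)

    def decrease(self):
        # Called when prediction error grows (drift): tighten the threshold.
        self.value = max(self.value - self.step, self.t_min)

th = AdaptiveThreshold(0.01, 0.1)    # temperature bounds from the text
for _ in range(20):
    th.decrease()                    # repeated drift pushes it to the floor
print(th.value)                      # clamped at 0.01
```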
4.3. Experiments on Adaptive Reduction Rate and Reconstruction Accuracy
In this section, the effectiveness of the proposed method is validated on both stationary and nonstationary datasets by comparing it with fixed-threshold reduction methods, namely the basic Kalman filter and LMS filter reduction methods, as well as with the non-threshold critical+PIP reduction method. The proposed method adjusts the reduction rate dynamically based on the data change pattern, so the reduction threshold changes during the reduction process. Because various external factors affect the data, the change patterns of single-dimensional sensor data may differ at different stages, leading to differences in data reduction rate and reconstruction accuracy. Therefore, the minimum, maximum, mean, and mode of the threshold values adaptively adjusted by the proposed method on each dataset are taken as the fixed threshold values for the basic Kalman filter and LMS filter data reduction methods in the comparison. The critical+PIP algorithm is a non-threshold data reduction algorithm; its efficiency is measured by comparing its data reduction rate and data reconstruction accuracy with those of the proposed method on both stationary and nonstationary datasets.
4.3.1. Experiments on Stationary Attributes Compared with Fixed Threshold
By conducting experiments on temperature and humidity data from Intel Lab data, it is found that the proposed method has a higher data reduction rate than the basic Kalman and LMS filter data reduction methods with threshold values set by mean, median, and mode. As shown in
Table 4 and
Table 5, when the threshold is set to the maximum value, the reduction rate of the proposed method is only slightly lower than that of the Kalman and LMS filters. Moreover, the data reconstruction accuracy of the proposed method is higher than that of the traditional Kalman and LMS filter data reduction methods with threshold values set by the mean and median.
Through a comparative experiment on the current and power data in the Household Power Consumption dataset, it is found that for stationary datasets, the data reduction rate of the proposed method is higher than that of the basic Kalman filter with mean or median as the threshold. As shown in
Table 6 and
Table 7, the data reconstruction accuracy is better than that of basic Kalman filtering and LMS filtering under different threshold values.
For stationary datasets, as the reduction threshold increases, the data reduction rate of the basic Kalman filter and LMS filter continues to increase, but the data reconstruction accuracy decreases. Both the data reduction rate and the reconstruction accuracy of the traditional Kalman filter are higher than those of the LMS filter. In contrast, the proposed data reduction algorithm employs a dynamic mechanism for controlling the reduction rate, adjusting it according to the changing patterns of the data. The proposed method achieves higher data reconstruction accuracy when its data reduction rate equals that of the traditional Kalman filter.
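For reference, the fixed-threshold Kalman baseline discussed above can be sketched as a scalar filter that transmits a sample only when it deviates from the filter's prediction by more than the threshold, with the receiver reusing the prediction for suppressed samples. The process and measurement noise values q and r below are illustrative assumptions, not the paper's settings.

```python
def kalman_reduce(samples, threshold, q=1e-3, r=0.1):
    # Scalar constant-level Kalman filter used as a fixed-threshold reducer.
    x, p = samples[0], 1.0           # state estimate and its variance
    transmitted = [samples[0]]       # the first sample is always sent
    for z in samples[1:]:
        p += q                       # predict: variance grows by process noise
        if abs(z - x) > threshold:   # prediction error exceeds the threshold
            k = p / (p + r)          # Kalman gain
            x += k * (z - x)         # correct the estimate with the sample
            p *= 1.0 - k
            transmitted.append(z)    # sample must be sent to the receiver
    return transmitted

steady = [20.0] * 100                    # stable readings: almost all suppressed
print(len(kalman_reduce(steady, 0.5)))   # 1 -> 99% reduction
ramp = [float(i) for i in range(10)]     # fast-changing readings: all sent
print(len(kalman_reduce(ramp, 0.5)))     # 10 -> no reduction
```

The sketch illustrates the trade-off measured in the tables: a fixed threshold yields a high reduction rate on stable segments but transmits almost everything, or loses accuracy, once the data change quickly.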
4.3.2. Experiments on Stationary Attributes Compared with Critical+PIP
The critical+PIP algorithm is a non-threshold-based data reduction algorithm. This subsection measures the efficiency of the proposed algorithm by comparing its data reduction rate and data reconstruction accuracy with those of critical+PIP on the temperature and humidity attributes of the Intel Lab dataset and on the current and power attributes of the Household Power Consumption dataset.
The experimental results in
Table 8 show that the data reconstruction accuracy of the critical+PIP algorithm is unstable. To observe the difference in data reconstruction accuracy between the two algorithms, the data reduction rate is controlled between 20% and 80%. For the Intel Lab data, the temperature data reconstruction accuracy of the critical+PIP algorithm decreased from 91.28% to 69.73%, and the humidity data reconstruction accuracy decreased from 68.91% to 52.60%. When processing current data, the data reconstruction accuracy of the critical+PIP algorithm decreased from 99.12% to 73.97%, and when processing power data, the data reconstruction accuracy decreased from 97.10% to 60.49%. The proposed method achieves higher and more stable data reconstruction accuracy with the same data reduction rate.
4.3.3. Experiments on Nonstationary Attributes Compared with Fixed Threshold
In the following, we compare against the fixed-threshold methods on the nonstationary attributes. Analysis of the Intel Lab light data indicates that the data values remain mostly unchanged most of the time. As shown in Table 9, the proposed method exhibits a data reduction rate and data reconstruction accuracy similar to those of the other compared methods. The additional quantities that the proposed method must transmit are included when computing its transmission volume; therefore, its advantage may not be significant on the light attribute of the Intel Lab data, possibly because there are few nonlinear cases and little additional information needs to be transmitted.
In the case of voltage data shown in
Table 10 and count data shown in
Table 11, the proposed method achieves a data reduction rate that is 10% lower than that of the traditional Kalman filter with a mean value threshold for nonstationary datasets. However, the data reconstruction accuracy of the proposed method is better than that of the traditional Kalman filter and LMS filter under different threshold conditions.
When dealing with nonstationary datasets, the data reduction rate of the traditional Kalman filter and LMS filter continues to increase as the reduction threshold increases. However, the data reconstruction accuracy becomes very low, and sudden data anomalies cannot be observed in a timely manner. When processing nonstationary datasets, the proposed method automatically reduces the data reduction rate, maintaining sensitivity and sustaining high data accuracy.
4.3.4. Experiments on Nonstationary Attributes Compared with Critical+PIP
This subsection measures the efficiency of the proposed algorithm by comparing its data reconstruction accuracy with that of the critical+PIP data reduction method at the same data reduction rate. As shown in Table 12, the data reconstruction accuracy of the critical+PIP algorithm is not stable. When dealing with the light data, with the data reduction rate controlled from 50% to 80%, the data reconstruction accuracy of critical+PIP decreases from 88.48% to 77.70%, while that of the proposed method decreases from 96.69% to 93.05%. When processing the voltage data, with the data reduction rate controlled from 20% to 80%, the data reconstruction accuracy of critical+PIP decreases from 92.36% to 61.73%, while that of the proposed method decreases from 97.35% to 80.36%. When processing the vehicle count data, the data reconstruction accuracy of the critical+PIP algorithm decreases from 75% to 64.54%, while that of the proposed method decreases from 83.24% to 67.70%. The proposed method thus achieves higher data reconstruction accuracy at the same data reduction rate, and its reconstruction accuracy is more stable than that of the critical+PIP algorithm when facing nonstationary datasets.