1 Introduction
The failure resilience of distributed machine-learning systems has attracted increasing attention (Blanchard et al., 2017; Chen et al., 2017) in the community. Larger clusters can accelerate training, but they also make the distributed system more vulnerable to failures and attacks, including crashes, computation errors, stalled processes, and compromised subsystems (Harinath et al., 2017). Failure and attack resilience is therefore becoming increasingly important for distributed machine-learning systems, especially for large-scale deep learning (Dean et al., 2012; McMahan et al., 2017).
In this paper, we consider the most general failure model, Byzantine failures (Lamport et al., 1982), in which the attackers may know any information about the other processes and may attack any value in transmission. More specifically, the data transmitted between the machines can be replaced by arbitrary values. Under this model, there are no constraints on the failures or the attackers.
The distributed training framework studied in this paper is the Parameter Server (PS). The PS architecture is composed of server nodes and worker nodes. The server nodes maintain a global copy of the model, aggregate the gradients from the workers, apply the gradients to the model, and broadcast the latest model to the workers. The worker nodes pull the latest model from the server nodes, compute the gradients on their local portion of the training data, and send the gradients to the server nodes. The entire dataset and the corresponding workload are distributed across multiple worker nodes, thus parallelizing the computation by partitioning the dataset. Several distributed machine learning systems use the PS architecture. For instance, TensorFlow (Abadi et al., 2016), CNTK (Seide & Agarwal, 2016), and MXNet (Chen et al., 2015) implement internal PSs.
In this paper, we study the Byzantine resilience of synchronous Stochastic Gradient Descent (SGD), a popular class of learning algorithms that use the PS architecture. Its variants are widely used in training deep neural networks (Kingma & Ba, 2014; Mukkamala & Hein, 2017). Such algorithms always wait to collect gradients from all the worker nodes before moving on to the next iteration.
Figure 1: The $i$-th row represents the gradient vector produced by the $i$-th worker. The $j$-th column represents the $j$-th dimension of the gradients. A shaded block indicates that the corresponding value is replaced by a Byzantine value. In both examples, each dimension contains at most a fixed number of Byzantine values. For the classic Byzantine model (a), all the Byzantine values must lie in the same workers (rows), while for the generalized Byzantine model (b) there is no such constraint. Thus, (a) is a special case of (b).
The failure model can be described using an $m \times d$ matrix consisting of the $d$-dimensional gradients produced by $m$ workers, as visualized in Figure 1. Previous work (Blanchard et al., 2017) discusses a special case of our failure model, where the Byzantine values must lie in the same rows (workers), as shown in Figure 1(a). Our failure model generalizes the classic Byzantine failure model by allowing the Byzantine values to be placed anywhere in the matrix without any constraint.
There are many possible types of attacks. In general, the attackers want to disturb the model training, i.e., make SGD converge slowly or converge to a bad solution. We list some of the possible attacks in the following three paragraphs.
We name the most general type of attack gambler. The attackers can change a portion of the data on the communication media, such as the wires or the network interfaces. They randomly pick data and maliciously change them (e.g., multiply them by a large negative value). As a result, the gradients collected on the server nodes are partially replaced by arbitrary values.
Another possible type of attack is called omniscient. The attackers are assumed to know the gradients sent by all the workers, and they replace some of the gradient vectors with the sum of all the gradients scaled by a large negative value. The goal is to mislead SGD into taking a large step in the opposite direction.
There are also weaker attacks, such as the Gaussian attack, where some of the gradient vectors are replaced by random vectors sampled from a Gaussian distribution with large variance. Such attackers do not require any information from the workers.
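To make the omniscient and Gaussian attacks concrete, the following is a minimal NumPy sketch that operates on a matrix of gradients with one row per worker. The function names, the choice of corrupting the first $q$ rows, and the toy gradients are illustrative assumptions; the standard deviation 200 and the scale 1e20 are the values reported in the experiments.

```python
import numpy as np

def gaussian_attack(grads, q, std=200.0, rng=None):
    """Replace the first q rows (workers) with zero-mean Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    attacked = grads.copy()
    attacked[:q] = rng.normal(0.0, std, size=(q, grads.shape[1]))
    return attacked

def omniscient_attack(grads, q, scale=1e20):
    """Replace the first q rows with the negated, scaled sum of the correct rows."""
    attacked = grads.copy()
    attacked[:q] = -scale * grads[q:].sum(axis=0)
    return attacked

# Hypothetical toy setting: 20 workers, 5 dimensions, 6 Byzantine workers.
rng = np.random.default_rng(0)
m, d, q = 20, 5, 6
true_grad = np.ones(d)
grads = true_grad + 0.1 * rng.normal(size=(m, d))  # honest, noisy gradients

omni = omniscient_attack(grads, q)
gauss = gaussian_attack(grads, q, rng=rng)

# Under the omniscient attack, plain averaging points away from the true gradient.
mean_points_backwards = float(omni.mean(axis=0) @ true_grad) < 0
```

Only the rows chosen as Byzantine change; the remaining rows stay equal to the honest gradients, which matches the classic (row-wise) failure model.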
Under the generalized Byzantine failure model, we ask which aggregation rules, and under what conditions, allow synchronous SGD to still converge to good solutions. We propose novel median-based aggregation rules, with which SGD is Byzantine resilient under the following condition: for each dimension, among the $m$ values provided by the $m$ workers, the number of Byzantine values must be less than half of $m$. We call this property “dimensional Byzantine resilience”. The main contributions of this paper are listed below:

We propose three aggregation rules for synchronous SGD with provable convergence to critical points: geometric median (Definition 6), marginal median (Definition 7), and “mean around median” (Definition 8). As far as we know, this paper is the first to theoretically and empirically study median-based aggregation rules under non-convex settings.

We show that the three proposed robust aggregation rules have low computation cost. Their time complexities are nearly linear, which is of the same order as the default choice for non-Byzantine aggregation, i.e., averaging.

We formulate the dimensional Byzantine resilience property, and prove that marginal median and “mean around median” are dimensional Byzantine-resilient (Definition 5). As far as we know, this paper is the first to study generalized Byzantine failures and dimensional Byzantine resilience for synchronous SGD.
2 Model
We consider the Parameter Server architecture consisting of $m$ workers. The goal is to find the optimizer of the following problem:
$$\min_{x \in \mathbb{R}^d} F(x), \quad F(x) = \mathbb{E}_{z \sim \mathcal{D}}\left[f(x; z)\right],$$
where the expectation is with respect to the random variable $z$. The PS executes synchronous SGD for distributed training. In each round, the server nodes collect gradients from the workers. In the $t$-th round, the server nodes aggregate the gradients from the workers and broadcast the updated parameters to the workers. $\tilde{v}_i^t$ is the vector sent by the $i$-th worker in the $t$-th round, potentially Byzantine. Using aggregation rule $\mathrm{Aggr}(\cdot)$, the server nodes update the parameters as follows:
$$x^{t+1} = x^t - \gamma^t \, \mathrm{Aggr}(\{\tilde{v}_i^t : i \in [m]\}),$$
where $\gamma^t$ is the learning rate. The worker nodes pull the latest parameters from the server nodes, compute the gradients on their local portion of the training data, and send the gradients to the server nodes. Without Byzantine failures, the $i$-th worker calculates $v_i^t \sim G^t$, where $G^t = \nabla f(x^t; z)$. With Byzantine failures, $\{v_i^t : i \in [m]\}$ are partially replaced by arbitrary values, which results in $\{\tilde{v}_i^t : i \in [m]\}$.
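The synchronous round described above can be sketched in a few lines. This is a minimal single-process simulation under stated assumptions, not the PS implementation used in the experiments: `sync_sgd`, `noisy_quadratic_grad`, and the toy objective $F(x) = \frac{1}{2}\|x\|^2$ are hypothetical names and choices, and the aggregation rule is passed in as a plain function so that any robust rule can be substituted for averaging.

```python
import numpy as np

def sync_sgd(grad_fn, aggregate, x0, num_workers, lr, rounds, rng):
    """Simulate synchronous SGD with a pluggable aggregation rule."""
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        # Every worker computes a stochastic gradient at the same parameters.
        grads = np.stack([grad_fn(x, rng) for _ in range(num_workers)])
        # The server waits for all gradients, aggregates them, and updates x.
        x = x - lr * aggregate(grads)
    return x

def noisy_quadratic_grad(x, rng):
    """Stochastic gradient of the toy objective F(x) = 0.5 * ||x||^2."""
    return x + 0.01 * rng.normal(size=x.shape)

rng = np.random.default_rng(0)
x_final = sync_sgd(
    noisy_quadratic_grad,
    lambda g: g.mean(axis=0),  # averaging, the non-robust default
    x0=np.ones(3),
    num_workers=20,
    lr=0.1,
    rounds=200,
    rng=rng,
)
```

Keeping the aggregation rule as an argument mirrors the update equation: only the function applied to the collected vectors changes between Mean, Krum, and the median-based rules.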
Since the Byzantine failure model assumes the worst case, the attackers may have full knowledge of the entire system, including the gradients generated by all the workers and the aggregation rule $\mathrm{Aggr}(\cdot)$. The malicious processes can even collaborate with each other (Lynch, 1996).
3 Byzantine Resilience
In this section, we formally define the classic Byzantine resilience property and its generalized version: dimensional Byzantine resilience.
Suppose that in a specific round, the correct vectors $\{v_i : i \in [m]\}$ are i.i.d. samples drawn from the random variable $G = \nabla f(x; z)$, where $G$ is an unbiased estimator of the gradient $\nabla F(x)$. Thus, $\mathbb{E}[v_i] = \mathbb{E}[G] = g$, for any $i \in [m]$. We simplify the notation by omitting the round index $t$.
We first introduce the classic Byzantine model proposed by Blanchard et al. (2017). With Byzantine workers, the actual vectors $\{\tilde{v}_i : i \in [m]\}$ received by the server nodes are as follows:
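Definition 1 (Classic Byzantine Model).
$$\tilde{v}_i = \begin{cases} v_i, & \text{if the $i$-th worker is correct}, \\ \text{arbitrary}, & \text{if the $i$-th worker is Byzantine}. \end{cases} \tag{1}$$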
Note that the indices of Byzantine workers can change throughout different rounds. Furthermore, the server nodes are not aware of which workers are Byzantine. The only information given is the number of Byzantine workers, if necessary.
We directly use the same definition of classic Byzantine resilience proposed in (Blanchard et al., 2017).
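Definition 2.
(Classic Byzantine Resilience). Let $0 \le \alpha < \pi/2$ be any angular value, and $q$ any integer with $0 \le q \le m$. Let $\{v_i : i \in [m]\}$ be any i.i.d. random vectors in $\mathbb{R}^d$, $v_i \sim G$, with $\mathbb{E}[G] = g$. Let $\{\tilde{v}_i : i \in [m]\}$ be the set of vectors, of which up to $q$ are replaced by arbitrary vectors in $\mathbb{R}^d$, while the others still equal the corresponding $\{v_i\}$. Aggregation rule $\mathrm{Aggr}(\cdot)$ is said to be classic $(\alpha, q)$-Byzantine resilient if $\mathrm{Aggr}(\{\tilde{v}_i : i \in [m]\})$ satisfies (i) $\langle \mathbb{E}[\mathrm{Aggr}], g \rangle \ge (1 - \sin\alpha)\|g\|^2 > 0$ and (ii) for $r = 2, 3, 4$, $\mathbb{E}\|\mathrm{Aggr}\|^r$ is bounded above by a linear combination of terms $\mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_k}$ with $r_1 + \cdots + r_k = r$.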
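The baseline algorithm Krum, denoted as $\mathrm{Krum}(\cdot)$ (Blanchard et al., 2017), is defined as follows:
Definition 3.
$$\mathrm{Krum}(\{\tilde{v}_i : i \in [m]\}) = \tilde{v}_k, \quad k = \arg\min_{i \in [m]} \sum_{i \to j} \|\tilde{v}_i - \tilde{v}_j\|^2,$$
where $i \to j$ is the indices of the $m - q - 2$ nearest neighbours of $\tilde{v}_i$ in $\{\tilde{v}_i : i \in [m]\}$ measured by Euclidean distance.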
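As an illustration of Definition 3, the sketch below computes the Krum score of each vector as the sum of squared distances to its $m - q - 2$ nearest neighbours and returns the vector with the minimal score. It is a minimal NumPy sketch written for readability (it materializes all pairwise differences), not the implementation used by Blanchard et al. (2017); the function name and the toy data are assumptions.

```python
import numpy as np

def krum(vectors, q):
    """Return the input vector with the smallest Krum score."""
    v = np.asarray(vectors, dtype=float)
    m = v.shape[0]
    k = m - q - 2  # number of nearest neighbours used in each score
    if k < 1:
        raise ValueError("Krum requires m - q - 2 >= 1")
    # Pairwise squared Euclidean distances, shape (m, m).
    diff = v[:, None, :] - v[None, :, :]
    sq_dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(sq_dist, np.inf)  # a vector is not its own neighbour
    # Score of each vector: sum of its k smallest squared distances.
    scores = np.sort(sq_dist, axis=1)[:, :k].sum(axis=1)
    return v[np.argmin(scores)]

# Hypothetical toy data: 8 honest vectors near 1 and 2 Byzantine vectors at 100.
rng = np.random.default_rng(0)
honest = 1.0 + 0.01 * rng.normal(size=(8, 4))
byzantine = np.full((2, 4), 100.0)
chosen = krum(np.vstack([honest, byzantine]), q=2)
```

Because the output is always one of the input vectors, a single corrupted coordinate in every vector is enough to corrupt the output, which is the situation the dimensional model is designed to capture.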
The Krum aggregation is classic Byzantine resilient under certain assumptions:
Lemma 1 (Blanchard et al. (2017)).
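Let $v_1, \ldots, v_m$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. $q$ of $\{v_i : i \in [m]\}$ are replaced by arbitrary $d$-dimensional vectors $b_1, \ldots, b_q$. If $2q + 2 < m$ and $\eta(m, q)\sqrt{d}\,\sigma < \|g\|$, where
$$\eta^2(m, q) = 2\left(m - q + \frac{q(m - q - 2) + q^2(m - q - 1)}{m - 2q - 2}\right),$$
then the $\mathrm{Krum}$ function is classic $(\alpha, q)$-Byzantine resilient, where $0 \le \alpha < \pi/2$ is defined by $\sin\alpha = \eta(m, q)\sqrt{d}\,\sigma / \|g\|$.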
The generalized Byzantine model is denoted as:
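Definition 4 (Generalized Byzantine Model).
$$(\tilde{v}_i)_j = \begin{cases} (v_i)_j, & \text{if the $j$-th dimension of $v_i$ is correct}, \\ \text{arbitrary}, & \text{otherwise}, \end{cases} \tag{2}$$
where $(v_i)_j$ is the $j$-th dimension of the vector $v_i$.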
Based on the Byzantine model above, we introduce a generalized Byzantine resilience property, dimensional Byzantine resilience, which is defined as follows:
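Definition 5.
(Dimensional Byzantine Resilience). Let $0 \le \alpha < \pi/2$ be any angular value, and $q$ any integer with $0 \le q \le m$. Let $\{v_i : i \in [m]\}$ be any i.i.d. random vectors in $\mathbb{R}^d$, $v_i \sim G$, with $\mathbb{E}[G] = g$. Let $\{\tilde{v}_i : i \in [m]\}$ be the set of vectors. For each dimension, up to $q$ of the $m$ values are replaced by arbitrary values, i.e., for dimension $j \in [d]$, $q$ of $\{(\tilde{v}_i)_j : i \in [m]\}$ are Byzantine, where $(\tilde{v}_i)_j$ is the $j$-th dimension of the vector $\tilde{v}_i$. Aggregation rule $\mathrm{Aggr}(\cdot)$ is said to be dimensional $(\alpha, q)$-Byzantine resilient if $\mathrm{Aggr}(\{\tilde{v}_i : i \in [m]\})$ satisfies (i) $\langle \mathbb{E}[\mathrm{Aggr}], g \rangle \ge (1 - \sin\alpha)\|g\|^2 > 0$ and (ii) for $r = 2, 3, 4$, $\mathbb{E}\|\mathrm{Aggr}\|^r$ is bounded above by a linear combination of terms $\mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_k}$ with $r_1 + \cdots + r_k = r$.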
Note that classic Byzantine resilience is a special case of dimensional Byzantine resilience. For classic Byzantine resilience defined in Definition 2, all the Byzantine values must lie in the same subset of workers, as shown in Figure 1(a).
In the following theorems, we show that Mean and Krum are not dimensional Byzantine resilient. The proofs are provided in the appendix.
Theorem 1.
Averaging is not dimensional Byzantine resilient.
Theorem 2.
Any aggregation rule that outputs one of its input vectors, i.e., $\mathrm{Aggr} \in \{\tilde{v}_i : i \in [m]\}$, is not dimensional Byzantine resilient.
Note that Krum chooses the vector with the minimal score. Thus, based on the theorem above, we obtain the following corollary.
Corollary 1.
$\mathrm{Krum}(\cdot)$ is not dimensional Byzantine resilient.
If an aggregation rule is dimensional or classic Byzantine resilient and the corresponding assumptions are satisfied, SGD converges to critical points almost surely, by Proposition 2 of Blanchard et al. (2017). We state the following lemma without proof.
Lemma 2 (Blanchard et al. (2017)).
Assume that (i) the cost function is three times differentiable with continuous derivatives, and is nonnegative, ; (ii) the learning rates satisfy and ; (iii) the gradient estimator satisfies and , for some constants , ; (iv) there exists a constant such that for all x where ; (v) finally, beyond a certain horizon, , there exist and such that , and . Then the sequence of gradients converges almost surely to zero, if the aggregation rule satisfies Byzantine Resilience defined in Definition 2 or 5.
4 Median-based Aggregation
With the Byzantine failure models defined in Equations (1) and (2), we propose three median-based aggregation rules, which are Byzantine resilient under certain conditions.
4.1 Geometric Median
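The geometric median is used as a robust estimator of the mean (Chen et al., 2017).
Definition 6.
The geometric median of $\{\tilde{v}_i : i \in [m]\}$, denoted by $\mathrm{GeoMed}(\{\tilde{v}_i : i \in [m]\})$, is defined as
$$\mathrm{GeoMed}(\{\tilde{v}_i : i \in [m]\}) = \arg\min_{v \in \mathbb{R}^d} \sum_{i=1}^{m} \|v - \tilde{v}_i\|.$$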
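Since the geometric median has no closed form, it has to be approximated iteratively. The sketch below uses Weiszfeld's fixed-point iteration, which is a simple classical alternative and not the nearly-linear-time method of Cohen et al. (2016) cited in this paper. The function name, the iteration budget, the tolerance, and the small distance floor are assumptions made for illustration.

```python
import numpy as np

def geometric_median(vectors, iters=200, tol=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration."""
    v = np.asarray(vectors, dtype=float)
    z = v.mean(axis=0)  # start from the coordinate-wise mean
    for _ in range(iters):
        dist = np.linalg.norm(v - z, axis=1)
        # Inverse-distance weights; the floor avoids division by zero
        # when the iterate coincides with an input vector.
        w = 1.0 / np.maximum(dist, 1e-12)
        z_new = (w[:, None] * v).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

# Four corners of the unit square plus one far outlier.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]])
gm = geometric_median(points)
```

On this toy input the mean is pulled to roughly $(20.4, 20.4)$ by the outlier, whereas the geometric median stays inside the unit square.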
The following theorem shows the classic Byzantine resilience of geometric median. A proof is provided in the appendix.
Theorem 3.
Let be any i.i.d. random dimensional vectors s.t. , with and . of are replaced by arbitrary dimensional vectors . If and , where , then the function is classic Byzantine resilient where is defined by .
4.2 Marginal Median
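The marginal median is another generalization of the one-dimensional median.
Definition 7.
We define the marginal median aggregation rule $\mathrm{MarMed}(\cdot)$ as
$$\mathrm{MarMed}(\{\tilde{v}_i : i \in [m]\}) = \mu,$$
where for any $j \in [d]$, the $j$-th dimension of $\mu$ is $\mu_j = \mathrm{median}(\{(\tilde{v}_1)_j, \ldots, (\tilde{v}_m)_j\})$, $(\tilde{v}_i)_j$ is the $j$-th dimension of the vector $\tilde{v}_i$, and $\mathrm{median}(\cdot)$ is the one-dimensional median.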
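Definition 7 maps directly onto a column-wise median of the gradient matrix. The following minimal NumPy sketch illustrates this on a toy matrix in which every dimension has exactly one Byzantine value, each in a different worker; the function name and the data are assumptions for illustration.

```python
import numpy as np

def marginal_median(vectors):
    """Coordinate-wise median over the rows (workers) of an m x d matrix."""
    return np.median(np.asarray(vectors, dtype=float), axis=0)

# Five workers that all report [1, 2, 3], with one corrupted value per dimension.
received = np.tile([1.0, 2.0, 3.0], (5, 1))
received[0, 0] = 1e9
received[1, 1] = -1e9
received[2, 2] = 1e9

marmed = marginal_median(received)
```

In this example three of the five vectors are corrupted, so the classic model would count three Byzantine workers, yet each dimension contains only one Byzantine value and the marginal median recovers the honest values.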
The following theorem shows that the vector produced by $\mathrm{MarMed}(\cdot)$ is dimensional Byzantine resilient. A proof is provided in the appendix.
Theorem 4.
Let be any i.i.d. random dimensional vectors s.t. , with and . For any dimension , of are replaced by arbitrary values, where is the th dimension of the vector . If and , where , then the function is dimensional Byzantine resilient where is defined by .
4.3 Beyond Median
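We can also utilize more values for each dimension along with the median, if $q$ is given or easily estimated. More specifically, for each dimension, we take the average of the $m - q$ values nearest to the median (including the median itself). We call the resulting aggregation rule “mean around median”, which is defined as follows:
Definition 8.
We define the mean-around-median aggregation rule $\mathrm{MeaMed}(\cdot)$ as
$$\mathrm{MeaMed}(\{\tilde{v}_i : i \in [m]\}) = \mu,$$
where for any $j \in [d]$, the $j$-th dimension of $\mu$ is $\mu_j = \frac{1}{m - q} \sum_{i \to j} (\tilde{v}_i)_j$, $i \to j$ is the indices of the top $m - q$ values lying in $\{(\tilde{v}_1)_j, \ldots, (\tilde{v}_m)_j\}$ nearest to the median $\mathrm{median}(\{(\tilde{v}_i)_j : i \in [m]\})$, and $(\tilde{v}_i)_j$ is the $j$-th dimension of the vector $\tilde{v}_i$.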
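The rule in Definition 8 can be sketched as follows: compute the per-dimension median, keep the $m - q$ values closest to it in each dimension, and average them. The function name and toy data are assumptions; the sketch uses `np.argpartition` so that no full sort is needed.

```python
import numpy as np

def mean_around_median(vectors, q):
    """Per dimension, average the m - q values nearest to the median."""
    v = np.asarray(vectors, dtype=float)
    m = v.shape[0]
    if not 0 <= q < m:
        raise ValueError("q must satisfy 0 <= q < m")
    keep = m - q
    med = np.median(v, axis=0)
    dist = np.abs(v - med)  # distance of every value to its column median
    # Row indices of the `keep` smallest distances in every column.
    idx = np.argpartition(dist, keep - 1, axis=0)[:keep]
    nearest = np.take_along_axis(v, idx, axis=0)
    return nearest.mean(axis=0)

# Five workers reporting [1, 2, 3], with one corrupted value per dimension.
received = np.tile([1.0, 2.0, 3.0], (5, 1))
received[0, 0] = 1e9
received[1, 1] = -1e9
received[2, 2] = 1e9

meamed = mean_around_median(received, q=1)
```

With $q = 0$ the rule reduces to plain averaging, which is one way to see why it can use more of the honest information than the median alone when $q$ is estimated well.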
We show that $\mathrm{MeaMed}(\cdot)$ is dimensional Byzantine resilient.
Theorem 5.
Let be any i.i.d. random dimensional vectors s.t. , with and . For any dimension , of are replaced by arbitrary values, where is the th dimension of the vector . If and , where , then the function is dimensional Byzantine resilient where is defined by .
The mean-around-median aggregation can be viewed as a trimmed average centered at the median, which filters out the values far away from the median.
4.4 Time Complexity
For the geometric median $\mathrm{GeoMed}(\cdot)$, there is no closed-form solution. A $(1+\epsilon)$-approximate geometric median can be computed in $O(dm \log^3 \frac{1}{\epsilon})$ time (Cohen et al., 2016), which is nearly linear in $dm$. To compute the marginal median $\mathrm{MarMed}(\cdot)$, we only need to compute the median value of each dimension. The simplest way is to apply any sorting algorithm to each dimension, which yields time complexity $O(dm \log m)$. To obtain the median values, there also exists a selection algorithm (Blum et al., 1973) with average time complexity $O(m)$ per dimension ($O(m^2)$ in the worst case). Thus, we can get the marginal median with time complexity $O(dm)$ on average, which is of the same order as using the mean value for aggregation. For $\mathrm{MeaMed}(\cdot)$, the computation additional to computing the marginal median takes linear time $O(dm)$. Thus, its time complexity is the same as that of $\mathrm{MarMed}(\cdot)$. Note that for Krum and Multi-Krum, the time complexity is $O(dm^2)$ (Blanchard et al., 2017).
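The difference between sorting and selection can be illustrated with NumPy, whose `np.partition` places the requested order statistics in their sorted positions without fully sorting each column. The sketch below is an illustration of the selection-based approach under that assumption; the function name is hypothetical.

```python
import numpy as np

def marginal_median_by_selection(vectors):
    """Marginal median computed with partial selection instead of sorting."""
    v = np.asarray(vectors, dtype=float)
    m = v.shape[0]
    lo, hi = (m - 1) // 2, m // 2  # the two middle ranks (equal when m is odd)
    # Only the middle order statistics are moved to their sorted positions.
    part = np.partition(v, sorted({lo, hi}), axis=0)
    return 0.5 * (part[lo] + part[hi])

rng = np.random.default_rng(0)
odd = rng.normal(size=(7, 4))
even = rng.normal(size=(8, 4))
median_odd = marginal_median_by_selection(odd)
median_even = marginal_median_by_selection(even)
```

For an even number of workers the sketch averages the two middle values, which matches the convention of `np.median`.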
Table 1: Datasets and default hyperparameters.
Dataset | # train | # test | $\gamma$ | # rounds | Batch size | Evaluation metric
MNIST (Loosli et al., 2007) | 60k | 10k | 0.1 | 500 | 32 | top-1 accuracy
CIFAR-10 (Krizhevsky & Hinton, 2009) | 50k | 10k | 5e-4 | 4000 | 128 | top-3 accuracy
Figure 3: Top-1 accuracy of MLP on MNIST with the Gaussian attack. 6 out of 20 gradient vectors are replaced by i.i.d. random vectors drawn from a Gaussian distribution with zero mean and standard deviation 200.
with probability 0.05%.
5 Experiments
In this section, we evaluate the convergence and Byzantine resilience properties of the proposed algorithms. We consider two image classification tasks: handwritten digit classification on the MNIST dataset using a multi-layer perceptron (MLP) with two hidden layers, and object recognition on the CIFAR-10 dataset using a convolutional neural network (CNN) with five convolutional layers and two fully-connected layers. The details of these two neural networks can be found in the appendix. There are $m = 20$ worker processes. We repeat each experiment ten times and take the average. To make the conditions as fair as possible for all the algorithms, we ensure that all the algorithms are run with the same set of random seeds. The details of the datasets and the default hyperparameters of the corresponding models are listed in Table 1. We use top-1 or top-3 accuracy on the testing sets (disjoint from the training sets) as evaluation metrics.
The baseline aggregation rules are Mean, Medoid, Krum (Definition 3), and Multi-Krum. Medoid, defined as follows, is a computation-efficient version of the geometric median.
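Definition 9.
The medoid of $\{\tilde{v}_i : i \in [m]\}$, denoted by $\mathrm{Medoid}(\{\tilde{v}_i : i \in [m]\})$, is defined as
$$\mathrm{Medoid}(\{\tilde{v}_i : i \in [m]\}) = \arg\min_{v \in \{\tilde{v}_i : i \in [m]\}} \sum_{i=1}^{m} \|v - \tilde{v}_i\|.$$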
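The medoid restricts the minimization of the geometric median to the input vectors themselves, so it can be found by comparing row sums of the pairwise distance matrix. The sketch below is a minimal illustration; the function name and toy points are assumptions.

```python
import numpy as np

def medoid(vectors):
    """Return the input vector minimizing the sum of distances to all inputs."""
    v = np.asarray(vectors, dtype=float)
    diff = v[:, None, :] - v[None, :, :]
    dist = np.linalg.norm(diff, axis=2)  # pairwise Euclidean distances, (m, m)
    return v[np.argmin(dist.sum(axis=1))]

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.2, 0.2], [10.0, 10.0]])
center = medoid(points)
```

Like Krum, the medoid always outputs one of the received vectors, so Theorem 2 applies to it as well.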
Multi-Krum is a variant of Krum defined in Blanchard et al. (2017), which takes the average of several vectors selected by multiple rounds of Krum. We compare these baseline algorithms with the proposed algorithms: geometric median (GeoMed, Definition 6), marginal median (MarMed, Definition 7), and “mean around median” (MeaMed, Definition 8) under different settings in the following subsections.
Note that all the experiments with the CNN on CIFAR-10 show results similar to the experiments with the MLP on MNIST. Thus, we only show the results of the CNN in Section 5.5 as an example. The remaining results are provided in the appendix.
5.1 Convergence without Byzantine Failures
First, we evaluate the convergence without Byzantine failures. The goal is to empirically evaluate the bias and variance caused by the robust aggregation rules.
In Figure 10, we show the top-1 accuracy on the testing set of MNIST. The gaps between different algorithms are tiny. Among all the algorithms, Multi-Krum, GeoMed, and MeaMed have the least bias. They behave just like averaging. MarMed converges slightly slower. Medoid and Krum both have the slowest convergence.
5.2 Gaussian Attack
We test classic Byzantine resilience in this experiment. We consider attackers that replace some of the gradient vectors with Gaussian random vectors with zero mean and isotropic covariance with standard deviation 200. We refer to this kind of attack as the Gaussian attack. In the figure, we also include averaging without Byzantine failures as a baseline. 6 out of the 20 gradient vectors are Byzantine. The results are shown in Figure 3. As expected, averaging is not Byzantine resilient. The gaps between all the other algorithms are still tiny. GeoMed and MeaMed perform as if there were no Byzantine failures at all. Multi-Krum and MarMed converge slightly slower. Medoid and Krum perform worst. Although Medoid is not Byzantine resilient, the Gaussian attack is weak enough that Medoid is still effective.
5.3 Omniscient Attack
We test classic Byzantine resilience in this experiment. This kind of attacker is assumed to know all the correct gradients. For each Byzantine gradient vector, the gradient is replaced by the negative sum of all the correct gradients, scaled by a large constant (1e20 in the experiments). Roughly speaking, this attack tries to make the parameter server take a long step in the opposite direction. 6 out of the 20 gradient vectors are Byzantine. The results are shown in Figure 4. MeaMed still performs as if there were no failure. Multi-Krum is not as good as MeaMed, but the gap is small. Krum converges slower but still converges to the same accuracy. However, GeoMed and MarMed converge to bad solutions. Mean and Medoid do not tolerate this attack.
5.4 Bit-flip Attack
We test dimensional Byzantine resilience in this experiment. Knowing the information of other workers can be difficult in practice. Thus, we use a more realistic scenario in this experiment. The attacker only manipulates some individual floating-point numbers by flipping the 22nd, 30th, 31st, and 32nd bits. For each of the first 1000 dimensions, 1 of the 20 floating-point numbers is manipulated using the bit-flip attack. The results are shown in Figure 5. As expected, only MarMed and MeaMed are dimensional Byzantine resilient.
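A bit-flip attack of this kind can be sketched by reinterpreting 32-bit floats as unsigned integers and applying an XOR mask. The sketch assumes IEEE 754 single precision and assumes that bits are numbered from 1 at the least significant bit, so that the 32nd bit is the sign bit; the text does not state the numbering convention, and the function names and the random choice of the corrupted worker are also assumptions.

```python
import numpy as np

def bitflip(values, bits=(22, 30, 31, 32)):
    """Flip the given bits (1 = least significant) of float32 values."""
    raw = np.array(values, dtype=np.float32, ndmin=1).view(np.uint32)
    mask = np.uint32(sum(1 << (b - 1) for b in bits))
    return (raw ^ mask).view(np.float32)

def bitflip_attack(grads, num_dims, rng):
    """For each of the first num_dims dimensions, corrupt one random worker."""
    attacked = np.array(grads, dtype=np.float32)
    m = attacked.shape[0]
    for j in range(min(num_dims, attacked.shape[1])):
        i = rng.integers(m)  # worker whose j-th value is corrupted
        attacked[i, j] = bitflip(attacked[i, j])[0]
    return attacked

x = np.array([1.0, -0.5, 3.25], dtype=np.float32)
flipped = bitflip(x)

rng = np.random.default_rng(0)
attacked = bitflip_attack(np.ones((20, 5)), num_dims=3, rng=rng)
```

Under this numbering, flipping the sign bit and the two highest exponent bits turns a value of ordinary magnitude into one of very large magnitude with the opposite sign, while every dimension still contains only one Byzantine value.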
Note that for Krum and Multi-Krum, the assumption requires the number of Byzantine vectors $q$ to satisfy $2q + 2 < m$, which means $q \le 8$ in our experiments. However, because each gradient is partially manipulated, all the vectors are Byzantine, which breaks the assumption of the Krum-based algorithms. Furthermore, to compute the distances to the $m - q - 2$ nearest neighbours, $m - q - 2$ must be positive. To test the performance of Krum and Multi-Krum, we set $q = 8$ for these two algorithms so that they can still be executed. Furthermore, we test whether tuning $q$ can make a difference. The results are shown in Figure 8. Whatever $q$ we use, Krum-based algorithms get stuck around bad solutions.
5.5 General Attack with Multiple Servers
We test general Byzantine resilience in this experiment. We evaluate the robust aggregation rules under a more general and realistic type of attack. It is very popular to partition the parameters into disjoint subsets and use multiple server nodes to store and aggregate them (Li et al., 2014a, b; Ho et al., 2013). We assume that the parameters are evenly partitioned and assigned to the server nodes. The attacker picks one single server and manipulates any floating-point number by multiplying it by a large negative value, with a small probability. We call this attack gambler, because the attacker randomly manipulates the values and hopes that in some rounds the assumptions/prerequisites of the robust aggregation rules are broken, which crashes the training. Such an attack requires less global information and can be concentrated on one single server, which makes it more realistic and easier to implement.
In Figure 6 and 7, we evaluate the performance of all the robust aggregation rules under the gambler attack. The number of servers is . For Krum, MultiKrum and MeaMed, the estimated Byzantine number is set as . We also show the performance of averaging without Byzantine values as the benchmark. It is shown that only marginal median MarMed and “mean around median” MeaMed survive under this attack. The convergence is slightly slower than the averaging without Byzantine values, but the gaps are small.
5.6 Discussion
As expected, mean aggregation is not Byzantine resilient. Although medoid is not Byzantine resilient, as proved by Blanchard et al. (2017), it can still make reasonable progress under some attacks such as the Gaussian attack. Krum, Multi-Krum, and GeoMed are classic Byzantine resilient but not dimensional Byzantine resilient. MarMed and MeaMed are dimensional Byzantine resilient. However, under the omniscient attack, MarMed suffers from larger variance, which slows down the convergence.
The gambler attack shows the true advantage of dimensional Byzantine resilience: a higher probability of survival. Under such an attack, the assumptions/prerequisites of MarMed and MeaMed may still get broken. However, their probability of crashing is lower than that of the other algorithms because dimensional Byzantine resilience generalizes classic Byzantine resilience. An interesting observation is that MarMed is slightly better than MeaMed under the gambler attack. That is because the estimate of $q$ is not accurate, which causes some unpredictable behavior for MeaMed. We choose $q = 8$ because it is the maximal value we can take for Krum and Multi-Krum.
MeaMed performs best in almost all the cases. Multi-Krum is also good, except that it is not dimensional Byzantine resilient. The reason why MeaMed and Multi-Krum have better performance is that they utilize the extra information of the number of Byzantine values. Note that MeaMed not only performs as well as or even better than Multi-Krum, but also has lower time complexity.
Marginal median MarMed has the cheapest computation. Its worst case, the omniscient attack, is hard to implement in reality. Thus, for most applications, we suggest MarMed as an easy-to-implement aggregation rule with robust performance, which (importantly) does not require knowledge of the number of Byzantine values.
6 Related Works
There are few papers studying Byzantine resilience for machine learning algorithms. Our work is closely related to Blanchard et al. (2017). Another paper (Chen et al., 2017) proposed grouped geometric median for Byzantine resilience, with strongly convex functions.
Our approach offers the following important advantages over the previous work.

Less prior knowledge required. Neither the geometric median nor the marginal median requires $q$, the number of Byzantine workers, to be given, while Krum needs $q$ to calculate the sum of Euclidean distances to the $m - q - 2$ nearest neighbours. Furthermore, when $q$ is known or well estimated, MeaMed shows better robustness than Krum and Multi-Krum in most cases.

Better support for multiple server nodes. If the entire set of parameters is disjointly partitioned and stored on multiple server nodes, marginal median and “mean around median” need no additional communication, while Krum and geometric median require communication among the server nodes.
7 Conclusion
We investigate the generalized Byzantine resilience of the parameter server architecture. We propose three novel median-based aggregation rules for synchronous SGD. The algorithms have low time complexity and provable convergence to critical points. Our empirical results show good performance in practice.
References
 Abadi et al. (2016) Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, Kudlur, Manjunath, Levenberg, Josh, Monga, Rajat, Moore, Sherry, Murray, Derek Gordon, Steiner, Benoit, Tucker, Paul A., Vasudevan, Vijay, Warden, Pete, Wicke, Martin, Yu, Yuan, and Zhang, Xiaoqiang. Tensorflow: A system for largescale machine learning. In OSDI, 2016.
 Blanchard et al. (2017) Blanchard, Peva, Guerraoui, Rachid, Stainer, Julien, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pp. 118–128, 2017.
 Blum et al. (1973) Blum, Manuel, Floyd, Robert W, Pratt, Vaughan, Rivest, Ronald L, and Tarjan, Robert E. Time bounds for selection. Journal of computer and system sciences, 7(4):448–461, 1973.
 Chen et al. (2015) Chen, Tianqi, Li, Mu, Li, Yutian, Lin, Min, Wang, Naiyan, Wang, Minjie, Xiao, Tianjun, Xu, Bing, Zhang, Chiyuan, and Zhang, Zheng. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
 Chen et al. (2017) Chen, Yudong, Su, Lili, and Xu, Jiaming. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.

 Cohen et al. (2016) Cohen, Michael B, Lee, Yin Tat, Miller, Gary, Pachocki, Jakub, and Sidford, Aaron. Geometric median in nearly linear time. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pp. 9–21. ACM, 2016.
 Dean et al. (2012) Dean, Jeffrey, Corrado, Gregory S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc’Aurelio, Senior, Andrew W., Tucker, Paul A., Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.
 Harinath et al. (2017) Harinath, Depavath, Satyanarayana, P, and Murthy, MV Ramana. A review on security issues and attacks in distributed systems. Journal of Advances in Information Technology, 8(1), 2017.
 Ho et al. (2013) Ho, Qirong, Cipar, James, Cui, Henggang, Lee, Seunghak, Kim, Jin Kyu, Gibbons, Phillip B., Gibson, Garth A., Ganger, Gregory R., and Xing, Eric P. More effective distributed ml via a stale synchronous parallel parameter server. Advances in neural information processing systems, 2013:1223–1231, 2013.
 Kingma & Ba (2014) Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
 Lamport et al. (1982) Lamport, Leslie, Shostak, Robert E., and Pease, Marshall C. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4:382–401, 1982.
 Li et al. (2014a) Li, Mu, Andersen, David G., Park, Jun Woo, Smola, Alexander J., Ahmed, Amr, Josifovski, Vanja, Long, James, Shekita, Eugene J., and Su, BorYiing. Scaling distributed machine learning with the parameter server. In OSDI, 2014a.
 Li et al. (2014b) Li, Mu, Andersen, David G., Smola, Alexander J., and Yu, Kai. Communication efficient distributed machine learning with the parameter server. In NIPS, 2014b.

 Loosli et al. (2007) Loosli, Gaëlle, Canu, Stéphane, and Bottou, Léon. Training invariant support vector machines using selective sampling. Large scale kernel machines, pp. 301–320, 2007.
 Lynch (1996) Lynch, Nancy A. Distributed algorithms. Morgan Kaufmann, 1996.
 McMahan et al. (2017) McMahan, H. Brendan, Moore, Eider, Ramage, Daniel, Hampson, Seth, and y Arcas, Blaise Aguera. Communicationefficient learning of deep networks from decentralized data. In AISTATS, 2017.
 Minsker et al. (2015) Minsker, Stanislav et al. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 2015.

 Mukkamala & Hein (2017) Mukkamala, Mahesh Chandra and Hein, Matthias. Variants of rmsprop and adagrad with logarithmic regret bounds. In ICML, 2017.
 Seide & Agarwal (2016) Seide, Frank and Agarwal, Amit. Cntk: Microsoft’s opensource deeplearning toolkit. In KDD, 2016.
8 Appendix
In the appendix, we introduce several useful lemmas and use them to derive the detailed proofs of the theorems in this paper.
8.1 Dimensional Byzantine Resilience
Theorem 1.
Averaging is not dimensional Byzantine resilient.
Proof.
We demonstrate a counter example. Consider the case where
(3) 
where , . Thus, the resulting aggregation is . The inner product is always negative under the Byzantine attack. Thus, SGD is not expectedly descendant, which means it will not converge to critical points. Note that in this counter example, the number of Byzantine values of each dimension is .
Hence, averaging is not dimensional Byzantine resilient with . ∎
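The effect behind Theorem 1 can also be checked numerically. The sketch below is an illustration in the same spirit and not the counterexample of Equation (3): it corrupts a single value per dimension, each in a different worker, and compares the inner product between the aggregate and the true gradient for averaging and for the marginal median. All names and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 5
g = np.ones(d)  # true gradient of the toy problem
correct = g + 0.1 * rng.normal(size=(m, d))

attacked = correct.copy()
for j in range(d):
    # One Byzantine value per dimension, placed in a different worker each time.
    attacked[j, j] = -1e6

mean_inner = float(attacked.mean(axis=0) @ g)
median_inner = float(np.median(attacked, axis=0) @ g)
```

With only one Byzantine value out of 20 in each dimension, the averaged vector already has a negative inner product with the true gradient, while the marginal median keeps it positive.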
Theorem 2.
Any aggregation rule that outputs one of its input vectors, i.e., $\mathrm{Aggr} \in \{\tilde{v}_i : i \in [m]\}$, is not dimensional Byzantine resilient.
Proof.
We demonstrate a counter example. Consider the case where the th dimension of the th vector is manipulated by the malicious workers (e.g. multiplied by an arbitrarily large negative value), where . Thus, up to 1 value of each dimension is Byzantine. However, no matter which vector is chosen, as long as the aggregation is chosen from , the inner product can be arbitrarily large negative value under the Byzantine attack. Thus, SGD is not expectedly descendant, which means it will not converge to critical points.
Hence, any aggregation rule that outputs is not dimensional Byzantine resilient with . ∎
8.2 Geometric Median
We use the following lemma (Minsker et al., 2015; Cohen et al., 2016) without proof to bound the geometric median.
Lemma 3.
Let denote points in a Hilbert space. Let denote a approximation of their geometric median, i.e., for . For any such that and given , if , then
where , .
Ideally, the geometric median () ignores the second term .
Using the lemma above, we can prove the classic Byzantine resilience of geometric median.
Theorem 3.
Let be any i.i.d. random dimensional vectors s.t. , with and . of are replaced by arbitrary dimensional vectors . If and , where , then the function is classic Byzantine resilient where is defined by .
Proof.
We only need to prove that satisfies the two conditions of classic Byzantine resilience defined in Definition 2.
Condition (i):
Let the sequence be defined as
Let denote the geometric median of . Thus, is the geometric median of . Using Lemma 3, and taking , under the assumption , we obtain
Now, we can bound as follows:
By assumption, , i.e. belongs to a ball centered at with radius . This implies
where .
Condition (ii):
We reuse Lemma 3 by taking , for , and . Thus, we have
Without loss of generality, we denote the sequence as . Thus, there exists a constant such that
Since ’s are i.i.d., we obtain that is bounded above by a linear combination of terms of the form with , which completes the proof of condition (ii). ∎
8.3 Marginal Median
We use the following lemma to bound the onedimensional median.
Lemma 4.
For a sequence composed of $q$ Byzantine values and $m - q$ correct values $\{u_1, \ldots, u_{m-q}\}$, if $q < m/2$ (the correct values dominate the sequence), then the median value $\mu$ of this sequence satisfies $\mu \in [\min_i u_i, \max_i u_i]$, $i \in [m - q]$.
Proof.
If $\mu$ comes from the correct values, then the result is trivial. Thus, we only need to consider the cases where $\mu$ comes from the Byzantine values.
If $m$ is odd, then in the sorted sequence there are $(m - 1)/2$ values on each side of $\mu$. However, the number of correct values satisfies $m - q > (m - 1)/2$. Thus, on each side of $\mu$ there is at least one correct value, which yields the desired result.
Furthermore, if $m$ is even, we can reuse the same technique to prove the same bound. ∎
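The bound of Lemma 4 is easy to check empirically. The following sketch draws random correct values, adds a minority of extreme Byzantine values on either side, and verifies that the median always stays between the smallest and the largest correct value. The sizes, the distribution of the correct values, and the Byzantine magnitudes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, q = 21, 10  # the correct values form a strict majority: q < m / 2

median_always_bounded = True
for _ in range(1000):
    correct = rng.normal(size=m - q)
    # Byzantine values are pushed far to either side of the correct ones.
    byzantine = rng.choice([-1e9, 1e9], size=q)
    med = np.median(np.concatenate([correct, byzantine]))
    bounded = bool(correct.min() <= med <= correct.max())
    median_always_bounded = median_always_bounded and bounded
```

A passing check does not replace the proof, but it shows the mechanism: a minority of Byzantine values cannot occupy the middle position of the sorted sequence.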
Theorem 4.
Let be any i.i.d. random dimensional vectors s.t. , with and . For any dimension , of are replaced by arbitrary values, where is the th dimension of the vector . If and , where , then the function is dimensional Byzantine resilient where is defined by .
Proof.
We only need to prove that satisfies the two conditions of dimensional Byzantine resilience defined in Definition 5.
Condition (i):
Without loss of generality, we assume that , .
For any dimension , let the sequence be defined as
For the th dimension, , the median value .
Thus, we have
Now, we can bound as follows:
By assumption, , i.e. belongs to a ball centered at with radius . This implies
where .
Condition (ii):
By using the equivalence of norms in finite dimension, there exists a constant such that
(equivalence between norm and norm) 
Without loss of generality, we denote the sequence as . Thus, there exists a constant such that
Since ’s are i.i.d., we obtain that is bounded above by a linear combination of terms of the form with , which completes the proof of condition (ii). ∎
8.4 Mean around Median
The following lemma bounds the onedimensional mean around median.
Lemma 5.
Proof.
According to the definition of the mean around median , it is the mean value over the top values in the sequence, nearest to the median . Denote such set of nearest values as . If any satisfies that , then it cannot be in the set of the top nearest values because all the correct values are nearer to (). Since all satisfies , the average over them must also satisfies . ∎
Theorem 5.
Let be any i.i.d. random dimensional vectors s.t. , with and . For any dimension , of are replaced by arbitrary values, where is the th dimension of the vector . If and , where , then the function is dimensional Byzantine resilient where is defined by .
Proof.
We only need to prove that