1. Install Spark
Visit the Apache Spark download page and download a prebuilt package.
The version I downloaded here is: spark-2.1.1-bin-hadoop2.7.tgz
2. Configure Spark
~$ cd
~$ tar xf spark-2.1.1-bin-hadoop2.7.tgz
~$ mv spark-2.1.1-bin-hadoop2.7 spark
~$ vim .bash_profile
Add the following lines to the config file:
export SPARK_HOME=<path where spark was unpacked>
export PATH=$SPARK_HOME/bin:$PATH
Then reload the profile so the changes take effect:
~$ source .bash_profile
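As a quick sanity check that the variables are visible, a minimal Python sketch (the `~/spark` default below is an assumption based on the steps above; substitute your own path):

```python
import os

# Sketch: report whether SPARK_HOME is set and points at a real directory.
# Falling back to ~/spark is an assumption from the unpack step above.
spark_home = os.environ.get("SPARK_HOME", os.path.expanduser("~/spark"))
print("SPARK_HOME =", spark_home)
print("directory exists:", os.path.isdir(spark_home))
```

If the directory does not exist, re-check the path in `.bash_profile` and re-run `source`.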
3. Create a virtual environment
~$ virtualenv sparkenv  # if virtualenv is not installed, run pip install virtualenv first
~$ source sparkenv/bin/activate
~$ pip install PySpark  # at this point pip reports "Requirement already satisfied: PySpark in ./spark/python", but PySpark has not actually been added to Python's PYTHONPATH, so you need to install from source:
~$ cd ~/spark/python
~$ python setup.py install  # completes the installation
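After the source install, you can verify from inside the activated sparkenv virtualenv that the package is actually importable; a small sketch:

```python
# Sketch: confirm the source install put pyspark on sys.path.
# Run this inside the activated sparkenv virtualenv.
try:
    import pyspark
    print("pyspark {} is importable".format(pyspark.__version__))
except ImportError:
    print("pyspark is still missing; re-run the setup.py install step")
```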
4. Running the example in PyCharm
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = ''
__author__ = ''
__mtime__ = '26/06/2017'
"""
import time

from pyspark import SparkContext

sc = SparkContext('local', 'pyspark')


def isprime(n):
    """Check if integer n is a prime."""
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up to the square root of n,
    # stepping over odd numbers only
    for x in range(3, int(n ** 0.5) + 1, 2):
        if n % x == 0:
            return False
    return True


start = time.time()
nums = sc.parallelize(xrange(1000000))  # use range() instead on Python 3
result = nums.filter(isprime).count()
end = time.time()
print("primes total: {}, cost: {}s".format(result, end - start))
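If you want to sanity-check `isprime` without a running Spark context, the same filter-and-count can be mirrored in plain Python on a smaller range; a minimal sketch:

```python
def isprime(n):
    """Check if integer n is a prime (same logic as the Spark example)."""
    n = abs(int(n))
    if n < 2:
        return False
    if n == 2:
        return True
    if not n & 1:
        return False
    for x in range(3, int(n ** 0.5) + 1, 2):
        if n % x == 0:
            return False
    return True


# Count primes below 1000 without Spark; there are 168 of them.
result = sum(1 for n in range(1000) if isprime(n))
print("primes total: {}".format(result))  # primes total: 168
```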
Running it prints the total count of primes below 1,000,000 and the elapsed time.
5. More examples
https://github.com/fdrong/sparkdemo
6. Reference links:
https://github.com/apache/spark/tree/master
http://blog.jobbole.com/86232/