1. Install Spark
Visit the Apache Spark download page and download a prebuilt package.
The version I downloaded here is: spark-2.1.1-bin-hadoop2.7.tgz
2. Configure Spark
~$ cd
~$ tar xf spark-2.1.1-bin-hadoop2.7.tgz
~$ mv spark-2.1.1-bin-hadoop2.7 spark
~$ vim .bash_profile
Add the following lines to the config file:
export SPARK_HOME=<path where spark was unpacked>
export PATH=$SPARK_HOME/bin:$PATH
Then reload the profile so the changes take effect:
~$ source .bash_profile
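As a quick sanity check that the variables are visible, a minimal Python sketch (the `~/spark` default below is an assumption based on the steps above; substitute your own path):

```python
import os

# Sketch: report whether SPARK_HOME is set and points at a real directory.
# Falling back to ~/spark is an assumption from the unpack step above.
spark_home = os.environ.get("SPARK_HOME", os.path.expanduser("~/spark"))
print("SPARK_HOME =", spark_home)
print("directory exists:", os.path.isdir(spark_home))
```

If the directory does not exist, re-check the path in `.bash_profile` and re-run `source`.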
3. Create a virtual environment
~$ virtualenv sparkenv  # if virtualenv is not installed, run pip install virtualenv first
~$ source sparkenv/bin/activate
~$ pip install PySpark  # at this point pip reports "Requirement already satisfied: PySpark in ./spark/python", but PySpark has not actually been added to Python's PYTHONPATH, so you need to install from source:
~$ cd ~/spark/python
~$ python setup.py install  # completes the installation
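After the source install, you can verify from inside the activated sparkenv virtualenv that the package is actually importable; a small sketch:

```python
# Sketch: confirm the source install put pyspark on sys.path.
# Run this inside the activated sparkenv virtualenv.
try:
    import pyspark
    print("pyspark {} is importable".format(pyspark.__version__))
except ImportError:
    print("pyspark is still missing; re-run the setup.py install step")
```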
4. Running the example in PyCharm
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = ''
__author__ = ''
__mtime__ = '26/06/2017'
"""
import time

from pyspark import SparkContext

sc = SparkContext('local', 'pyspark')


def isprime(n):
    """Check if integer n is a prime."""
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up to the square root of n,
    # stepping over odd numbers only
    for x in range(3, int(n ** 0.5) + 1, 2):
        if n % x == 0:
            return False
    return True


start = time.time()
nums = sc.parallelize(xrange(1000000))  # use range() instead on Python 3
result = nums.filter(isprime).count()
end = time.time()
print("primes total: {}, cost: {}s".format(result, end - start))
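If you want to sanity-check `isprime` without a running Spark context, the same filter-and-count can be mirrored in plain Python on a smaller range; a minimal sketch:

```python
def isprime(n):
    """Check if integer n is a prime (same logic as the Spark example)."""
    n = abs(int(n))
    if n < 2:
        return False
    if n == 2:
        return True
    if not n & 1:
        return False
    for x in range(3, int(n ** 0.5) + 1, 2):
        if n % x == 0:
            return False
    return True


# Count primes below 1000 without Spark; there are 168 of them.
result = sum(1 for n in range(1000) if isprime(n))
print("primes total: {}".format(result))  # primes total: 168
```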
Running it prints the total count of primes below 1,000,000 and the elapsed time.
5. More examples
https://github.com/fdrong/sparkdemo
6. Reference links:
https://github.com/apache/spark/tree/master
http://blog.jobbole.com/86232/