Spark Environment Setup - Windows Version
Repost: Spark Environment Setup - Windows Version. I've been reading up on the Scala language lately and came across Spark, so yesterday afternoon at the office I set up a Spark environment. I ran into a problem during installation that I couldn't find a solution for online, so I tracked down the cause myself. These are my notes.
1. The Spark download is available on the official site at http://spark.incubator.apache.org/downloads.html ; this install uses the latest release as of this writing, 0.9.
2. After downloading, extract the archive to a path of your choice, e.g. d:/programs.
3. Install Scala and set the SCALA_HOME path; see the official Scala site for installation instructions.
4. Following Spark's official installation guide, run the following command from the extracted directory:
sbt/sbt package
and the build completes.
However, this applies to Linux and OS X; running the command on Windows fails with the error:
not a valid command
The cause is that the sbt script bundled with Spark cannot run on Windows. Download a Windows build of sbt (http://www.scala-sbt.org/), copy its files into the sbt directory under the Spark folder, and rerun the command; the build then succeeds. A sketch of the full Windows session follows.
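To put the Windows steps together, here is a hypothetical command-prompt session. The exact paths and the names of the sbt launcher files are my assumptions, not from the original post; adjust them to your own install locations.

rem hypothetical example paths; point these at your own install locations
set SCALA_HOME=d:\programs\scala
cd /d d:\programs\spark-0.9.0-incubating
rem after copying the Windows sbt files (e.g. sbt.bat and its launcher jar) into the sbt directory:
sbt\sbt package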
Try out spark-shell:
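The original post doesn't show the launch command; assuming the standard Spark 0.9 layout, which ships .cmd wrappers for Windows, the shell would presumably be started from the Spark directory with:

bin\spark-shell.cmd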
scala> val textFile = sc.textFile("README.md")
14/02/14 16:38:12 INFO MemoryStore: ensureFreeSpace(35480) called with curMem=177376, maxMem=308713881
14/02/14 16:38:12 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 34.6 KB, free 294.2 MB)

textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[16] at textFile at <console>:12

scala> textFile.count
14/02/14 16:38:14 INFO FileInputFormat: Total input paths to process : 1
14/02/14 16:38:14 INFO SparkContext: Starting job: count at <console>:15
14/02/14 16:38:14 INFO DAGScheduler: Got job 7 (count at <console>:15) with 1 output partitions (allowLocal=false)
14/02/14 16:38:14 INFO DAGScheduler: Final stage: Stage 7 (count at <console>:15)
14/02/14 16:38:14 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:14 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:14 INFO DAGScheduler: Submitting Stage 7 (MappedRDD[16] at textFile at <console>:12), which has no missing parents
14/02/14 16:38:14 INFO DAGScheduler: Submitting 1 missing tasks from Stage 7 (MappedRDD[16] at textFile at <console>:12)
14/02/14 16:38:14 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
14/02/14 16:38:14 INFO TaskSetManager: Starting task 7.0:0 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/02/14 16:38:14 INFO TaskSetManager: Serialized task 7.0:0 as 1560 bytes in 1 ms
14/02/14 16:38:14 INFO Executor: Running task ID 5
14/02/14 16:38:14 INFO BlockManager: Found block broadcast_5 locally
14/02/14 16:38:14 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:14 INFO Executor: Serialized size of result for 5 is 563
14/02/14 16:38:14 INFO Executor: Sending result for 5 directly to driver
14/02/14 16:38:14 INFO Executor: Finished task ID 5
14/02/14 16:38:14 INFO TaskSetManager: Finished TID 5 in 6 ms on localhost (progress: 0/1)
14/02/14 16:38:14 INFO DAGScheduler: Completed ResultTask(7, 0)
14/02/14 16:38:14 INFO TaskSchedulerImpl: Remove TaskSet 7.0 from pool
14/02/14 16:38:14 INFO DAGScheduler: Stage 7 (count at <console>:15) finished in 0.009 s
14/02/14 16:38:14 INFO SparkContext: Job finished: count at <console>:15, took 0.012329265 s
res10: Long = 119

scala> textFile.first
14/02/14 16:38:24 INFO SparkContext: Starting job: first at <console>:15
14/02/14 16:38:24 INFO DAGScheduler: Got job 8 (first at <console>:15) with 1 output partitions (allowLocal=true)
14/02/14 16:38:24 INFO DAGScheduler: Final stage: Stage 8 (first at <console>:15)
14/02/14 16:38:24 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:24 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:24 INFO DAGScheduler: Computing the requested partition locally
14/02/14 16:38:24 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:24 INFO SparkContext: Job finished: first at <console>:15, took 0.002671379 s
res11: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = FilteredRDD[17] at filter at <console>:14

scala> textFile.filter(line=> line.contains("spark")).count
14/02/14 16:38:37 INFO SparkContext: Starting job: count at <console>:15
14/02/14 16:38:37 INFO DAGScheduler: Got job 9 (count at <console>:15) with 1 output partitions (allowLocal=false)
14/02/14 16:38:37 INFO DAGScheduler: Final stage: Stage 9 (count at <console>:15)
14/02/14 16:38:37 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:37 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:37 INFO DAGScheduler: Submitting Stage 9 (FilteredRDD[18] at filter at <console>:15), which has no missing parents
14/02/14 16:38:37 INFO DAGScheduler: Submitting 1 missing tasks from Stage 9 (FilteredRDD[18] at filter at <console>:15)
14/02/14 16:38:37 INFO TaskSchedulerImpl: Adding task set 9.0 with 1 tasks
14/02/14 16:38:37 INFO TaskSetManager: Starting task 9.0:0 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/02/14 16:38:37 INFO TaskSetManager: Serialized task 9.0:0 as 1642 bytes in 0 ms
14/02/14 16:38:37 INFO Executor: Running task ID 6
14/02/14 16:38:37 INFO BlockManager: Found block broadcast_5 locally
14/02/14 16:38:37 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:37 INFO Executor: Serialized size of result for 6 is 563
14/02/14 16:38:37 INFO Executor: Sending result for 6 directly to driver
14/02/14 16:38:37 INFO Executor: Finished task ID 6
14/02/14 16:38:37 INFO TaskSetManager: Finished TID 6 in 10 ms on localhost (progress: 0/1)
14/02/14 16:38:37 INFO DAGScheduler: Completed ResultTask(9, 0)
14/02/14 16:38:37 INFO TaskSchedulerImpl: Remove TaskSet 9.0 from pool
14/02/14 16:38:37 INFO DAGScheduler: Stage 9 (count at <console>:15) finished in 0.010 s
14/02/14 16:38:37 INFO SparkContext: Job finished: count at <console>:15, took 0.020335125 s
res12: Long = 7
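As a follow-up to the transcript, here is a short sketch of what the Spark 0.9 quick start suggests trying next in the same session. This part is my addition, not from the original post; the calls below are standard RDD API of that era.

scala> linesWithSpark.cache()   // ask Spark to keep the filtered RDD in memory
scala> linesWithSpark.count     // first action scans README.md and populates the cache
scala> linesWithSpark.count     // a repeated action is served from the in-memory copy
scala> textFile.map(line => line.split(" ").length).reduce((a, b) => if (a > b) a else b)   // word count of the longest line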
In addition, the Spark site provides four introductory screencast videos, but they are hosted on YouTube, which is blocked in mainland China, so I've uploaded them to Tudou for anyone who wants to watch.
Spark Screencast 1 – Setting up a Spark environment
Spark Screencast 2 – Overview of the Spark documentation
Spark Screencast 3 – Transformations and caching
Spark Screencast 4 – A standalone job in Scala