Skip to content

Instantly share code, notes, and snippets.

@tegansnyder
Created January 6, 2017 15:59
Show Gist options
  • Select an option

  • Save tegansnyder/3abdc68679a259f868e026a7a619dcfd to your computer and use it in GitHub Desktop.

Select an option

Save tegansnyder/3abdc68679a259f868e026a7a619dcfd to your computer and use it in GitHub Desktop.

Revisions

  1. tegansnyder created this gist Jan 6, 2017.
    137 changes: 137 additions & 0 deletions pyspark error.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,137 @@
    I'm recieving a strange error on a new install of Spark. I have setup a small 3 node spark cluster on top of an existing hadoop instance. The error I get is the same for any command I try to run on pyspark shell I get the following error:

    ```bash
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/opt/spark/python/pyspark/rdd.py", line 1041, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
    File "/opt/spark/python/pyspark/rdd.py", line 1032, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    File "/opt/spark/python/pyspark/rdd.py", line 906, in fold
    vals = self.mapPartitions(func).collect()
    File "/opt/spark/python/pyspark/rdd.py", line 809, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    File "/opt/spark/python/pyspark/rdd.py", line 2439, in _jrdd
    self._jrdd_deserializer, profiler)
    File "/opt/spark/python/pyspark/rdd.py", line 2374, in _wrap_function
    sc.pythonVer, broadcast_vars, sc._javaAccumulator)
    File "/usr/local/lib/python3.5/site-packages/py4j/java_gateway.py", line 1414, in __call__
    answer, self._gateway_client, None, self._fqn)
    File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
    File "/usr/local/lib/python3.5/site-packages/py4j/protocol.py", line 324, in get_return_value
    format(target_id, ".", name, value))
    py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonFunction. Trace:
    py4j.Py4JException: Constructor org.apache.spark.api.python.PythonFunction([class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.String, class java.lang.String, class java.util.ArrayList, class org.apache.spark.api.python.PythonAccumulatorV2]) does not exist
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
    at py4j.Gateway.invoke(Gateway.java:235)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
    ```
    It appears the pyspark is unable to find the class `org.apache.spark.api.python.PythonFunction`. If I'm reading the code correctly pyspark uses py4j to connect to an existing JVM, in this case I'm guessing there is a Scala file it is trying to gain access to, but it fails. Any ideas?
    In an effort to understand what calls are being made by py4j to java I manually added some debugging calls to:
    `py4j/java_gateway.py`
    ```python
    command = proto.CONSTRUCTOR_COMMAND_NAME +\
    self._command_header +\
    args_command +\
    proto.END_COMMAND_PART

    print(" ")
    print("proto.CONSTRUCTOR_COMMAND_NAME")
    print("%s", proto.CONSTRUCTOR_COMMAND_NAME)
    print("self._command_header")
    print("%s", self._command_header)
    print("args_command")
    print("%s", args_command)
    print("proto.END_COMMAND_PART")
    print("%s", proto.END_COMMAND_PART)
    print(" ")
    print(" ")
    ```
    When I run pyspark shell after adding the debug prints above this is the ouput I get on a simple command:
    ```bash
    Welcome to
    ____ __
    / __/__ ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
    /__ / .__/\_,_/_/ /_/\_\ version 2.0.2
    /_/
    Using Python version 3.5.2 (default, Dec 5 2016 08:51:55)
    SparkSession available as 'spark'.
    >>> sc.parallelize(range(1000)).count()
    proto.CONSTRUCTOR_COMMAND_NAME
    %s i
    self._command_header
    %s java.util.HashMap
    args_command
    %s
    proto.END_COMMAND_PART
    %s e
    proto.CONSTRUCTOR_COMMAND_NAME
    %s i
    self._command_header
    %s java.util.ArrayList
    args_command
    %s
    proto.END_COMMAND_PART
    %s e
    proto.CONSTRUCTOR_COMMAND_NAME
    %s i
    self._command_header
    %s java.util.ArrayList
    args_command
    %s
    proto.END_COMMAND_PART
    %s e
    proto.CONSTRUCTOR_COMMAND_NAME
    %s i
    self._command_header
    %s org.apache.spark.api.python.PythonFunction
    args_command
    %s jgAIoY3B5c3BhcmsuY2xvdWRwaWNrbGUKX2ZpbGxfZnVuY3Rpb24KcQAoY3B5c3BhcmsuY2xvdWRwaWNrbGUKX21ha2Vfc2tlbF9mdW5jCnEBY3B5c3BhcmsuY2xvdWRwaWNrbGUKX2J1aWx0aW5fdHlwZQpxAlgIAAAAQ29kZVR5cGVxA4VxBFJxBShLAksASwJLBUsTY19jb2RlY3MKZW5jb2RlCnEGWBoAAADCiAAAfAAAwogBAHwAAHwBAMKDAgDCgwIAU3EHWAYAAABsYXRpbjFxCIZxCVJxCk6FcQspWAUAAABzcGxpdHEMWAgAAABpdGVyYXRvcnENhnEOWCAAAAAvb3B0L3NwYXJrL3B5dGhvbi9weXNwYXJrL3JkZC5weXEPWA0AAABwaXBlbGluZV9mdW5jcRBNZgloBlgCAAAAAAFxEWgIhnESUnETWAQAAABmdW5jcRRYCQAAAHByZXZfZnVuY3EVhnEWKXRxF1JxGF1xGShoAChoAWgFKEsCSwBLAksCSxNoBlgMAAAAwogAAHwBAMKDAQBTcRpoCIZxG1JxHE6FcR0pWAEAAABzcR5oDYZxH2gPaBRNWQFoBlgCAAAAAAFxIGgIhnEhUnEiWAEAAABmcSOFcSQpdHElUnEmXXEnaAAoaAFoBShLAUsASwNLBEszaAZYMgAAAMKIAQB9AQB4HQB8AABEXRUAfQIAwogAAHwBAHwCAMKDAgB9AQBxDQBXfAEAVgFkAABTcShoCIZxKVJxKk6FcSspaA1YAwAAAGFjY3EsWAMAAABvYmpxLYdxLmgPaBRNggNoBlgIAAAAAAEGAQ0BEwFxL2gIhnEwUnExWAIAAABvcHEyWAkAAAB6ZXJvVmFsdWVxM4ZxNCl0cTVScTZdcTcoY19vcGVyYXRvcgphZGQKcThLAGV9cTmHcTpScTt9cTxOfXE9WAsAAABweXNwYXJrLnJkZHE+dFJhaDmHcT9ScUB9cUFOfXFCaD50UmgAKGgBaBhdcUMoaAAoaAFoJl1xRGgAKGgBaAUoSwFLAEsBSwJLU2gGWA4AAAB0AAB8AADCgwEAZwEAU3FFaAiGcUZScUdOhXFIWAMAAABzdW1xSYVxSlgBAAAAeHFLhXFMaA9YCAAAADxsYW1iZGE+cU1NCARjX19idWlsdGluX18KYnl0ZXMKcU4pUnFPKSl0cVBScVFdcVJoOYdxU1JxVH1xVU59cVZoPnRSYWg5h3FXUnFYfXFZTn1xWmg+dFJoAChoAWgYXXFbKGgAKGgBaCZdcVxoAChoAWgFKEsBSwBLAUsDS1NoBlgdAAAAdAAAZAEAZAIAwoQAAHwAAETCgwEAwoMBAGcBAFNxXWgIhnFeUnFfTmgFKEsBSwBLAksCS3NoBlgVAAAAfAAAXQsAfQEAZAAAVgFxAwBkAQBTcWBoCIZxYVJxYksBToZxYylYAgAAAC4wcWRYAQAAAF9xZYZxZmgPWAkAAAA8Z2VuZXhwcj5xZ00RBGgGWAIAAAAGAHFoaAiGcWlScWopKXRxa1JxbFguAAAAUkRELmNvdW50Ljxsb2NhbHM+LjxsYW1iZGE+Ljxsb2NhbHM+LjxnZW5leHByPnFth3FuaEmFcW9YAQAAAGlxcIVxcWgPaE1NEQRoTykpdHFyUnFzXXF0aDmHcXVScXZ9cXdOfXF4aD50UmFoOYdxeVJxen1xe059cXxoPnRSaAAoaAFoBShLAksASwJLBUsTaAZYJgAAAHQAAMKIAAB8AADCgwEAwogAAHwAAGQBABfCgwEAwogBAMKDAwBTcX1oCIZxflJxf05LAYZxgFgGAAAAeHJhbmdlcYGFcYJoDGgNhnGDWCQAAAAvb3B0L3NwYXJrL3B5dGhvbi9weXNwYXJrL2NvbnRleHQucHlxhGgjTcIBaAZYAgAAAAABcYVoCIZxhlJxh1gIAAAAZ2V0U3RhcnRxiFgEAAAAc3RlcHGJhnGKKXRxi1JxjF1xjShoAChoAWgFKEsBSwBLAUsESxNoBlgfAAAAwogCAHQAAHwAAMKIAQAUwogAABvCgwEAwogDABQXU3GOaAiGcY9ScZBOhXGRWAMAAABpbnRxkoVxk2gMhXGUaIRoiE2/AWgGWAIAAAAAAXGVaAiGcZZScZcoWAkAAABudW1TbGljZXNxmFgEAAAAc2l6ZXGZWAYAAABzdGFydDBxmmiJdHGbKXRxnFJxnV1xnihLAk3oA0sASwFlfXGfh3GgUnGhfXGiTn1xo1gPAAAAcHlzcGFyay5jb250ZXh0caR0UksBZWifh3GlUnGmfXGnaIFjX19idWlsdGluX18KeHJhbmdlCnGoc059calopHRSZWg5h3GqUnGrfXGsTn1xrWg+dFJlaDmHca5Sca99cbBOfXGxaD50UmVoOYdxslJxs31xtE59cbVoPnRSTmNweXNwYXJrLnNlcmlhbGl6ZXJzCkJhdGNoZWRTZXJpYWxpemVyCnG2KYFxt31xuChYCQAAAGJhdGNoU2l6ZXG5SwFYCgAAAHNlcmlhbGl6ZXJxumNweXNwYXJrLnNlcmlhbGl6ZXJzClBpY2tsZVNlcmlhbGl6ZXIKcbspgXG8fXG9WBMAAABfb25seV93cml0ZV9zdHJpbmdzcb6Jc2J1YmNweXNwYXJrLnNlcmlhbGl6ZXJzCkF1dG9CYXRjaGVkU2VyaWFsaXplcgpxvymBccB9ccEoaLlLAGi6aLxYCAAAAGJlc3RTaXplccJKAAABAHVidHHDLg==
    ro24
    ro25
    s/usr/local/bin/python3.5
    s3.5
    ro26
    ro14
    proto.END_COMMAND_PART
    %s e
    # ERROR
    # Then the error from above prints here...
    ```
    Any ideas?