A Simple Dataflow Pipeline (Python)
Activate Google Cloud Shell
Google Cloud Shell is a virtual machine loaded with development tools. It provides command-line access to your Google Cloud resources.
-
In the Cloud console, on the top right toolbar, click the Open Cloud Shell button.
-
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.
You can list the active account name with this command:
CODE...
You can list the project ID with this command:
CODE......
Open the SSH terminal and connect to the training VM
You will be running all code from a curated training VM.
-
In the console, on the Navigation menu, click Compute Engine > VM instances.
-
Locate the line with the instance called training-vm.
-
On the far right, under Connect, click on SSH to open a terminal window.
-
In this lab, you will enter CLI commands on the training-vm.
Download the code repository for use in this lab. In the training-vm SSH terminal, enter the following:
CODE......
Follow these instructions to create a bucket.
-
In the Console, on the Navigation menu, click Cloud Storage > Buckets.
-
Click + Create.
-
Specify the following, and leave the remaining settings as their defaults:
| Property | Value (type value or select option as specified) |
|---|---|
| Name | <Project ID> |
| Location type | Multi-region |
-
Click Create.
-
If you get the Public access will be prevented prompt, select Enforce public access prevention on this bucket and click Confirm.
Record the name of your bucket. You will need it in subsequent tasks.
- In the training-vm SSH terminal enter the following to create an environment variable named "BUCKET" and verify that it exists with the echo command:
BUCKET="project_place_holder_text"
echo $BUCKET
You can use $BUCKET in terminal commands, and if you need to enter the bucket name in a text field in the console, you can quickly retrieve it with echo $BUCKET.
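As an illustration only (not one of the lab's commands), the same bucket name is what Dataflow pipelines combine into Cloud Storage URIs. A minimal Python sketch, using a made-up bucket name; the path names below are assumptions, not the lab's actual paths:

```python
# Hypothetical bucket name; in this lab the bucket is named after your Project ID.
bucket = "my-project-id"

# Cloud Storage URIs used by a Dataflow job are built from the bucket name.
staging_location = "gs://{}/staging/".format(bucket)
output_prefix = "gs://{}/javahelp/output".format(bucket)

print(staging_location)
print(output_prefix)
```

This is why recording the bucket name matters: every later task derives gs:// paths from it.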
Task 3. Pipeline filtering
The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.
-
Return to the training-vm SSH terminal and navigate to the directory /training-data-analyst/courses/data_analysis/lab2/python and view the file grep.py.
-
View the file with Nano. Do not make any changes to the code:
CODE...
- Press CTRL+X to exit Nano.
Can you answer these questions about the file grep.py?
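At its core, a grep-style pipeline filters input lines for a search term. A minimal pure-Python sketch of that filtering step (this is not the actual Beam pipeline in grep.py, and the search term "import" is an assumption; check the file for the real term):

```python
# Plain-Python sketch of the filtering a Dataflow "grep" pipeline performs.
# In Beam this would be wrapped in a FlatMap/ParDo; here it is a bare generator.
search_term = "import"  # assumed search term; see grep.py for the actual value

def my_grep(line, term):
    """Yield the line only if it starts with the search term (grep-like)."""
    if line.startswith(term):
        yield line

lines = [
    "import java.util.List;",
    "public class Foo {",
    "import java.io.File;",
]

matches = [m for line in lines for m in my_grep(line, search_term)]
print(matches)
```

The Beam version applies the same logic element-by-element across the input files rather than over an in-memory list.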
Task 4. Execute the pipeline locally
- In the training-vm SSH terminal, locally execute grep.py:
CODE...
Note: Ignore any warning that appears.
The output file will be output.txt. If the output is large enough, it will be sharded into parts with names like output-00000-of-00001.
- Locate the correct file by examining the file's time:
CODE...
-
Examine the output file(s).
-
You can replace "-*" below with the appropriate suffix:
CODE...
Does the output seem logical?
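One quick sanity check, assuming the pipeline filters for lines starting with "import" (an assumption; match this to the term in grep.py): every line in the output should begin with that term. A hypothetical check with made-up output lines:

```python
# Hypothetical sanity check on the pipeline's local output.
# The sample lines and the search term "import" are illustrative assumptions.
output_lines = [
    "import java.util.ArrayList;",
    "import com.google.api.services.bigquery.Bigquery;",
]

ok = all(line.startswith("import") for line in output_lines)
print("all lines match:", ok)
```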
Task 5. Execute the pipeline on the cloud
- Copy some Java files to the cloud. In the training-vm SSH terminal, enter the following command:
CODE...
- Using Nano, edit the Dataflow pipeline file grepc.py:
nano grepc.py
- Replace PROJECT, BUCKET, and REGION with the values listed below. Please retain the outside single quotes.
CODE...
CODE...
CODE...
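For reference, after editing, the constants at the top of grepc.py end up looking something like the following. The values here are placeholders, not the ones assigned to your lab; substitute your own:

```python
# Placeholder values only; replace with your own project ID, bucket name,
# and region, keeping the single quotes as the instructions require.
PROJECT = 'my-project-id'
BUCKET = 'my-bucket-name'
REGION = 'us-central1'
```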
Save the file and close Nano by pressing CTRL+X, then typing Y, and pressing Enter.
- Submit the Dataflow job to the cloud:
CODE...
Because this is such a small job, running on the cloud will take significantly longer than running it locally (on the order of 7-10 minutes).
-
Return to the browser tab for the console.
-
On the Navigation menu, click Dataflow and click on your job to monitor progress.
-
Wait for the Job status to be Succeeded.
-
Examine the output in the Cloud Storage bucket.
-
On the Navigation menu, click Cloud Storage > Buckets and click on your bucket.
-
Click the javahelp directory.
This job generates the file output.txt. If the file is large enough, it will be sharded into multiple parts with names like: output-0000x-of-000y. You can identify the most recent file by name or by the Last modified field.
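The shard names follow a simple numeric pattern. A sketch of how those names are formed (the shard count here is made up for illustration):

```python
# Dataflow's default text sink names shards as <prefix>-SSSSS-of-NNNNN,
# where SSSSS is the zero-based shard index and NNNNN the total shard count.
# The shard count below is illustrative only.
num_shards = 3
names = ["output-{:05d}-of-{:05d}".format(i, num_shards)
         for i in range(num_shards)]
print(names)
```

Knowing the pattern makes it easy to tell whether you are looking at one shard of many or the complete output.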
- Click on the file to view it.
Alternatively, you can download the file via the training-vm SSH terminal and view it:
CODE......