Kafka Spark Streaming | Kafka Tutorial | Kafka Training | Intellipaat

Today, most of the Fortune 500 companies use Kafka as the central platform for managing their organization's streaming data. Organizations such as LinkedIn, Microsoft, and Netflix each process more than one trillion messages a day over Kafka. Companies use Kafka for creating new products, responding to customers, and making business decisions in real time. It is a fast, scalable, and fault-tolerant messaging system used for performing real-time analytics. So in today's session we'll learn Apache Kafka comprehensively. Before going ahead, do subscribe to our YouTube channel and press the bell icon so that you never miss our upcoming content. Also, if you want to become a certified professional, I would suggest the Apache Kafka certification and training provided by Intellipaat.

Let's have a glance at the agenda: we'll start by understanding how Kafka came into existence, then see what Apache Kafka is, look at its architecture, and see how to set up a Kafka cluster. After that we'll understand what Spark is and look at its features, then see the different components of Spark. Finally, we'll have a demo to first integrate Spark Streaming with Apache Kafka and then integrate Flume with Apache Kafka. You can put all your queries in the comments section and we would love to help you out there. So without any delay, let's start off with the class.
Let's understand the need for Kafka. Present-day industry generates a lot of real-time data which needs to be processed in real time, so let me explain this with the help of this diagram. These days organizations have multiple servers at the front end and back end, like web or application servers for hosting websites and applications. All of these servers need to communicate with the database server, and thus there are multiple data pipelines connecting each of them to the database servers. You can see that the data pipelines get more complex as the number of systems increases, and adding a new system or server requires more data pipelines, which makes the data flow complicated. Managing these data pipelines also becomes very difficult, because each data pipeline has its own set of requirements, so adding or removing pipelines is hard in such a setup.

This is where Kafka comes in with the right answers to the above problems. Kafka basically decouples the pipelines: as you can see, there is a cluster in the center, there are producers which send messages to the cluster, and there are consumers which fetch messages from it. This way Apache Kafka reduces the complexity of data pipelines and makes communication between systems simpler and more manageable. It also makes it easy to establish remote communication and send data across the network: you can establish asynchronous communication and send messages, and Kafka ensures that the communication is extremely reliable.
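The decoupling idea above can be sketched with a toy in-memory broker. This is a deliberate simplification, not Kafka's actual implementation: producers append to a named topic, and any number of consumers read from it independently, so neither side needs to know about the other.

```python
from collections import defaultdict

class ToyBroker:
    """A toy in-memory broker: topics are append-only lists of messages."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):      # producer side
        self.topics[topic].append(message)

    def fetch(self, topic, offset=0):       # consumer side: read from an offset
        return self.topics[topic][offset:]

broker = ToyBroker()
broker.publish("orders", "order-1")
broker.publish("orders", "order-2")

# Two independent consumers read the same topic without the producer
# knowing about either of them -- the pipelines are decoupled.
print(broker.fetch("orders"))        # ['order-1', 'order-2']
print(broker.fetch("orders", 1))     # ['order-2']
```

Adding a new consumer here is just another `fetch` call; no new point-to-point pipeline is needed, which is exactly the complexity reduction described above.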
So now let's understand Apache Kafka in detail. Apache Kafka is an open-source distributed publish/subscribe messaging system which manages and maintains real-time streams of data from different applications, websites, and so on. Kafka originated at LinkedIn and became an open-source Apache project in 2011; then in 2012 it became a first-class Apache project. Kafka is written in Scala and Java, and it is fast, scalable, durable, fault tolerant, and distributed by design. Just a quick info: if you want to become a certified professional in Apache Kafka, Intellipaat offers an Apache Kafka certification and training course; for further details you can check the description below. Now let's continue with the session.

Let's understand the various components of Kafka. First, we have brokers, which are basically the servers that manage and mediate the conversation between two different systems; brokers are responsible for the delivery of messages to the right party. Then we have messages, which can be of any format, such as string, JSON, or Avro. After that we have topics: in Kafka, all messages are maintained in what we call topics, and messages are stored, published, and organized in these Kafka topics. Then we have clusters: in Apache Kafka, more than one broker, that is, a set of servers, is collectively known as a Kafka cluster. Now let's look at producers: producers are the processes that publish data or messages to one or more topics, so they are basically the source of the data stream in Kafka. Then we have consumers: consumers are the processes that read and process data from topics by subscribing to one or more topics in the Kafka cluster. And finally we have partitions: every broker holds a few partitions, and each partition can be either a leader or a replica for a topic. All writes and reads to a topic go through the leader, and the leader is responsible for updating replicas with new data; if the leader fails, a replica takes over as the new leader.
So now it's time to understand Apache Kafka's architecture. Producers send messages to a topic at regular intervals, and brokers store the messages in the partitions configured for that particular topic. If a producer sends two messages and there exist two partitions, Kafka will store one message in the first partition and the second message in the second partition. A consumer always subscribes to a specific topic; on receiving a message, the consumer sends an acknowledgment to the broker, and on receiving the acknowledgment, the offset is changed to the new value and updated in ZooKeeper.
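The acknowledge-then-advance-offset cycle just described can be sketched like this (a simplification of the real protocol, where offsets are committed to the cluster rather than held in the consumer object):

```python
class ToyConsumer:
    """Tracks how far into a partition's log this consumer has read."""
    def __init__(self, log):
        self.log = log      # the partition: an append-only list of messages
        self.offset = 0     # position of the next message to read

    def poll(self):
        """Read the next message, acknowledge it, and advance the offset."""
        if self.offset >= len(self.log):
            return None     # caught up with the log
        message = self.log[self.offset]
        self.offset += 1    # "acknowledged": the offset moves to the new value
        return message

log = ["m1", "m2", "m3"]
consumer = ToyConsumer(log)
print(consumer.poll())   # m1
print(consumer.poll())   # m2
print(consumer.offset)   # 2 -- two messages consumed so far
```

Because consumption is just an offset into an unchanged log, a consumer can resume after a crash, or rewind, simply by restoring or resetting the offset.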
Now let's go ahead and check out a few interesting facts about the Kafka producer. Producers send records, or messages, to topics. A producer also selects which partition a message is to be sent to within each topic; for example, a producer could implement a priority system by sending records to certain partitions depending on the priority of the record. By default, producers send records to a partition based on the record's key, and they don't wait for acknowledgments from the broker, sending messages as fast as the broker can handle.

Now let's look at brokers. A cluster typically consists of multiple brokers to maintain load balance. A broker, on receiving messages from a producer, assigns offsets to them and commits the messages to storage on disk, and it serves consumers by responding to fetch requests for partitions. One broker instance can handle thousands of reads and writes per second and terabytes of messages. If backup is a point of concern for you, let me tell you that backups of topic partitions are present on multiple brokers, so if a broker goes down, one of the brokers containing the backup partitions will be elected as the leader for the respective partitions.

Now let's look at messages. Messages in Kafka are categorized into topics, and these topics are broken down into a number of partitions. Messages can be read in order from beginning to end, or we can skip or rewind to any point in a partition by providing an offset value; the offset value is nothing but the sequential ID assigned to the messages. These partitions provide redundancy and scalability: partitions can be hosted on different servers, meaning that a single topic can be scaled horizontally across multiple servers, thus enhancing performance. The figure here shows a topic with four partitions, with writes being appended to the end of each partition. Records are stored on a partition either by record key, if the key is present, or by round robin if the key is missing. Here you can see there are multiple brokers and a topic which has four partitions; each partition has its own ID, the ID of a replica is the same as the ID of the broker that hosts it, and for each partition Kafka will elect one broker as the leader. Supposing the replication factor of a topic is set to 3, Kafka will create three identical replicas of each partition and place those replicas on the available brokers in the cluster.
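The key-based versus round-robin partition choice described above can be sketched as follows. This is an illustration, not Kafka's actual partitioner (which uses murmur2 hashing); CRC32 is used here only because it is a stable hash available in the standard library.

```python
import zlib
from itertools import count

_round_robin = count()  # shared counter for keyless records

def choose_partition(key, num_partitions):
    """Pick a partition: hash of the key if present, round robin otherwise."""
    if key is not None:
        return zlib.crc32(key.encode()) % num_partitions
    return next(_round_robin) % num_partitions

# Records with the same key always land in the same partition...
assert choose_partition("user-42", 4) == choose_partition("user-42", 4)

# ...while keyless records are spread evenly across the partitions.
print([choose_partition(None, 4) for _ in range(6)])  # [0, 1, 2, 3, 0, 1]
```

Keeping same-key records on one partition is what preserves per-key ordering, since ordering is only guaranteed within a partition.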
Now let's look at the working of consumers. A consumer can subscribe to one or more topics and reads messages in the order they were produced; it keeps track of the messages it has already consumed by keeping track of the offsets of the messages. Messages with the same key arrive at the same consumer. Consumers work as part of a consumer group, which is one or more consumers that work together to consume a topic; the group ensures that each partition is consumed by only one member. As the figure shows, there are three consumers in a single group consuming a topic: two consumers are working on one partition each, while the third consumer is working on two partitions.
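The partition-to-consumer assignment described above can be sketched like this (a simplification of Kafka's group assignors, which also handle rebalancing when members join or leave):

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers so each partition has exactly one owner."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

result = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"])
print(result)  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}
```

With four partitions and three consumers, one consumer ends up owning two partitions while the other two own one each, matching the figure; and since every partition has exactly one owner, no message is processed twice within the group.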
Now let's look at ZooKeeper. ZooKeeper is an open-source Apache project that provides centralized infrastructure and services enabling synchronization across an Apache Hadoop cluster. Developed originally at Yahoo, ZooKeeper facilitates synchronization among processes by maintaining status on ZooKeeper servers, which store information in local log files, and these ZooKeeper servers are capable of supporting a large Hadoop cluster. Kafka brokers coordinate with each other using ZooKeeper, and producers and consumers are notified by the ZooKeeper service about the presence of a new broker in the system or about the failure of a broker in the system. This is why ZooKeeper is really important for Kafka. Just a quick info: if you want to become a certified professional in Apache Kafka, Intellipaat offers an Apache Kafka certification and training; for further details you can check the description below. Now let's continue with the session.
Now let me help you understand a single-broker setup and configuration in Kafka with the help of a demo. For this demo you need to have Java, Kafka, and ZooKeeper already installed on your system. The first step is to open the terminal and start ZooKeeper and the Kafka broker. This is the command to start ZooKeeper: zookeeper-server-start.sh, followed by the path kafka/config/zookeeper.properties. So we are starting ZooKeeper. Now I'll duplicate the session, and in this new session I will start Kafka; the command is kafka-server-start.sh followed by the path kafka/config/server.properties. I'll hit enter, and we are starting Kafka now. Now I'll open another terminal and type jps to check whether Kafka and ZooKeeper are running: the QuorumPeerMain entry tells us that ZooKeeper has started, and the Kafka entry tells us that Kafka has also started.

Since we have started Kafka and ZooKeeper, it's time to create a Kafka topic. This is the command: kafka-topics.sh with --create, then --zookeeper with the host and port, localhost:2181, then the replication factor, which is 1, then the number of partitions, which is 1, and finally --topic with the name of the topic; I am setting the name of the topic to my-topic1. We see that we have successfully created the topic my-topic1. Now it's time to start the producer to send some messages, and this is the command: kafka-console-producer.sh with --broker-list followed by localhost and the port number, then --topic with the name of the topic, which is my-topic1. I'll hit enter, and here we can type a set of messages, so let me just type "hi how are you". Again I will duplicate the session, and this is the command to start the consumer: kafka-console-consumer.sh with --bootstrap-server followed by localhost and the port number, then --topic with the name of the topic, and then --from-beginning. I'll hit enter, and we see that the messages which we sent from the producer arrive in the consumer. Now let's say I type something else, "Sparta hello world"; let me open the consumer and see: right, there is "Sparta hello world". Let me write some more messages through the producer, just some random words, and again we see those messages in the consumer.
Now let us understand Spark briefly. Spark is a cluster computing framework for real-time processing. It was introduced as a Hadoop sub-project in the UC Berkeley R&D lab in 2009, became open source in 2010, and was finally donated to the Apache Software Foundation in 2013. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Now let us look at some of the features of Spark. Spark provides real-time computation and low latency because of in-memory computation, and Spark can be up to a hundred times faster for large-scale data processing. Spark is also polyglot, so you can write Spark applications in multiple languages such as Java, Scala, Python, R, and SQL. Spark also has powerful caching: its simple programming layer provides powerful caching and disk persistence capabilities. And Spark provides multiple deployment modes: it can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager. Spark's impact was such that from small startups to the Fortune 500, almost every single company started adopting Apache Spark to build, scale, and innovate their big data applications; industries like media, healthcare, finance, e-commerce, and travel, almost every single industry, are using Spark intensively.
Now let's go ahead and understand the concept of RDDs in Spark. When it comes to processing data over multiple jobs, we need to reuse and share the data, which can be achieved through in-memory data sharing, which is much faster than network and disk sharing. This is where RDDs come in to help us with in-memory data sharing. RDD stands for Resilient Distributed Dataset, and it is the fundamental data structure of Apache Spark. By resilient I mean fault tolerant: an RDD can recompute missing or damaged partitions, in case of a node failure, with the help of the RDD lineage graph. It is distributed, since data resides on multiple nodes. And finally, dataset represents the records of the data you work with: the user can load the dataset externally, from a JSON file, CSV file, text file, or a database.
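The lineage idea can be sketched with a toy model (not Spark's actual API): a derived partition remembers the parent data and the transformation that produced it, so a lost partition can be recomputed from its lineage instead of being restored from a backup.

```python
class ToyRDDPartition:
    """A derived partition that records its lineage: parent data + transform."""
    def __init__(self, parent_data, transform):
        self.parent_data = parent_data
        self.transform = transform
        self.data = [transform(x) for x in parent_data]

    def lose(self):
        self.data = None   # simulate losing this partition in a node failure

    def recompute(self):
        # No backup needed: replay the transformation recorded in the lineage.
        self.data = [self.transform(x) for x in self.parent_data]

part = ToyRDDPartition([1, 2, 3], lambda x: x * 10)
part.lose()
part.recompute()
print(part.data)  # [10, 20, 30]
```

This is why RDDs can be fault tolerant without replicating every intermediate result: storing the recipe is cheaper than storing the data.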
So now let us understand the various components of the Spark ecosystem. Let me start with the Spark Core component, which is the most vital component of the Spark ecosystem: it is responsible for basic functions, scheduling, monitoring, and so on, and you can see that the entire Spark ecosystem is built on top of it. Then we have the different deployment modes: Spark can be deployed via YARN, Mesos, or Spark's own cluster manager. Then we have the different libraries: the Spark ecosystem's libraries comprise Spark SQL, MLlib, GraphX, and Spark Streaming. Spark SQL helps us in performing queries on stored data using SQL-like queries. Then we have MLlib, the Spark machine learning library, which eases the deployment and development of scalable machine learning pipelines such as summary statistics, correlation, feature extraction, and many others. The GraphX component of Spark helps data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. Then we finally have the Spark Streaming component, which allows us to perform batch processing and streaming of data in applications. Coming to the programming languages, Spark applications can be implemented in Scala, R, Python, and Java; however, Scala is the most widely used language for Spark. And finally, we can store data in HDFS, the local file system, or the cloud; Spark also supports SQL and NoSQL databases.
So now you might have a brief idea about the Spark components, but let us discuss streaming in detail, as we will be performing the demo on Kafka Spark streaming towards the end of the session. Data streaming is a technique for transferring data so that it can be processed as a steady and continuous stream; a data stream is an unbounded sequence of data arriving continuously. There are so many users out there using streaming sources like YouTube, Netflix, Facebook, and Twitter, and all these sources produce live streams of data, so streaming technologies are becoming increasingly important with the growth of the internet. Spark Streaming is used for processing real-time streaming data: it enables high-throughput and fault-tolerant stream processing of live data streams, and with Spark Streaming we can perform RDD transformations on many parts of the data. The fundamental stream unit here is known as a DStream, which is basically a series of RDDs used to process real-time data.

So now that you understand what exactly Spark Streaming is, let's have a look at some of its features. Spark Streaming is scalable: say you start processing with a single node, then with the increase of data you can add more nodes as and when required. Spark Streaming is also very fast and helps in achieving low latency. Another feature of Spark Streaming is that it provides fault tolerance: if there is any failure or error during the streaming process, it can be handled without any loss of data. Spark Streaming can also be easily integrated with both batch processing and real-time processing. Now let me brief you on how Spark Streaming works. As you can see in the diagram, Spark Streaming receives live input data streams from various sources, divides the data into multiple batches, which are then processed by the Spark engine, and generates the final stream of results in batches.
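The divide-into-batches idea behind a DStream can be sketched as follows; note this is a simplification, since real DStreams cut batches by a time interval rather than by a record count.

```python
def micro_batches(stream, batch_size):
    """Group an (unbounded) stream of records into fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # each batch is processed like a small RDD
            batch = []
    if batch:
        yield batch          # flush the final partial batch

incoming = ["e1", "e2", "e3", "e4", "e5"]
for batch in micro_batches(incoming, 2):
    print(batch)             # ['e1', 'e2'] then ['e3', 'e4'] then ['e5']
```

Each yielded batch is handed to the engine as a unit, which is how Spark Streaming reuses the batch machinery of Spark Core to process an unbounded stream.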
So now we'll implement a demo on how to integrate Kafka with Spark Streaming. In this demo we are going to fetch data from a Kafka topic into a Spark application. Let me show you the folder where the Spark application is present: I have created a folder named kafka-spark-streaming, and the entire Spark application is in this folder. Let me run ls and see what we have. We have build.sbt here, which basically has all the dependencies to run the code and build a jar file, and inside the src folder we've got the code file for the Spark application. Let me go into the src folder and run ls; inside this we have the main folder, so let me cd into main, and inside the main folder we have the scala folder. I'll go inside that with cd scala, and inside this we finally have our Scala program, KafkaR.scala. Let me open it with vi KafkaR.scala. This is our program for the Spark application, and here in the program you can see I have given the name of the topic as my-test, so later on, when I'm creating the topic, I will have to name it my-test.

Now that we've seen the folder which has all the Spark code, let me go ahead and start ZooKeeper and Kafka. The command to start ZooKeeper is zookeeper-server-start.sh followed by the path kafka/config/zookeeper.properties; we are starting ZooKeeper now. I'll duplicate the session and also start Kafka with kafka-server-start.sh and the path kafka/config/server.properties; I'll hit enter and we see that Kafka is also starting. Again let me duplicate the session and type jps to see whether Kafka and ZooKeeper have started: QuorumPeerMain tells us that ZooKeeper has started, and the Kafka entry tells us that Kafka has also started.

So now it's time to create a topic with the same name which we provided in the program. This is the command: kafka-topics.sh with --create, then --zookeeper with localhost and the port number, then the replication factor, which is 1, then the number of partitions, which is also 1, and then --topic with the name of the topic, which is my-test. We have successfully created the topic my-test. Now it's time to start the producer to pass on some messages, and this is the command: kafka-console-producer.sh with --broker-list followed by localhost and the port number, then --topic with the name of the topic, which is my-test. All right, we have also started the producer and can begin typing messages.

So now what I'll do is create a duplicate session, and in the duplicate session I will compile and run the Spark application. Let me go to the kafka-spark-streaming directory with cd kafka-spark-streaming and see what we have here: the build.sbt file and the src folder. Now I'll compile with sbt compile; we see "done compiling", which means we have successfully compiled. Now let's also run it, so let me give the command sbt run. We have successfully started the session, so now what I'll do is pass some messages through the producer, and we'll get all of those messages in the Spark session in real time. I type "Sparta" and "this is Spark", and we see that these messages, which I'm sending through the producer, appear in the Spark session. Again let me also send some random messages: right, we get all of these, "random message 1 2 and 3" and "this is Spark Kafka integration", in the Spark session. So this is how we can integrate Spark with Kafka.

Hi everyone, welcome to this session on Kafka Flume integration.
In today's agenda, we will see what Kafka integration is and what the need for this integration is; after that we will have a small introduction to Apache Flume, then set up a standalone Flume node and have a small demonstration of Flume, and after that we will integrate Flume and Kafka, so that Kafka will generate some messages, those messages will pass through Flume, and finally they will get loaded into HDFS.

So let us start with what Kafka integration is. Kafka integration is nothing but combining the features of Kafka with another distributed system so that we can enhance and get the functionality of two distributed systems at a time. For example, we can integrate Kafka with Flume: it is an easy way to take your data from Kafka as a source and then load it into Hadoop. Apart from that, you may also need to process your data in real time; in that case we will use Kafka with Spark or Storm. Kafka will produce the messages, those messages will go to Spark or Storm, where the real-time analysis will happen, and finally we will send that processed data to some third party; the third party can be Hadoop or any other NoSQL database. Now let us have a small introduction to Apache Flume.
Apache Flume is a distributed, reliable, easy-to-use, flexible tool which helps in fast loading of huge datasets from various sources to various sinks by means of a channel. This means we are going to have a source, say a file, whose data will pass through a channel (an example of a channel is memory), and finally that data will be stored into a sink; the sink can be anything, like HDFS, another file, or any other directory. To get this done we will run a service called the Flume agent, which is nothing but a JVM that internally has the three components of Flume. We do not need to do anything else: we simply put the tarball on some node and run the Flume agent, which has three in-built components, namely sources, sinks, and channels, and we will completely process our data through these three components.

Now, as I said, the three components are sources, sinks, and channels, and for one Flume agent you can have multiple sources, multiple sinks, and multiple channels. So what are sources? Sources are the processes which produce the events; examples are the netcat source, the exec source, and the HTTP source. It can be anything, say a file whose data is continuously being updated, where you need to move that data from that particular file to another location. Sinks receive the events: there is the logger sink, which we use if we want to display the results and events on the console itself; the HDFS event sink, which is used when you want to store your data directly into HDFS; and the rolling file sink, which means you are going to store the data in a file which is continuously getting updated and rolling over after a particular time or after a particular size. Channels connect sources and sinks; examples are the memory channel, the JDBC channel, and the disk channel, or we can say the file channel. At the bottom of this slide you can see we are generating data from your servers, which will go to a source, pass through a channel, say memory, and then finally pass through the sink and get stored into HDFS. So the Flume agent consists of these three components: source, channel, and sink. Now, what is an event?
Whatever messages are generated from the source, for example a web server, will all be treated as events; you can simply say this is a message, and in terms of Flume we call it an event. So an event is nothing but a unit of data; it can be anything, like text, an image, or a file, and data flows from the source to the sinks as events. When we do the demonstration, I will show you on the Flume agent console that you can see the events being generated. Finally, an event does not have a defined size: an event can be of any size, and the size will totally depend on the channel's size. For example, if we are using the in-memory channel and we have a fixed amount of memory for the JVM, say 2 gigs, then we will simply accumulate events up to that size, or say 80% of it, and after that the data will be synced to the disk or to the sink. Now let's understand the configuration.
Before running the Flume agent, we have to set up the configuration properties file, which is present under the conf directory; its name is flume-conf.properties. This property file is used when you run your Flume agent, and you can customize it as well: say you create a file named exercise1.conf; you can simply use that when you run your Flume agent. In that file we define what the agent is, what the source name is, what the sink name is, and what the channel name is, and apart from that we define some properties of these three components.

So let us look at a simple conf file, and I will explain the different components we have. On the screen, the first section names the components, which means we are naming our Flume components: a1 is our agent, so that is the agent name, and for the three component types, sources, sinks, and channels, we provide names: for sources we have the name source1, for sinks we have the name sink1, and for channels we have the name channel1. After that you can see we configure the source, source1: a1.sources.source1.type defines the type, that is, where we are getting the data from; it can be, for example, netcat, which is nothing but a server that listens on some port and produces data. The next item defines the channel of the source, that is, how we will move that data: here it is a1.sources.source1.channels = channel1, where channel1 is the name of our channel. Now we configure the sink: a1.sinks.sink1, which we have taken from the names above, and a1.sinks.sink1.type provides the type of the sink; it can be HDFS, a file, or any other database like Couchbase, Cassandra, or HBase. We define the channel for the sink as well: a1.sinks.sink1.channel = channel1, which is the name of our channel. Finally we configure the channel, that is, the type of the channel; in our case we will be using the in-memory channel. So this is how a simple configuration file looks.
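Putting the properties just described together, a minimal skeleton might look like the following. The names a1, source1, sink1, and channel1 are the ones used above; the netcat, logger, and memory types are illustrative choices, and note that an actually runnable netcat source would additionally need bind and port properties, which are covered in the exercise later in this session.

```properties
# Name the components on agent a1
a1.sources = source1
a1.sinks = sink1
a1.channels = channel1

# Configure the source
a1.sources.source1.type = netcat        # e.g. netcat, exec, http
a1.sources.source1.channels = channel1  # note: "channels" (plural) on sources

# Configure the sink
a1.sinks.sink1.type = logger            # e.g. logger, hdfs, file_roll
a1.sinks.sink1.channel = channel1       # note: "channel" (singular) on sinks

# Configure the channel
a1.channels.channel1.type = memory      # e.g. memory, jdbc, file
```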
Now, to set up Apache Flume, you can download the tarball from the Apache website, or simply wget it on your Linux machine; I have given the path here so you can use that. After that, place the tarball on your Linux machine, simply untar it with tar -xvf, and then set JAVA_HOME and FLUME_HOME in the .bashrc file. Let us complete this part on the Linux machine. This is my Linux machine; let me change the font first. Okay, I have already downloaded the Apache Flume tarball here, so I will simply untar it. I got this directory, and I'm simply renaming it to a simpler name, so that whenever we want to change the directory to Apache Flume we can simply type the short name instead of the full path. Moving into it, this is our Flume directory that we got from untarring the tarball. Inside it you can see an important directory, bin, where all the scripts are, and the second important directory, conf, where the actual configuration files are present; we can customize the configuration for our Flume agent and place that customized configuration file into this directory. Let us have a look at what is inside: you can set up the Flume environment file here, and this is the file I was talking about, flume-conf.properties. When we do the demonstration, we will create our own conf file and place it into this directory. After that, let us set FLUME_HOME and JAVA_HOME in the .bashrc; I have done that already, and you can see this is my FLUME_HOME and this is my JAVA_HOME. You can simply set both of these, save the file, and then source it so that the Linux machine knows about the new changes in the .bashrc file. Till now we have just fetched the Flume tarball, untarred it, and set up the environment variables. Let us move to the slides and move on to a small exercise.
What we will do is generate some data using netcat, which is nothing but a server that listens on some port where we can simply type our messages; the same messages will pass through the memory channel and get stored to some sink location. In our case we will use the console itself as the sink, so we will see all the messages on the console itself. We will make the configuration changes which define the parameters for all three Flume components: after setting the parameters, we will create a configuration file named exercise1.conf, inside which we will define the three components, sources, sinks, and channels, for our agent. After that we will simply run the Flume agent. The next step is to generate the messages using netcat and finally check the same messages, as events, on the Flume agent console. This is what we are going to do now. First of all, let us go to the Linux machine.
machine I have already created one file that is exercise one more pawns I will explain what exactly
the configuration we have explained here so at the first place you can see
configuring components we have a agent and its name is the a1 and the phlume
components like sources channels and sinks we are giving some names to them
like our sources name is source 1 channels name is general one sinks name
is sink long now we will configure the source from where the data will be
generated in this case we need to define that type okay a 1 dot source is both
source 1 which we have taken from here only after that we will simply type dot
type what is the type of the sauce we are generating data from netcat so we
just simply write here net what is the channel channel we are going to use this
the channel name is channel 1 so we simply typed it till now we need to type
the IP of net gets over this local server so I have paste in my a local I
behaved and the post that we are going to use in the both is you can use any
pork here now configuring the channel the type of channel is memory the
capacity is ten thousand bytes and the transaction capacity means is one
thousand means whenever we will have the details from 1000 bytes that data will
be be persisted to the section this thing or in our case of the sinking the
control only so the in that case the data will simply keep rolling over the
console now configuring sync look at this thing look the type of thing
Logan louver means that we are going to simply store the sources messages or
ball forces control itself moviments will be the
channel we are going to use channel one that we define game so this is a simple
configuration for our exercise okay you can simply use this exercise when they
open from here also you can simply put this earth there loans come directly and
putting this into films both compare agree okay let us move to the slides to
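putting all of that together, the exercise1.conf described above would look roughly like this sketch, where the bind IP and port are placeholders you would replace with your own machine's values:

```properties
# exercise1.conf -- netcat source -> memory channel -> logger sink
a1.sources  = source1
a1.channels = channel1
a1.sinks    = sink1

# source: netcat server listening on a local IP and port (placeholders)
a1.sources.source1.type = netcat
a1.sources.source1.bind = 192.168.1.10
a1.sources.source1.port = 44444
a1.sources.source1.channels = channel1

# channel: in-memory, capacity 10000 events, 1000 events per transaction
a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 10000
a1.channels.channel1.transactionCapacity = 1000

# sink: logger, prints every event to the agent console
a1.sinks.sink1.type = logger
a1.sinks.sink1.channel = channel1
```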
on the slide you can see I have pasted the snapshot of exercise1.conf and I have highlighted the important things here okay these are the same things that I just explained so I'm moving ahead now what is the task we will run a flume agent providing the configuration file to it okay we will use the name of the agent that we defined in our exercise1.conf and finally we want flume.root.logger=INFO,console because I want all the results on the console itself whatever flume is doing I need to see all that information on the console so in our case what we will do we will run flume-ng which is our agent launcher we will give the process name as agent then we will tell it where the configuration directory is so we are going to use the conf directory of flume then --conf-file which is the configuration file we are going to use exercise1.conf then the name of the agent that we defined in exercise1.conf which is a1 and finally we want all the results on the console let us run this
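the launch command being described can be sketched like this, assuming FLUME_HOME is set and exercise1.conf sits in flume's conf directory:

```shell
# start the flume agent named a1 with exercise1.conf,
# logging everything it does to the console
flume-ng agent \
  --conf $FLUME_HOME/conf \
  --conf-file exercise1.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console
```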
I'm going to simply take this command and run it okay now the flume agent is running it is displaying whatever it is doing on the console as it starts let me open another window I have already opened one where we will run the netcat so okay before we start generating messages let us see what the flume agent has done you can see starting channel and the channel name is channel1 okay after that you can see channel channel1 started now it is saying starting sink sink1 was the name of our sink then starting source source1 was the name of our source okay it has started whatever we have given in the configuration file now it's time to generate messages via the netcat server the command is netcat then the IP let me take the IP here that we gave in our exercise1.conf I will run ifconfig to get the IP this is my IP I will simply run netcat with the IP and the port that I used in exercise1.conf okay the actual command is nc then the IP and the port you can see the IP and the port okay now our netcat server is running
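the netcat side can be sketched like this, with the placeholder IP and port matching whatever you put in exercise1.conf:

```shell
# find the local IP, then connect netcat to the IP/port from exercise1.conf;
# every line typed afterwards becomes one flume event
ifconfig
nc 192.168.1.10 44444
```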
we'll simply type some messages here and the same messages will go through the memory channel and will be displayed on the flume agent console okay now I'm typing some messages here this is my first event okay it means the message has passed to the in-memory channel and we should see the same message on the logger sink let us check this on the flume window you can see it generated an event these are the raw byte values and this is the actual message okay let us generate some more messages this is my second message or event and the same messages we should see on the flume agent console window you can see here all the messages this is my second message and we are testing flume okay so on the console it will simply show our complete message and it is continuously storing all the messages here so this is how we can test our setup and test our flume agent okay so I am just closing this now and I'm closing the flume agent as well I need to kill this so now we have tested how we can set up the flume agent and run a simple demonstration to check if our messages are being generated properly and are being passed through the memory channel to the sink let us move ahead with the slides now okay we are done till here we have generated some events using the nc command giving the IP and port after that we validated that the same messages we were generating at netcat were appearing on the flume agent console now it's time to integrate Apache Kafka with Apache Flume we can simply take Kafka as a source or we can use Kafka as a channel and we can also use Kafka as a sink in today's session we are going to simply use Kafka as a
source we will generate the messages via the Kafka console producer and those messages will pass through the in-memory channel and finally will be stored to the sink okay in our case the sink will be HDFS we are going to load data into HDFS from the Kafka producer okay for that what we need we should have Kafka up and running and we should have the Hadoop services up and running so that we can store our data into Hadoop and then we will run the flume agent after preparing one configuration file that will be used while we run the flume agent and this configuration file will contain all the information about the sink about the in-memory channel and about the source which is Kafka in our case so let us first start the Hadoop services and the Kafka services on this PPT I have mentioned how you can start the Kafka services so before starting Kafka you need to start ZooKeeper then you can start the Kafka server and then create a test topic so that we can use that topic as a source whatever data is produced in the topic will pass through flume to HDFS okay let us start the services first I have a small script which is doing nothing but starting the NameNode the DataNode the ResourceManager and the NodeManager I am running this it is starting the NameNode now the DataNode now the ResourceManager the NodeManager and finally it is done we can simply run jps to check if all the services are up and running so the processes are up initially the NameNode will be in safe mode so I am going to enter a command that will take the NameNode out of safe mode okay so now our Hadoop is ready we can
simply write our data to Hadoop now it's time to run Kafka but before that we need to run ZooKeeper so this is the command I'm going to run this in the background zookeeper-server-start.sh and we are passing the default zookeeper.properties file from Kafka itself I'll press enter and now I will start the Kafka server as well this is the command which is present in Kafka's bin directory kafka-server-start.sh we are passing the server property file here and running this in the background now you can simply do jps and you can see Kafka is also running and now we will create one sample topic in Kafka so that we can use that topic to generate some messages and then pass those messages to flume the command is kafka-topics and the topic name is kafka-flume creating this okay created topic kafka-flume so we have started the Hadoop services the Kafka services and we have created one new topic named kafka-flume we will use this topic to produce some messages and those messages will be stored in this topic.
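the startup sequence just described can be sketched like this, run from Kafka's home directory (note that newer Kafka releases replace the --zookeeper flag of kafka-topics.sh with --bootstrap-server):

```shell
# Hadoop is assumed up already; take the NameNode out of safe mode
# so flume can write to HDFS
hdfs dfsadmin -safemode leave

# start ZooKeeper and the Kafka broker in the background
# using the default property files shipped with Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &
jps   # QuorumPeerMain and Kafka should appear in the list

# create the test topic that flume will read from
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --topic kafka-flume --partitions 1 --replication-factor 1
```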
okay now it's time to create a flume agent configuration file that will include all the details regarding your source your sink and your channel I have already prepared one exercise file that I have named exercise2.conf okay so let me explain what are the different components that I have used here the first part is exactly similar to what we used in our exercise1.conf we have the agent name that is a1 and we have given names to all three Apache Flume components the source's name is source1 the channel's name is channel1 and the sink's name is sink1 first of all we are configuring the source in our case the source is Kafka so the type will be org.apache.flume.source.kafka.KafkaSource so in the previous example it was simply netcat we also need to provide the details that are related to Kafka so that we can properly fetch our data from the Kafka topic into flume we need to give the ZooKeeper details zookeeperConnect in our case it is running on localhost and the port used is 2181 we need to give the topic name which is kafka-flume in our case I will simply update this point okay so in our case the topic was kafka-flume then we can simply give a flume ID sorry a group ID we can give any name here then the channel the channel is channel1 that we have defined here already these are some of the interceptor and optional properties that you can use because I am going to store my data date-wise in HDFS that's why I have used these interceptor types now the second thing is configuring the channel we are using the in-memory channel we have given the capacity and the transaction capacity as well just like we did in exercise 1 then we will configure the sink
earlier our sink was the console this time we are going to store our data in HDFS the channel used is channel1 we have defined it here and we are going to store the data at this HDFS path a topic directory will be created with whatever name we have given here for example kafka-flume that directory will be created in HDFS and inside it we are going to store the data date-wise apart from that we have given some optional parameters as well the roll interval is 5 seconds the number of seconds to wait before rolling the current file so if we need to create a new file it will wait for 5 seconds roll size after a size of 1 KB we will be getting a new file roll count if we generated 10 events to a particular file we will get a new file after that the last one is file type as DataStream when you move data from the Kafka source into HDFS files by default it will generate the data in SequenceFile format only so to get the data in plain text format you need to define this DataStream file type otherwise you will get binary values let us save this and I am again going to copy the same file to flume's home directory.
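the exercise2.conf being walked through might look roughly like this sketch; the HDFS path is an assumed example, and the Kafka property names shown (zookeeperConnect, topic, groupId) follow the older Flume 1.x Kafka source, with newer releases using kafka.bootstrap.servers and kafka.topics instead:

```properties
# exercise2.conf -- Kafka source -> memory channel -> HDFS sink
a1.sources  = source1
a1.channels = channel1
a1.sinks    = sink1

# source: Kafka topic read via ZooKeeper (older Flume 1.x property names)
a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.zookeeperConnect = localhost:2181
a1.sources.source1.topic = kafka-flume
a1.sources.source1.groupId = flume
a1.sources.source1.channels = channel1
# timestamp interceptor so the sink can resolve the date escapes in the path
a1.sources.source1.interceptors = i1
a1.sources.source1.interceptors.i1.type = timestamp

# channel: in-memory, same settings as exercise 1
a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 10000
a1.channels.channel1.transactionCapacity = 1000

# sink: HDFS, one directory per topic, partitioned by date (assumed path)
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.channel = channel1
a1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
a1.sinks.sink1.hdfs.rollInterval = 5       # seconds before rolling a new file
a1.sinks.sink1.hdfs.rollSize = 1024        # bytes before rolling a new file
a1.sinks.sink1.hdfs.rollCount = 10         # events before rolling a new file
a1.sinks.sink1.hdfs.fileType = DataStream  # plain text instead of SequenceFile
```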
now I have explained all these items how to configure the components how to configure the source then the channel and finally the sink now it's time to start the flume agent using this configuration file that we just created after that we will start kafka-console-producer.sh so that we can pass our messages to the newly created topic that we are using in exercise2.conf after that we will check the same messages in HDFS they will be stored under the name of the topic itself finally we will check the contents of the final file and after all this I need to stop the flume agent okay the command is exactly the same as we were using earlier I'm just using exercise2 here okay and a new agent will be started it will get the configuration from the exercise2.conf file okay meanwhile it is starting let me just prepare the Kafka producer as well okay this is the command kafka-console-producer.sh this is the script that will be present in Kafka's bin directory I will use the broker connection I am going to generate the messages on this particular broker itself and the topic name is kafka-flume in our case I'm going to generate some messages in this topic okay from this side the flume agent is completely ready to receive messages
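the producer command can be sketched like this, assuming the broker runs on the default local port (newer Kafka versions use --bootstrap-server instead of --broker-list):

```shell
# console producer writing to the kafka-flume topic;
# each line typed becomes one message
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kafka-flume
```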
so before I generate messages via the producer let me show you what the flume agent has done already it has started the channel and everything yeah okay you can see here creating channels channel channel1 type memory we are using the in-memory channel and it has created that then channel channel1 started then it is creating the source source1 and the type is KafkaSource okay and after that it is creating the sink the name of the sink is sink1 and the type is HDFS so the flume agent has done all the settings now it's time to generate the messages I'm on the Kafka producer session and I'm going to type some messages which are going to be stored in the kafka-flume topic this is my first event let us see if the same message has been passed through the in-memory channel and stored into HDFS let us check this on the flume console first okay you can see first of all it will create a temp file for every batch of messages that is being read from that topic and then create the main data file with this name we have given the path where we want to store the data in HDFS and the name of the directory where flume will store the data is kafka-flume which is exactly the same as the name of our topic then we chose that we want to store the data date-wise so this is the date and this is the main data file let me generate two or three more messages and then we will close this and finally check the data in HDFS as well
this is my second message okay so flume is continuously writing all of this newly produced data into HDFS you can simply see opening renaming and writing in batches all the .tmp files being renamed means writing to HDFS is complete similarly for all the five messages that we recently generated in the kafka-flume topic okay now as we are done with generating messages via the Kafka console producer those produced messages have been passed to the in-memory channel of flume and finally got stored into HDFS let us check these final results in HDFS itself first of all let us close the producer then close the flume agent as well find it with jps and kill the application that was running for the flume agent okay now I'm going to go into HDFS with hdfs dfs -ls on the directory named after our topic let us see what are the different directories in there we wanted to store the data date-wise oh I think I have only given the date here so flume has generated this directory okay
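the verification steps can be sketched like this, where the directory layout follows the hdfs.path pattern assumed above and the date partition and the numeric suffix of the FlumeData file are placeholders you would read off your own listing:

```shell
# list the per-topic directory, drill into the date partition,
# then print one of the rolled data files
hdfs dfs -ls /tmp/kafka/kafka-flume
hdfs dfs -ls /tmp/kafka/kafka-flume/19-05-27
hdfs dfs -cat /tmp/kafka/kafka-flume/19-05-27/FlumeData.1558958081022
```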
so now we can see we have multiple data files at this location all the data that was generated via the console producer was continuously being stored at this location let us cat this data file okay this is the first message that we generated right on the producer side window you can see this is the kafka-flume topic and this is my first event you can see the same message in this file similarly you can check any of the files so basically it will contain all the messages that we were generating in the Kafka producer so that is all with the integration part of Kafka with flume and here we were using Kafka as a source only just a quick info guys if you want to become a certified professional in Apache Kafka then Intellipaat offers an Apache Kafka certification and training course for further details you can check the description below so this brings us to the end of this session if you have any queries do comment below we'll reach out immediately thank you and see you again in the next session
