Data Persistence
Data persistence
Persistence refers to object and process characteristics that continue to exist even after the process that created them ceases or the machine they run on is powered off. When an object or state is created and needs to be persistent, it is saved in a non-volatile storage location, such as a hard drive, rather than in a temporary file or volatile random access memory.
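As a minimal sketch of this idea, the Java snippet below saves an object's state to a file and reads it back, as a later run of the program might. The `Counter` class, the `PersistenceSketch` helper and the temporary file are hypothetical names chosen for illustration; Java serialization is only one of several ways to persist state.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class used only for this illustration.
class Counter implements Serializable {
    private static final long serialVersionUID = 1L;
    int value;
    Counter(int value) { this.value = value; }
}

class PersistenceSketch {
    // Write the object's state to non-volatile storage (a file on disk).
    static void save(Counter counter, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(counter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the state back; this could happen in a later run of the program.
    static Counter load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (Counter) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Save to a temporary file, load it again, then clean up the file.
    static Counter roundTrip(Counter counter) {
        try {
            Path file = Files.createTempFile("counter", ".ser");
            try {
                save(counter, file);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `PersistenceSketch.roundTrip(new Counter(42))` returns a new `Counter` whose `value` was restored from the file rather than from memory.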
Persistent data
· The opposite of dynamic: it doesn’t change and is not accessed very frequently.
· Core information, also known as dimensional information in data warehousing. Demographics of entities: customers, suppliers, orders.
· Master data that’s stable.
· Data that exists from one instance to another. Data that exists across time, independent of the systems that created it. Now there’s always a secondary use for data, so there’s more persistent data. A persistent copy may be made or it may be aggregated. The idea of persistence is becoming more fluid.
· Stored in actual format and stays there, versus in-memory where you have it once, close the file and it’s gone. You can retrieve persistent data again and again. Data that’s written to the disk; however, the speed of the disks is a bottleneck for the database. There is a trend to move to memory because it’s 16X faster.
· Non-volatile. Persists in the face of a power outage.
· Any data stored in a way that it stays stored for an extended period, versus in-memory data. Stored in the system, modeled and structured to endure power outages. Data doesn’t change at all.
· Data considered durable at rest with the coming and going of hardware and devices. There’s a persistence layer at which you hold your data at risk.
· Data that is set and recoverable, whether in flash or memory backed.
Data
In computing, data is information that has been translated into a form that is efficient for movement or processing.
Database
A database is a systematic collection of data. Databases support storage and manipulation of data, and make data management easy.
The term database server may refer to both the hardware and the software used to run a database, depending on the context. As software, a database server is the back-end portion of a database application, following the traditional client-server model. This back-end portion is sometimes called the instance. The term may also refer to the physical computer used to host the database. In that context, the database server is typically a dedicated higher-end computer that hosts the database.
Note that the database server is independent of the database architecture. Relational databases, flat files, and non-relational databases can all be accommodated on database servers.
Database Management Systems (DBMS)
A database management system is system software for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data.
File vs Databases
File - A file is a container in a computer system for storing information. Files used in computers are similar in features to the paper documents used in library and office files. There are different types of files, such as text files, data files, directory files, binary files and graphic files, and these store different types of information. In a computer operating system, files can be stored on optical drives, hard drives or other types of storage devices.
Pros of the File System
· Performance can be better than when you do it in a database. To justify this: if you store large files in a database, it may slow down performance, because a simple query to retrieve the list of files or a filename will also load the file data if you used SELECT * in your query. In a file system, accessing a file is quite simple and lightweight.
· Saving and downloading files in the file system is much simpler than in a database, since a simple "Save As" function will help you out. Downloading can be done by addressing a URL with the location of the saved file.
· Migrating the data is an easy process. You can just copy and paste the folder to your desired destination, while ensuring that write permissions are provided on the destination.
· It's cost effective in most cases to expand your web server rather than pay for certain databases.
· It's easy to migrate to cloud storage (e.g. Amazon S3, CDNs) in the future.
Cons of the File System
· Loosely packed. There are no ACID (Atomicity, Consistency, Isolation, Durability) guarantees linking the files to the database records that refer to them. Consider a scenario in which your files are deleted from their location manually or by an attacker. You might not know whether the file exists or not. Painful, right?
· Low security. Since your files are saved in a folder for which you must grant write permissions, the approach is prone to security issues and invites trouble, like hacking. It's best to avoid saving in the file system if you cannot afford to compromise on security.
Database - As mentioned earlier, a database is a collection of data.
Pros of Database
· ACID consistency, which includes rollback of an update; this is complicated when files are stored outside the database.
· Files will be in sync with the database and cannot be orphaned, which gives you the upper hand in tracking transactions.
· It's more secure than saving in a file system.
Cons of Database
· You may have to convert the files to blobs in order to store them in the database.
· Database backups will be heavier.
· Memory usage is inefficient.
Different arrangements of data
Big data
Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and gather insights from large datasets. While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.
An exact definition of "big data" is difficult to nail down because projects, vendors, practitioners, and business professionals use it quite differently. With that in mind, generally speaking, big data is:
· large datasets
· the category of computing strategies and technologies that are used to handle large datasets
In this context, "large dataset" means a dataset too large to reasonably process or store with traditional tooling or on a single computer. This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
Data warehouses
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources, and it supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
Information gathered in a warehouse can be used in any of the following domains:
· Tuning Production Strategies − Product strategies can be well tuned by repositioning the products and managing the product portfolios by comparing sales quarterly or yearly.
· Customer Analysis − Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.
· Operations Analysis − Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyze business operations.
Functions of Data Warehouse Tools and Utilities
· Data Extraction − Involves gathering data from multiple heterogeneous sources.
· Data Cleaning − Involves finding and correcting the errors in data.
· Data Transformation − Involves converting the data from legacy format to warehouse format.
· Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
· Refreshing − Involves updating from data sources to warehouse.
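The extraction, cleaning, transformation and loading functions listed above can be sketched in a few lines of Java. This is only an in-memory illustration under stated assumptions: the `EtlSketch` class, the "region,amount" row format and the map standing in for the warehouse are all hypothetical, and real warehouse tools do far more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EtlSketch {
    // Extraction: gather raw "region,amount" rows from several hypothetical sources.
    static List<String> extract(List<List<String>> sources) {
        List<String> rows = new ArrayList<>();
        for (List<String> source : sources) {
            rows.addAll(source);
        }
        return rows;
    }

    // Cleaning: drop rows that are malformed or carry a non-numeric amount.
    static List<String[]> clean(List<String> rows) {
        List<String[]> cleaned = new ArrayList<>();
        for (String row : rows) {
            String[] parts = row.trim().split(",");
            if (parts.length == 2 && parts[1].trim().matches("\\d+")) {
                cleaned.add(new String[] { parts[0].trim(), parts[1].trim() });
            }
        }
        return cleaned;
    }

    // Transformation and loading: normalize region names, then summarize
    // the amounts per region into the in-memory "warehouse" table.
    static Map<String, Integer> transformAndLoad(List<String[]> rows) {
        Map<String, Integer> warehouse = new TreeMap<>();
        for (String[] row : rows) {
            String region = row[0].toUpperCase();
            warehouse.merge(region, Integer.parseInt(row[1]), Integer::sum);
        }
        return warehouse;
    }
}
```

Chaining the three methods, `transformAndLoad(clean(extract(sources)))`, gives one total per region and silently discards rows that fail the cleaning step.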
Different types of databases
The common types of databases are described below.
Relational Database
The relational database is the most common and widely used type of database. A relational database stores data in tables made up of rows and columns.
E.g.: SQLite, MySQL, PostgreSQL, Oracle DB
Operational Database
An operational database, a type that has become very popular with organizations, generally includes a customer database, an inventory database, and a personal database.
Data Warehouse
Many organizations need to keep all their important data for a long span of time. This is where the data warehouse comes into play.
E.g.: Amazon Redshift, BigQuery, Snowflake, MarkLogic, Oracle
Distributed Database
As the name suggests, distributed databases are meant for organizations that have several workplace locations and need a separate database for each location.
E.g.: PostgreSQL, Oracle
End-user Database
The end-user database is used to meet the needs of the end users of an organization.
NoSQL database
NoSQL encompasses a wide variety of different database technologies that were developed in response to the demands presented in building modern applications:
· Developers are working with applications that create massive volumes of new, rapidly changing data types: structured, semi-structured, unstructured and polymorphic data.
· Long gone is the twelve-to-eighteen month waterfall development cycle. Now small teams work in agile sprints, iterating quickly and pushing code every week or two, some even multiple times every day.
· Applications that once served a finite audience are now delivered as services that must be always-on, accessible from many different devices and scaled globally to millions of users.
· Organizations are now turning to scale-out architectures using open software technologies, commodity servers and cloud computing instead of large monolithic servers and storage infrastructure.
Relational databases were not designed to cope with the scale and agility challenges that face modern applications, nor were they built to take advantage of the commodity storage and processing power available today.
E.g.: ArangoDB, BaseX, Clusterpoint, Couchbase, CouchDB, DocumentDB, IBM Domino, MarkLogic, MongoDB, Qizx, RethinkDB
Centralized Database
A centralized database (sometimes abbreviated CDB) is a database that is located, stored, and maintained in a single location. This location is most often a central computer or database system, for example a desktop or server CPU, or a mainframe computer. In most cases, a centralized database is used by an organization (e.g. a business) or an institution (e.g. a university). Users access a centralized database through a computer network, which gives them access to the central CPU, which in turn maintains the database itself.
Cloud Databases
A cloud database is a type of database service that is built, deployed and delivered through a cloud platform. It is primarily a cloud Platform as a Service (PaaS) delivery model that allows organizations, end users and their applications to store, manage and retrieve data from the cloud.
E.g.: Amazon Web Services, SAP, EnterpriseDB, Garantia Data, Cloud SQL by Google
Data warehouse vs big data
| BASIS FOR COMPARISON | DATA WAREHOUSE | BIG DATA |
| Meaning | An architecture, not a technology. It extracts data from a variety of SQL-based data sources (mainly relational databases) and helps generate analytical reports. By definition, a data warehouse is a data repository, built by one process, that is used for analytical reports. | Mainly a technology, characterized by the volume, velocity, and variety of the data. Volume is the amount of data coming from different sources, velocity is the speed of data processing, and variety is the number of types of data (nearly all data formats are supported). |
| Preferences | An organization that wants to make informed decisions (such as understanding what is going on in the corporation, or planning next year based on the current year's performance data) prefers data warehousing, because this kind of report needs reliable data from the sources. | An organization that needs to analyze a lot of big data containing valuable information to make better decisions (such as how to increase revenue, profitability, or the number of customers) prefers the big data approach. |
| Non-volatile | Previous data is never erased when new data is added. | Previous data is never erased when new data is added. Data is stored as files that represent tables. For streaming, Hive or Spark is sometimes used directly as the operating environment. |
| Time-variant | The data collected in a data warehouse is identified with a particular time period, as it mainly holds historical data for analytical reports. | Big data has many ways to identify already loaded data; a time period is one of them. |
| Distributed file system | Processing huge volumes of data in a data warehouse is really time-consuming, and sometimes takes an entire day to complete. | This is one of the strengths of big data. HDFS is mainly designed to load huge volumes of data into distributed systems using MapReduce programs. |
SQL statements, prepared statements, callable statements
Statement
Use this for general-purpose access to your database. It is useful when you are using static SQL statements at runtime. The Statement interface cannot accept parameters.
PreparedStatement
Use this when you plan to use the SQL statements many times. The PreparedStatement interface accepts input parameters at runtime.
CallableStatement
Use this when you want to access database stored procedures. The CallableStatement interface can also accept runtime input parameters.
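The three JDBC interfaces can be compared side by side in a short sketch. The `employee` table, the `raise_salary` stored procedure and the `JdbcStatementSketch` class are assumptions made up for this example. Running the methods also needs a `Connection` from a real JDBC driver, which is not shown here.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class JdbcStatementSketch {
    // Hypothetical table and stored procedure names, used only for illustration.
    static final String COUNT_ALL = "SELECT COUNT(*) FROM employee";
    static final String FIND_NAME = "SELECT name FROM employee WHERE id = ?";
    static final String RAISE_SALARY = "{call raise_salary(?, ?)}";

    // Statement: static SQL with no parameters.
    static int countEmployees(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(COUNT_ALL)) {
            return rs.next() ? rs.getInt(1) : 0;
        }
    }

    // PreparedStatement: the same SQL text reused with different runtime parameters.
    static String findName(Connection conn, int id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(FIND_NAME)) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }

    // CallableStatement: invoke a stored procedure with runtime parameters.
    static void raiseSalary(Connection conn, int id, int percent) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(RAISE_SALARY)) {
            cs.setInt(1, id);
            cs.setInt(2, percent);
            cs.execute();
        }
    }
}
```

Note that only the prepared and callable statements contain `?` placeholders; the plain `Statement` runs its SQL text exactly as written.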
POJO in Java
POJO stands for “Plain Old Java Object”. It is a pure data structure that has fields with getters and possibly setters, and may override some methods from Object (e.g. equals) or implement an interface like Serializable, but does not have behavior of its own. It’s the Java equivalent of a C struct.
For example, this is a POJO:
class Point {
    private double x;
    private double y;
    public double getX() { return x; }
    public double getY() { return y; }
    public void setX(double v) { x = v; }
    public void setY(double v) { y = v; }
    public boolean equals(Object other) {...}
}
As soon as you start adding methods that operate on points, like vector addition or complex multiplication, you no longer have a POJO.
POJOs can have all of their methods defined automatically based on their field names and types. IDEs can do this for you, but the most elegant way is to use the annotations defined by Project Lombok:
import lombok.Data;

@Data
class Point {
    private double x;
    private double y;
}
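For comparison, here is a hand-written sketch of the Point POJO with the elided equals method filled in and a matching hashCode added. It approximates the kind of boilerplate an IDE or Lombok's @Data would produce; the exact generated code differs in its details, so treat this as one plausible version rather than the generated output.

```java
import java.util.Objects;

class Point {
    private double x;
    private double y;

    public double getX() { return x; }
    public double getY() { return y; }
    public void setX(double v) { x = v; }
    public void setY(double v) { y = v; }

    // Two points are equal when both coordinates match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return Double.compare(x, p.x) == 0 && Double.compare(y, p.y) == 0;
    }

    // hashCode must agree with equals, so it is built from the same fields.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

Overriding hashCode together with equals keeps the class safe to use as a key in hash-based collections such as HashMap and HashSet.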
Java Bean
A Java Bean is a Java class that should follow these conventions:
· It should have a no-arg constructor.
· It should be Serializable.
· It should provide methods to set and get the values of the properties, known as getter and setter methods.
According to the Java white paper, it is a reusable software component. A bean encapsulates many objects into one object, so we can access this object from multiple places. Moreover, it provides easy maintenance.
//Employee.java
package mypack;

public class Employee implements java.io.Serializable {
    private int id;
    private String name;

    public Employee() {}

    public void setId(int id) { this.id = id; }
    public int getId() { return id; }
    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
}
To access the Java Bean class, we should use its getter and setter methods.
//Test.java
package mypack;

public class Test {
    public static void main(String[] args) {
        Employee e = new Employee(); // object is created
        e.setName("Arjun"); // setting value to the object
        System.out.println(e.getName());
    }
}
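Because beans follow naming conventions, tools can discover their properties without knowing the class in advance. The sketch below checks the three conventions using plain reflection. `EmployeeBean` is a renamed copy of the Employee class so the example is self-contained, and `BeanPropertySketch` is a hypothetical helper; the JDK's own `java.beans.Introspector` offers a fuller version of this analysis.

```java
import java.io.Serializable;
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

// Renamed copy of the Employee bean, repeated here so the sketch stands alone.
class EmployeeBean implements Serializable {
    private int id;
    private String name;
    public EmployeeBean() {}
    public void setId(int id) { this.id = id; }
    public int getId() { return id; }
    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
}

class BeanPropertySketch {
    // Checks the no-arg constructor and Serializable conventions.
    static boolean looksLikeBean(Class<?> type) {
        boolean hasNoArgConstructor;
        try {
            type.getDeclaredConstructor();
            hasNoArgConstructor = true;
        } catch (NoSuchMethodException e) {
            hasNoArgConstructor = false;
        }
        return hasNoArgConstructor && Serializable.class.isAssignableFrom(type);
    }

    // Lists property names for which the class declares both getXxx() and setXxx(value).
    static Set<String> readWriteProperties(Class<?> type) {
        Set<String> getters = new TreeSet<>();
        Set<String> setters = new TreeSet<>();
        for (Method m : type.getDeclaredMethods()) {
            String n = m.getName();
            if (n.length() > 3 && n.startsWith("get") && m.getParameterCount() == 0) {
                getters.add(decapitalize(n.substring(3)));
            } else if (n.length() > 3 && n.startsWith("set") && m.getParameterCount() == 1) {
                setters.add(decapitalize(n.substring(3)));
            }
        }
        getters.retainAll(setters); // keep only names that have both accessors
        return getters;
    }

    private static String decapitalize(String s) {
        return Character.toLowerCase(s.charAt(0)) + s.substring(1);
    }
}
```

For `EmployeeBean`, the property lookup finds `id` and `name`, which is the kind of information frameworks rely on when they populate beans automatically.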
JPA
The Java Persistence API (JPA) is a Java application programming interface specification that describes the management of relational data in applications using Java Platform, Standard Edition and Java Platform, Enterprise Edition.
Persistence in this context covers three areas:
· the API itself, defined in the javax.persistence package
· the Java Persistence Query Language (JPQL)
· object/relational metadata
Differences between POJO and Java Bean
| POJO | JAVA BEAN |
| It doesn’t have special restrictions other than those forced by the Java language. | It is a special POJO which has some restrictions. |
| It doesn’t provide much control over members. | It provides complete control over members. |
| It can implement the Serializable interface. | It should implement the Serializable interface. |
| Fields can be accessed by their names. | Fields are accessed only by getters and setters. |
| Fields can have any visibility. | Fields have only private visibility. |
| There can be a no-arg constructor. | It must have a no-arg constructor. |
| It is used when you don’t want to put restrictions on your members and want to give the user complete access to your entity. | It is used when you want to give the user access to only some part of your entity. |
ORM
Object-relational mapping (ORM) is a mechanism that makes it possible to address, access and manipulate objects without having to consider how those objects relate to their data sources. ORM lets programmers maintain a consistent view of objects over time, even as the sources that deliver them, the sinks that receive them and the applications that access them change.
Based on abstraction, ORM manages the mapping details between a set of objects and underlying relational databases, XML repositories or other data sources and sinks, while simultaneously hiding the often changing details of related interfaces from developers and the code they create.
ORM hides and encapsulates change in the data source itself, so that when data sources or their APIs change, only ORM needs to change to keep up, not the applications that use ORM to insulate themselves from this kind of effort. This capacity lets developers take advantage of new classes as they become available and also makes it easy to extend ORM-based applications. In many cases, ORM changes can incorporate new technology and capability without requiring changes to the code for related applications.
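To make the mapping idea concrete, here is a toy sketch that uses reflection to turn an object into a row of column values and to build a matching INSERT statement. The `Customer` class, the `OrmSketch` helper, and the rule that table and column names mirror class and field names are all assumptions for illustration; real ORM tools add configuration, type conversion, relationships, caching and much more.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical domain class, used only for this illustration.
class Customer {
    private final int id;
    private final String name;
    Customer(int id, String name) { this.id = id; this.name = name; }
}

class OrmSketch {
    // Map each instance field to a column of the same name (sorted for stable output).
    static Map<String, Object> toRow(Object entity) {
        Map<String, Object> row = new TreeMap<>();
        try {
            for (Field f : entity.getClass().getDeclaredFields()) {
                if (f.isSynthetic() || Modifier.isStatic(f.getModifiers())) {
                    continue; // skip compiler-generated and static fields
                }
                f.setAccessible(true);
                row.put(f.getName(), f.get(entity));
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return row;
    }

    // Build a parameterized INSERT whose table is named after the class.
    static String insertSql(Object entity) {
        StringJoiner columns = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        for (String column : toRow(entity).keySet()) {
            columns.add(column);
            marks.add("?");
        }
        String table = entity.getClass().getSimpleName().toLowerCase();
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + marks + ")";
    }
}
```

The application code only ever sees `Customer` objects; if the storage layout changed, only the mapping logic in `OrmSketch` would need to change, which is the insulation described above.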
ORM tools used on different platforms
Java
· ActiveJDBC, Java implementation of Active record pattern, inspired by Ruby on Rails
· ActiveJPA, open-source Java ORM JPA-like implementation of Active record pattern
· Apache Cayenne, open-source for Java
· Apache Gora, open-source software framework that provides an in-memory data model and persistence for big data, focused on NoSQL and SQL stores
· Athena Framework, open-source Java ORM, native support for multitenancy SaaS and remoting to Adobe Flex
· Carbonado, open-source framework, backed by Berkeley DB or JDBC
· DataNucleus, open-source JDO and JPA implementation (formerly known as JPOX)
· Ebean, open-source ORM framework
· EclipseLink, Eclipse persistence platform
· Enterprise JavaBeans (EJB)
· Enterprise Objects Framework, Mac OS X/Java, part of Apple WebObjects
· Kundera, open-source framework, JPA compliant, polyglot object-datastore mapping library for NoSQL datastores
· MyBatis, free open-source, formerly named iBATIS
· QuickDB ORM, open-source ORM framework
· Speedment, an open source stream ORM
· TopLink by Oracle
· Torque, an object-relational mapper for Java
PHP
· CakePHP, ORM and framework for PHP 5, open source (scalars, arrays, objects); based on database introspection, no class extending
· CodeIgniter, framework that includes an ActiveRecord implementation
· Doctrine, open source ORM for PHP 5.2.3, 5.3.X. Free software (MIT)
· FuelPHP, ORM and framework for PHP 5.3, released under the MIT license. Based on the ActiveRecord pattern.
· Laravel, framework that contains an ORM called "Eloquent", an ActiveRecord implementation.
· Maghead, a database framework designed for PHP7 that includes ORM, Sharding, DBAL, SQL Builder tools etc. Free software, released under MIT license.
· Propel, ORM and query-toolkit for PHP 5, inspired by Apache Torque, free software, MIT
· Qcodo, ORM and framework for PHP 5, open source
· QCubed, a community driven fork of Qcodo
· Rocks, open source ORM for PHP 5.1 plus, free for non-commercial use, GPL
.NET
· Base One Foundation Component Library, free or commercial
· DatabaseObjects .NET, open source
· DataObjects.NET, commercial
· Dapper, open source
· ECO, commercial but free use for up to 12 classes
· Entity Framework, included in .NET Framework 3.5 SP1 and above
· iBATIS, free open source, maintained by ASF but now inactive.
· LINQ to SQL, included in .NET Framework 3.5
· Neo, open source but now inactive.
Python
· Django, ORM included in Django framework, open source
· SQLAlchemy, open source
· SQLObject, open source
· Storm, open source (LGPL 2.1) developed at Canonical Ltd.
· Tryton, open source
· web2py, the facilities of an ORM are handled by the DAL in web2py, open source
· Odoo, formerly known as OpenERP, an open source ERP in which an ORM is included
Information Retrieval (IR)
1. Information Retrieval is understood as a fully automatic process that responds to a user query by examining a collection of documents and returning a sorted document list that should be relevant to the user requirements as expressed in the query. Learn more in: Searching Health Information in Question-Answering Systems
2. The activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text indexing. Learn more in: Linkage Discovery with Glossaries
3. The scientific discipline that deals with the representation, organization, storage and maintenance of information objects and in particular textual objects. The representation and organization of the information items should provide the user with easy access to the relevant information and satisfy the user’s various information needs. Learn more in: Indexing and Compressing Text
4. Information retrieval is concerned with the representation of knowledge and the subsequent search for relevant information within these knowledge sources. Information retrieval provides the technology behind search engines. Learn more in: Text Mining
5. The resource or document discovery from the Web. Learn more in: The State of the Art in Web Mining
6. The activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text indexing. Learn more in: Information Retrieval by Linkage Discovery
7. Implementation of tools to assist the user in research to find important information concerning business issues. Learn more in: Information Architecture: Case Study
8. The process of narrowing down to the relevant information from various information resources; it can take the form of text retrieval, image retrieval, etc. Learn more in: An Insight Into Deep Learning Architectures
9. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Learn more in: TempClass: Implicit Temporal Queries Classifier
10. Field of information technology whose aim is to provide techniques to process queries for extracting information from a corpus. Learn more in: Semantic Measures
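Several of the definitions above mention full-text indexing and returning a sorted list of relevant documents. A minimal sketch of that idea is an inverted index that ranks documents by how many query terms they contain. The `InvertedIndexSketch` class and its scoring rule are simplifying assumptions for illustration, not how any particular search engine works.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split text into lowercase word tokens.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    // Index a document under each of its terms.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return document ids sorted by how many distinct query terms they match.
    List<Integer> search(String query) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : new TreeSet<>(List.of(tokenize(query)))) {
            for (int docId : postings.getOrDefault(term, Set.of())) {
                scores.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        // Higher score first; ties are broken by the smaller document id.
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }
}
```

Documents that match none of the query terms never enter the score map, so they are simply absent from the result list.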
References
https://en.wikipedia.org/wiki/Persistent_data
https://www.linkedin.com/pulse/what-persistent-data-why-important-c-thomas-tom-smith-iii
https://searchdatamanagement.techtarget.com/definition/data
https://www.guru99.com/introduction-to-database-sql.html
https://www.techopedia.com/definition/441/database-server
https://www.techopedia.com/definition/24361/database-management-systems-dbms
https://www.techopedia.com/definition/7199/file
https://dzone.com/articles/which-is-better-saving-files-in-database-or-in-
https://www.digitalocean.com/community/tutorials/an-introduction-to-big-data-concepts-and-terminology
https://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm
https://www.mongodb.com/nosql-explained
https://en.wikipedia.org/wiki/Centralized_database
https://www.simplilearn.com/cloud-databases-across-the-globe-article
https://www.educba.com/big-data-vs-data-warehouse/
https://www.quora.com/What-is-POJO-in-Java
https://www.javatpoint.com/java-bean
https://en.wikipedia.org/wiki/Java_Persistence_API
https://www.geeksforgeeks.org/pojo-vs-java-beans/
https://searchwindevelopment.techtarget.com/definition/object-relational-mapping
https://en.wikipedia.org/wiki/List_of_object-relational_mapping_software
https://www.igi-global.com/dictionary/searching-health-information-question-answering/14470