Price/Performance of HANA, Exadata, Teradata, and Greenplum
HANA has the lowest latency because it is in-memory. When there are two nodes, a penalty is paid for crossing the network, which makes sense.
Exadata does well because the X3 product has SSD cache and I assumed an 80% hit ratio.
Teradata does a little worse because I assumed a lower hit ratio (they have less SSD per TB of data).
Greenplum does worse as they do all I/O against disks.
Teradata Complex CASE_N MLPPI
In addition to the RANGE_N partitioning mechanism, CASE_N is a good choice for distributing data with complex splitting criteria. It is much harder to express the following sample situation effectively with ORACLE's range-based partitioning syntax.
- FACT_WEB_CLICKS is a huge click-stream table
- 40% of clicks are generated by non-members or unsigned-in sessions
- 50% of the non-member clicks are generated by robots and crawlers
- 40% of the non-member clicks are associated with the home page (PAGE_ID = 12345)
- 60% of clicks are generated by members or signed-in sessions
- 40% of the member clicks go to the 15 most popular PAGE_IDs
- 50% of the traffic has TRACKING_ID populated, the other 50% does not
- 2% of the member clicks are generated by robots or crawlers
- For non-member clicks, negative random MEMBER_IDs are generated to avoid PI skew
- Most analytical queries only care about the clicks with MEMBER_ID > 0
- PAGE_ID is highly skewed, because the popular pages have a lot more clicks than the long tail
- Some analytical queries only care about the clicks with TRACKING_ID > 0
- Daily partitioning at the first level. This is commented out in the sample DDL so that we can map each partition's row count to a "PARTITION ID" later
DDL1: Catch-All Bucket without Explicit Criteria
CREATE MULTISET TABLE FACT_WEB_CLICKS ,NO FALLBACK ,
     NO BEFORE JOURNAL, NO AFTER JOURNAL,
     FREESPACE = 0 PERCENT, CHECKSUM = DEFAULT, DEFAULT MERGEBLOCKRATIO
     (
      WEB_CLICK_ID BIGINT NOT NULL,
      CLICK_DATE DATE FORMAT 'YYYY-MM-DD',
      CLICK_TIME TIMESTAMP(3),
      MEMBER_ID BIGINT NOT NULL,
      PAGE_ID INTEGER NOT NULL,
      SESSION_ID BIGINT NOT NULL,
      TRACKING_ID INTEGER NOT NULL,
      CLICK_SEQ INTEGER,
      LOCALE_ID INTEGER NOT NULL,
      TOTAL_TIME INTEGER,
      PAGE_VIEW_SEQ INTEGER,
      IS_CRAWLER CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC DEFAULT 'N'
     )
PRIMARY INDEX ( MEMBER_ID )
PARTITION BY (
  --RANGE_N(CLICK_DATE BETWEEN DATE '2012-07-01' AND DATE + 10 EACH INTERVAL '1' DAY ),
  CASE_N(
    MEMBER_ID <= 0 AND IS_CRAWLER = 'Y',
    MEMBER_ID <= 0 AND IS_CRAWLER = 'N' AND PAGE_ID = 12345,
    MEMBER_ID <= 0 AND IS_CRAWLER = 'N',            -- catch_all_bucket for other PAGE_ID(s)
    MEMBER_ID > 0 AND PAGE_ID IN (12345, 12355),
    MEMBER_ID > 0 AND PAGE_ID IN (20001, 20010, 20011),
    MEMBER_ID > 0 AND PAGE_ID IN (70001, 70008),
    MEMBER_ID > 0 AND PAGE_ID IN (80001, 80002),
    MEMBER_ID > 0 AND PAGE_ID IN (90001, 90002),
    MEMBER_ID > 0 AND PAGE_ID IN (60001, 60002),
    MEMBER_ID > 0 AND PAGE_ID IN (50001, 50002),
    MEMBER_ID > 0                                   -- catch_all_bucket for other PAGE_ID(s)
  ),
  CASE_N(TRACKING_ID <= 0, TRACKING_ID > 0)
);
Data Distribution Among Each Partition
select Partition, count(*) from FACT_WEB_CLICKS group by 1 order by 1;
The data are distributed across the partitions as expected. But the query optimizer cannot yet take full advantage of partition elimination.
Explain Plan
collect stats FACT_WEB_CLICKS column(partition);
collect stats FACT_WEB_CLICKS index(member_id);
collect stats FACT_WEB_CLICKS column(page_id);
collect stats FACT_WEB_CLICKS column(partition,member_id);

DIAGNOSTIC VERBOSEEXPLAIN ON FOR SESSION;
DIAGNOSTIC HELPSTATS ON FOR SESSION;

explain select * from FACT_WEB_CLICKS
where member_id > 0
  and tracking_id > 0
  and page_id in (70001)
  and click_date = date '2013-02-05';

Explanation
-----------------------------------------------------------------------
  1) First, we lock a distinct fact_web."pseudo table" for read on a
     RowHash to prevent global deadlock for fact_web.FACT_WEB_CLICKS.
  2) Next, we lock fact_web.FACT_WEB_CLICKS for read.
  3) We do an all-AMPs RETRIEVE step from 2 partitions of
     fact_web.FACT_WEB_CLICKS with a condition of (
     "(fact_web.FACT_WEB_CLICKS.CLICK_DATE = DATE '2013-02-05') AND
     ((fact_web.FACT_WEB_CLICKS.PAGE_ID = 70001) AND
     ((fact_web.FACT_WEB_CLICKS.MEMBER_ID > 0) AND
     (fact_web.FACT_WEB_CLICKS.TRACKING_ID > 0)))") into Spool 1
     (group_amps), which is built locally on the AMPs with Field1
     ("UniqueId"). The size of Spool 1 is estimated with no confidence
     to be 116,833,856 rows (14,253,730,432 bytes). Spool Asgnlist:
With this precise filter clause, the explain plan should hit only one partition (partition 12 in this case), but the actual plan also scans the catch-all partition 22. There should be a way to improve the partition elimination.
DDL2: Catch-All Bucket with Explicit Criteria
CREATE MULTISET TABLE FACT_WEB_CLICKS2 ,NO FALLBACK ,
     NO BEFORE JOURNAL, NO AFTER JOURNAL,
     FREESPACE = 0 PERCENT, CHECKSUM = DEFAULT, DEFAULT MERGEBLOCKRATIO
     (
      WEB_CLICK_ID BIGINT NOT NULL,
      CLICK_DATE DATE FORMAT 'YYYY-MM-DD',
      CLICK_TIME TIMESTAMP(3),
      MEMBER_ID BIGINT NOT NULL,
      PAGE_ID INTEGER NOT NULL,
      SESSION_ID BIGINT NOT NULL,
      TRACKING_ID INTEGER NOT NULL,
      CLICK_SEQ INTEGER,
      LOCALE_ID INTEGER NOT NULL,
      TOTAL_TIME INTEGER,
      PAGE_VIEW_SEQ INTEGER,
      IS_CRAWLER CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC DEFAULT 'N'
     )
PRIMARY INDEX ( MEMBER_ID )
PARTITION BY (
  --RANGE_N(CLICK_DATE BETWEEN DATE '2012-07-01' AND DATE + 10 EACH INTERVAL '1' DAY ),
  CASE_N(
    MEMBER_ID <= 0 AND IS_CRAWLER = 'Y',
    MEMBER_ID <= 0 AND IS_CRAWLER = 'N' AND PAGE_ID = 12345,
    MEMBER_ID <= 0 AND IS_CRAWLER = 'N' AND PAGE_ID <> 12345,  -- catch_all_bucket for other PAGE_ID(s)
    MEMBER_ID > 0 AND PAGE_ID IN (12345, 12355),
    MEMBER_ID > 0 AND PAGE_ID IN (20001, 20010, 20011),
    MEMBER_ID > 0 AND PAGE_ID IN (70001, 70008),
    MEMBER_ID > 0 AND PAGE_ID IN (80001, 80002),
    MEMBER_ID > 0 AND PAGE_ID IN (90001, 90002),
    MEMBER_ID > 0 AND PAGE_ID IN (60001, 60002),
    MEMBER_ID > 0 AND PAGE_ID IN (50001, 50002),
    MEMBER_ID > 0 AND PAGE_ID not in (12345,12355,20001,......,50002)  -- catch_all_bucket for other PAGE_ID(s)
  ),
  CASE_N(TRACKING_ID <= 0, TRACKING_ID > 0)
);

explain select * from FACT_WEB_CLICKS2
where member_id > 0
  and tracking_id > 0
  and page_id in (12345)
  and click_date = date '2013-02-05';

Explanation
--------------------------------------------------------------------------
  1) First, we lock a distinct fact_web."pseudo table" for read on a
     RowHash to prevent global deadlock for SCRATCH_DS.FACT_WEB_CLICKS2.
  2) Next, we lock fact_web.FACT_WEB_CLICKS2 for read.
  3) We do an all-AMPs RETRIEVE step from a single partition of
     fact_web.FACT_WEB_CLICKS2 with a condition of (
     "(fact_web.FACT_WEB_CLICKS2.PAGE_ID = 12345) AND ......
Now the query hits a single partition. So even though the first DDL achieves exactly the same data distribution across partitions as the second, the following explicit criteria for the catch-all buckets are what help the optimizer choose the optimal partition elimination:
- PAGE_ID <> 12345
- PAGE_ID not in (12345,12355,20001,……,50002)
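Teradata evaluates CASE_N conditions in order and assigns a row to the first one that matches, which is why the bare `MEMBER_ID > 0` catch-all in DDL1 stores the right rows yet still forces the optimizer to scan its partition: nothing in the DDL proves the catch-all excludes a given PAGE_ID. A rough sketch of that first-match assignment, with an abbreviated bucket list and illustrative names (not Teradata internals):

```python
# Sketch of CASE_N first-match semantics: a row lands in the first bucket
# whose predicate is true. Bucket predicates mirror an abbreviated form of
# the sample DDL; the bucket names are made up for illustration.

BUCKETS = [
    ("crawler",        lambda r: r["member_id"] <= 0 and r["is_crawler"] == "Y"),
    ("guest_home",     lambda r: r["member_id"] <= 0 and r["page_id"] == 12345),
    ("guest_other",    lambda r: r["member_id"] <= 0),
    ("member_popular", lambda r: r["member_id"] > 0 and r["page_id"] in (70001, 70008)),
    ("member_other",   lambda r: r["member_id"] > 0),  # bare catch-all, as in DDL1
]

def assign_bucket(row):
    for name, pred in BUCKETS:
        if pred(row):   # first matching condition wins
            return name
    return None

row = {"member_id": 42, "page_id": 70001, "is_crawler": "N"}
print(assign_bucket(row))  # member_popular

# Insert-time ordering puts the row in "member_popular", but at query time
# the optimizer must scan every bucket whose predicate *could* hold for
# page_id = 70001; with the bare catch-all it cannot rule out "member_other",
# so two partitions get scanned. Writing the catch-all as an explicit
# "page_id not in (...)" lets it prove the exclusion.
```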
Latin, UTF8 and UTF16 with Teradata
Recently I did some data movement between Teradata and HDFS, and felt it might be worth posting some insights about how to deal with Unicode in Teradata. Folks who only deal with ASCII strings can skip this whole post 🙂
[Error 3577] [SQLState HY000] Row size or Sort Key size overflow.
When it comes to the topic of Unicode in Teradata, the above error is an inevitable subject. Your DBA might tell you that your row is bigger than 64K bytes. But that's normally not the case. If the table contains quite a few VARCHAR(n)/CHAR(n) columns, here is how Teradata estimates the column size, regardless of whether CHARACTER SET Latin or Unicode is specified in the table DDL:
| Session CHARSET | Estimated Column Size in Bytes for VARCHAR(n)/CHAR(n) |
|---|---|
| Latin / ASCII | n x 1 |
| UTF8 | n x 3 |
| UTF16 | n x 2 |
Teradata then adds the other columns' byte sizes to determine the row size used in the BTEQ/JDBC/ODBC/TPT/FEXP parcels. Even if the row is actually stored on disk properly within the 64K limit, you still can't retrieve it if the estimated row size exceeds 64K. This is quite annoying: you can store the data, but you can't access it smoothly.
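A back-of-the-envelope sketch of that estimate, using the multipliers from the table above. The sample column widths and the use of a plain 64K cutoff are illustrative assumptions:

```python
# Estimate the parcel row size for a row of VARCHAR(n) columns under each
# session charset, using the multipliers from the table above.
# The sample widths are hypothetical; the real parcel overhead is ignored.

MULTIPLIER = {"ASCII": 1, "UTF8": 3, "UTF16": 2}
ROW_LIMIT = 64 * 1024  # 64K parcel limit

def estimated_row_bytes(varchar_widths, charset):
    return sum(n * MULTIPLIER[charset] for n in varchar_widths)

widths = [8000, 8000, 8000]  # three VARCHAR(8000) columns, for illustration
for cs in ("ASCII", "UTF8", "UTF16"):
    size = estimated_row_bytes(widths, cs)
    print(cs, size, "overflow" if size > ROW_LIMIT else "ok")
# In UTF8 mode the estimate is 72,000 bytes and overflows;
# in UTF16 it is 48,000 and fits, even though the stored data is identical.
```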
Your DBA might advise you to shrink the n in VARCHAR(n), or even to split the table vertically into two. But try the following workarounds before you restructure your table.
Set Session CharSet to ASCII
This will probably resolve the row size overflow error immediately. But all your French/Spanish/Chinese/Japanese/Arabic characters will show up as "?". This is not good!
Set Session CharSet to UTF16
Normally, people assume UTF16 requires more space both on disk and over the network, so UTF8 is always the best choice. But in Teradata's case:
- Unicode strings are stored as UTF16 on disk anyway. Yes, space is wasted; that's why version 13 added an algorithmic compression function just to compress UTF16 to UTF8. Only version 14 can store UTF8 on disk.
- The row size estimate is 33% smaller in UTF16 mode than in UTF8 mode for VARCHAR/CHAR/CLOB columns (n x 2 versus n x 3).
- UTF16 mode requires more bytes over the network than UTF8.
- UTF16 mode occupies more disk space if you export the query output to a file (binary or vartext).
- But when UTF8 mode bombs, UTF16 may let your queries go through and display the proper characters in your SQL client or BI client.
Therefore, UTF16 might be a better default choice for the following clients:
- ODBC for SQL Assistant, Cognos, MicroStrategy, etc.
- JDBC for AquaData Studio, SQuirreL SQL Client, etc.
- CLIv2 for BTEQ -c utf16 -e utf8
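The wire-size point in the list above is easy to verify with actual encoded lengths. A quick check (the sample string is arbitrary):

```python
# Actual encoded byte counts for a mostly-Latin string: UTF-16 costs more
# on the wire, even though Teradata's row-size *estimate* (n*2 vs n*3) is
# smaller in UTF16 mode.

s = "web click données"  # mostly ASCII with one accented word
utf8_bytes = len(s.encode("utf-8"))
utf16_bytes = len(s.encode("utf-16-le"))  # 2 bytes per BMP character, no BOM

print(utf8_bytes, utf16_bytes)  # 18 34
```

For Latin-heavy data the UTF-8 stream is close to 1 byte per character, while UTF-16 is a flat 2 bytes per BMP character, which is why UTF16 sessions move more data over the network and produce larger export files.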
For ETL tools, UTF16 provides better tolerance than UTF8 when non-BMP characters are encountered.
- When the ETL writer (DML) session uses Teradata ODBC UTF8, the session rejects records containing extended characters (outside the Basic Multilingual Plane). The error message looks like "Failure 6705 An illegally formed character string was encountered during translation".
- The same extended characters are converted to "�" under Teradata TPT UTF8, and the records can be written into the database.
- Informatica's implementation for Teradata ODBC has problems in UTF16 mode, so be very cautious.
BTEQ Example
Let's look at a few Unicode characters; some are BMP, some are non-BMP. Stored in UTF8 mode as a Linux text file, some of them (lines 3, 4, 7) take 3 bytes each, while the others (lines 1, 2, 5, 6) take 4 bytes each.
DDL:

CREATE SET TABLE unicode_test
( ustring VARCHAR(40) CHARACTER SET UNICODE NOT CASESPECIFIC )
NO PRIMARY INDEX;

$ cat insert8.sql
insert into unicode_test values('1 [ 🐱 ]');
insert into unicode_test values('2 [ 🕓 ]');
insert into unicode_test values('3 [ ℃ ]');
insert into unicode_test values('4 [ ↳ ]');
insert into unicode_test values('5 [ 🏠 ]');
insert into unicode_test values('6 [ 🐎 ]');
insert into unicode_test values('7 [ 赞 ]');

$ bteq -c utf8 -e utf8
# only line 3,4,7 go through; line 1,2,5,6 are rejected
# *** Failure 6705 An illegally formed character string was encountered during translation.

#### UTF16 mode will load the data, but display them as diamond-shape ? ####

$ bteq -c utf16 -e utf8
BTEQ -- Enter your SQL request or BTEQ command:
run file insert8.sql;
select * from unicode_test order by 1;
 *** Query completed. 7 rows found. One column returned.
 *** Total elapsed time was 1 second.

ustring      Char2HexInt(ustring)
-----------  ----------------------------------
1 [ �� ]     00310020005B0020FFFDFFFD0020005D   -- f0 9f 90 b1 (UTF8)
2 [ �� ]     00320020005B0020FFFDFFFD0020005D   -- f0 9f 95 93 (UTF8)
3 [ ℃ ]     00330020005B0020210300200020005D   -- e2 84 83 (UTF8)
4 [ ↳ ]      00340020005B002021B300200020005D   -- e2 86 b3 (UTF8)
5 [ �� ]     00350020005B0020FFFDFFFD0020005D   -- f0 9f 8f a0 (UTF8)
6 [ �� ]     00360020005B0020FFFDFFFD0020005D   -- f0 9f 90 8e (UTF8)
7 [ 赞 ]     00370020005B00208D5E00200020005D   -- e8 b5 9e (UTF8)

###################### without specifying any character set #################

$ bteq
BTEQ -- Enter your SQL request or BTEQ command:
 *** Logon successfully completed.
 *** Teradata Database Release is 13.10.05.03
 *** Teradata Database Version is 13.10.05.03a
 *** Transaction Semantics are BTET.
 *** Session Character Set Name is 'ASCII'.
run file insert8.sql;
select ustring, substr(char2hexint(ustring),1,40), octet_length(ustring)
from scratch_ds.unicode_test order by 1;
 *** Query completed. 7 rows found. 2 columns returned.
 *** Total elapsed time was 1 second.

ustring      Substr(Char2HexInt(ustring),1,40)         Octet_Length(ustring)
-----------  ----------------------------------------  ---------------------
1 [ 🐱 ]     00310020005B002000D0009F009000B10020005D                     10
2 [ 🕓 ]     00320020005B002000D0009F009500930020005D                     10
3 [ ℃ ]     00330020005B002000E200D700B800200020005D                     10
4 [ ↳ ]      00340020005B002000E2008600B300200020005D                     10
5 [ 🏠 ]     00350020005B002000D0009F008F01610020005D                     10
6 [ 🐎 ]     00360020005B002000D0009F0090008E0020005D                     10
7 [ 赞 ]     00370020005B002000E800B5009E00200020005D                     10

#### But if we query the same result from a JDBC or ODBC client, the display is totally messed up ####

ustring      Char2HexInt(ustring)
-----------  ------------------------------------------
1 [ б ]      00310020005B002000D0009F009000B10020005D
2 [ Ð ]      00320020005B002000D0009F009500930020005D
3 [ â׸ ]     00330020005B002000E200D700B800200020005D
4 [ â³ ]     00340020005B002000E2008600B300200020005D
5 [ К ]      00350020005B002000D0009F008F01610020005D
6 [ Ð ]      00360020005B002000D0009F0090008E0020005D
7 [ èµ ]     00370020005B002000E800B5009E00200020005D
As you can see from the sample above, Teradata's UTF16 mode still can't store the 4-byte characters beyond the BMP, but at least the ETL process goes through. ORACLE, by contrast, handles these sample characters flawlessly.
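The 3-byte vs 4-byte split in the sample maps directly to Unicode planes: code points up to U+FFFF (the BMP) need at most 3 bytes in UTF-8, while anything above needs 4. A quick check on two of the sample characters:

```python
# BMP characters need at most 3 bytes in UTF-8; non-BMP (astral) characters need 4.
for ch in ("\u2103", "\U0001F431"):  # ℃ (BMP, line 3) and 🐱 (non-BMP, line 1)
    cp = ord(ch)
    nbytes = len(ch.encode("utf-8"))
    print(f"U+{cp:04X} -> {nbytes} UTF-8 bytes, non-BMP: {cp > 0xFFFF}")
# U+2103 -> 3 UTF-8 bytes, non-BMP: False
# U+1F431 -> 4 UTF-8 bytes, non-BMP: True
```

This also explains the FFFD pairs in the hex dump: in UTF-16 a non-BMP character needs a surrogate pair (two 16-bit units), and Teradata replaced each half with the U+FFFD replacement character.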
Non-BMP Character in HDFS
Those 4-byte non-BMP characters also cause trouble on HDFS: they can be stored, but can't be decoded properly at retrieval time. So it is a good idea to filter them out at the input GUI.
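A minimal sketch of such an input-side filter; the choice to replace with U+FFFD rather than drop the character outright is an assumption:

```python
# Replace characters outside the Basic Multilingual Plane before load.
# Substituting U+FFFD is one option; dropping them entirely also works.

def filter_non_bmp(text, replacement="\ufffd"):
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)

cleaned = filter_non_bmp("5 [ \U0001F3E0 ]")  # the house glyph becomes U+FFFD
print(cleaned)
assert all(ord(c) <= 0xFFFF for c in cleaned)
```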
Column Size Calculation in TPT/BTEQ/FEXP /FASTLOAD
When we export/import data to/from files (binary or flat), the following two data types need special calculation to define the file schema:
- VARCHAR(n)/CHAR(n):
- ASCII mode: VARCHAR/CHAR(n)
- UTF8 mode: VARCHAR/CHAR(n*3)
- UTF16 mode: VARCHAR/CHAR(n*2)
- DECIMAL(m,n):
- MaxDecimalDigits = 15: any column with precision > 15 will have to be defined as DECIMAL(15,x) in the file schema
- MaxDecimalDigits = 18: any column with precision > 18 will have to be defined as DECIMAL(18,x) in the file schema
- MaxDecimalDigits = 38: this is not the default setting, so we have to specify it in the script
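Putting the two rules together, deriving the file schema types from the table's logical types can be sketched as follows. The tuple-based input format and the helper name are made up for illustration, not a Teradata API:

```python
# Derive export-file column types from logical column types, applying the
# session-charset multiplier for VARCHAR/CHAR and the MaxDecimalDigits cap
# for DECIMAL. Input columns are ("VARCHAR", n) or ("DECIMAL", m, n) tuples.

CHARSET_MULTIPLIER = {"ASCII": 1, "UTF8": 3, "UTF16": 2}

def file_schema_type(col, charset="UTF8", max_decimal_digits=18):
    kind = col[0]
    if kind == "VARCHAR":
        n = col[1]
        return f"VARCHAR({n * CHARSET_MULTIPLIER[charset]})"
    if kind == "DECIMAL":
        m, n = col[1], col[2]
        # precision beyond MaxDecimalDigits must be capped in the file schema
        return f"DECIMAL({min(m, max_decimal_digits)},{n})"
    return kind  # pass other types through unchanged

print(file_schema_type(("VARCHAR", 40), charset="UTF8"))            # VARCHAR(120)
print(file_schema_type(("DECIMAL", 38, 2)))                         # DECIMAL(18,2)
print(file_schema_type(("DECIMAL", 38, 2), max_decimal_digits=38))  # DECIMAL(38,2)
```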
Open Issue
If we use BTEQ or FASTLOAD to import a flat file (or binary data) in ASCII mode without specifying -c UTF8 or -c UTF16, the utility does not conduct any character validation, so the non-BMP and symbol characters are imported as is, but the strings cannot be queried or displayed in a JDBC or ODBC client.
With 14.00 and 14.10 available for upgrade, Teradata advises customers to set AcceptReplacementCharacters = FALSE, but NTA2805 states that 14.xx still does not accept as many Unicode characters as customers expect. So it seems AcceptReplacementCharacters = TRUE will still be required for a while.
Other Links
http://teradata.weizheng.net/2012/08/character-sets-in-teradata.html
http://hellohemanth.wordpress.com/2011/11/16/special-characters-issue-while-loading-from-oracle-and-sql-server-charcter-data/
Simple PERL script to remove the extra new line character from delimited flat files
When you use BTEQ, SQL*Plus, or Informatica to generate a delimited flat file from an RDBMS, the new line characters embedded within string/varchar columns are quite annoying. You can add a REPLACE() function around each string column in your SQL to remove these offending \n, but that's still painful.
Let's pipe the output of the SQL into the following PERL script, and it will try its best to reconstruct each record by removing the extra new line characters. The assumption is that either the 1st row is in good shape (it contains no extra new line character), or the 1st row is the column header.
#!/usr/bin/perl -w
# Reconstruct delimited records that were split by embedded newlines.
# Reads a file name from @ARGV, or STDIN when no file is given.

my $file_name = shift @ARGV;
my ($line_count, $line_fixed, $expected_delimiter_count, $count, $n, $del) = (0, 0, 0, 0, 0, '');
my ($delimiter, $next_line);    # declared here; the original left these implicit
my @possible_delimiter = (',', "\t", ';', '\|', '\cA', '\cZ');

# open STDIN or file
if (defined $file_name) {
    open INPUTPIPE, "<", $file_name or die $!;
} else {
    open INPUTPIPE, "<&STDIN";
}

# count delimiters in the 1st line
my $line_buffer = <INPUTPIPE>;
$n++;
print $line_buffer;
$line_count++;                  # the 1st line must be in good shape
foreach $del (@possible_delimiter) {
    chomp($line_buffer);
    $count = () = ($line_buffer =~ /$del/g);
    print STDERR $count . " ($del) delimiters found in 1st row.\n";
    if ($count > $expected_delimiter_count) {
        $expected_delimiter_count = $count;
        $delimiter = $del;
    }
}
die "Can't detect delimiter!\n" if $expected_delimiter_count <= 0;
print STDERR $expected_delimiter_count . " ($delimiter) delimiters expected.\n\n";

# read through the rest
while ($line_buffer = <INPUTPIPE>) {
    $n++;
    $count = () = ($line_buffer =~ /$delimiter/g);
    if ($count >= $expected_delimiter_count) {
        print $line_buffer;
        $line_count++;
        print STDERR "$n : $count : output \n";
        next;
    } else {
        # glue following lines back on until the expected delimiter count is reached
        while ($next_line = <INPUTPIPE>) {
            $n++;
            chomp($line_buffer);
            $line_buffer .= $next_line;
            $count = () = ($line_buffer =~ /$delimiter/g);
            # print STDERR "$n : $count : $line_buffer";
            if ($count >= $expected_delimiter_count) {
                print $line_buffer;
                $line_count++;
                $line_fixed++;
                last;
            }
        }
    }
}

# finish up
print STDERR $line_count . " rows output into STDOUT and " . $line_fixed . " rows fixed.\n";
close INPUTPIPE;
Informatica Repository Query (part 2)
Posted by ES in Informatica on 2010/01/09
Continuing with a few more queries for the Informatica PowerCenter Repository. Part 1 can be found at http://it.toolbox.com/blogs/golden-orbit-blog/dig-into-informatica-powercenter-repository-29027
How to find Teradata MLoad session against a particular table
select U.SUBJ_NAME,
       F.TASK_NAME     WORKFLOW_NAME,
       T.INSTANCE_NAME SESSION_NAME,
       S.INSTANCE_NAME TARGET_NAME,
       F.SUBJECT_ID,
       F.TASK_ID       WORKFLOW_ID,
       s.SESSION_ID,
       s.MAPPING_ID
from OPB_TASK_INST T,
     OPB_TASK F,
     OPB_SUBJECT U,
     ( select session_id, sess_widg_inst_id, version_number, instance_name, mapping_id,
              rank() over (partition by session_id, sess_widg_inst_id
                           order by version_number desc) sort_id
       from OPB_SWIDGET_INST
       where widget_type = 2
         and UPPER(INSTANCE_NAME) LIKE '%MY_TABLE_NAME%'
     ) S
where (s.session_id, s.sess_widg_inst_id, s.version_number) in
      ( select session_id, sess_widg_inst_id, max(VERSION_NUMBER) version_number
        from OPB_EXTN_ATTR e
        where object_type = 79
          and object_subtype = 315000     -- MLoad Extension
        group by SESSION_ID, sess_widg_inst_id
      )
  and s.SORT_ID = 1                       -- only the current version
  and s.SESSION_ID = T.task_id
  and s.VERSION_NUMBER = T.VERSION_NUMBER
  and T.WORKFLOW_ID = F.TASK_ID
  and T.VERSION_NUMBER = F.VERSION_NUMBER
  and F.subject_id = U.SUBJ_ID
order by 1,2,3;

-- It will be nice to see the workflow name in View Point
update OPB_EXTN_ATTR
set ATTR_VALUE = 'SubjectArea=$PMFolderName; Workflow=$PMWorkflowName; Session=$PMSessionName; WorkflowRunId=$PMWorkflowRunId;'
where attr_id = 26                        -- Query Band Expression
  and object_type = 79                    -- Target
  and OBJECT_SUBTYPE = 315000             -- Teradata Plug-in
  and (SESSION_ID, SESS_WIDG_INST_ID, VERSION_NUMBER) in
      ( select session_id, sess_widg_inst_id, max(VERSION_NUMBER) version_number
        from OPB_EXTN_ATTR e
        where object_type = 79
          and object_subtype = 315000
        group by SESSION_ID, sess_widg_inst_id
      );

update OPB_EXTN_ATTR
set ATTR_VALUE = '1'
where attr_id = 10                        -- Check "Drop Log/Error Tables" option
  and object_type = 79                    -- Target
  and OBJECT_SUBTYPE = 315000             -- Teradata Plug-in
  and ATTR_VALUE = '0';                   -- Option is not set yet
How to view all the expressions in a transformation
select wf.WIDGET_ID, wf.FIELD_ID, wf.FIELD_NAME,
       e.EXPR_ID, e.LINE_NO, e.EXPRESSION
from OPB_EXPRESSION e,
     OPB_WIDGET_FIELD wf,
     OPB_WIDGET_EXPR we
where wf.WIDGET_ID = e.WIDGET_ID
  and wf.WIDGET_ID = we.WIDGET_ID
  and wf.FIELD_ID = we.OUTPUT_FIELD_ID
  and we.EXPR_ID = e.EXPR_ID
  and wf.VERSION_NUMBER = we.VERSION_NUMBER
  and we.VERSION_NUMBER = e.VERSION_NUMBER
  and (wf.WIDGET_ID, wf.VERSION_NUMBER) =
      ( select w.widget_id, max(w.version_number) version_number
          --, w.widget_type, m.mapping_id
        from OPB_SUBJECT s, OPB_MAPPING m, OPB_WIDGET w, OPB_WIDGET_INST wi
        where m.subject_id = s.subj_id
          and m.mapping_id = wi.mapping_id
          and w.widget_id = wi.WIDGET_ID
          and w.subject_id = m.subject_id
          and s.SUBJ_NAME = 'folder_name1'
          and m.MAPPING_NAME = 'm_mapping_name1'
          and w.WIDGET_NAME = 'transformation_name1'
        group by w.widget_id
      );
Make sure “Insert Else Update” and “Update Else Insert” are checked for Dynamic Lookup
update OPB_WIDGET_ATTR
set ATTR_VALUE = '1'
where (WIDGET_ID, WIDGET_TYPE, VERSION_NUMBER) in
      ( select w.WIDGET_ID, w.WIDGET_TYPE, max(w.VERSION_NUMBER) VERSION_NUMBER
        from opb_widget w, opb_widget_attr wa
        where wa.ATTR_ID = 15
          and wa.ATTR_VALUE = '1'         -- Dynamic Lookup Cache
          and w.WIDGET_ID = wa.WIDGET_ID
          and w.WIDGET_TYPE = wa.WIDGET_TYPE
          and w.VERSION_NUMBER = wa.VERSION_NUMBER
          and w.WIDGET_TYPE = 11          -- Lookup
        group by w.WIDGET_ID, w.WIDGET_TYPE
      )
  and ATTR_ID in (18,19)                  -- Insert Else Update, Update Else Insert
  and LINE_NO = 1
  and ATTR_VALUE = '0';

select w.WIDGET_NAME, w.VERSION_ID, w.WIDGET_ID,
       wa.ATTR_ID, a.ATTR_NAME, wa.LINE_NO, wa.ATTR_VALUE
from opb_widget w, opb_widget_attr wa, opb_attr a
where w.WIDGET_NAME like ???              -- put your filter here
  and w.WIDGET_ID = wa.WIDGET_ID
  and w.WIDGET_TYPE = wa.WIDGET_TYPE
  and w.VERSION_NUMBER = wa.VERSION_NUMBER
  and wa.ATTR_ID = a.attr_id
  and wa.WIDGET_TYPE = a.OBJECT_TYPE_ID
order by w.WIDGET_NAME, w.VERSION_NUMBER desc, wa.ATTR_ID, wa.LINE_NO;